scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Scrapy Spider Fails to Process All URLs from CSV on Large URL Sets #6320

Closed mjid13 closed 2 weeks ago

mjid13 commented 2 weeks ago

I am using Scrapy to scrape a large number of URLs read from a CSV file. However, I've noticed that not all URLs are being processed, particularly when the list is large. For example, if I include 200 URLs in the CSV, I only get results for about 150; with 100 URLs it returns even fewer results, approximately 60.

import pandas as pd
import scrapy

class MySpider(scrapy.Spider):
    name = 'my_spider'
    file_path = 'some_csv.csv'
    # Read the URL list once, at class definition time.
    df = pd.read_csv(file_path)
    start_urls = df["website"].tolist()

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(u,
                                 errback=self.errback_httpbin,
                                 dont_filter=True
                                 )

    def parse(self, response):
        yield {
            'url': response.url,
            'body': response.body  # raw bytes of the page
        }

    def errback_httpbin(self, failure):
        # Log any request that fails, so dropped URLs show up in the logs.
        self.logger.error(repr(failure))

Scrapy Settings: I'm using the default settings for Scrapy.
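In case it matters, here is a minimal sketch of how per-spider overrides could look; the names below are standard Scrapy settings, but the values are only illustrative assumptions, not recommendations:

class MySpider(scrapy.Spider):
    name = 'my_spider'
    # Hypothetical overrides; the documented defaults are RETRY_TIMES=2,
    # DOWNLOAD_TIMEOUT=180, and CONCURRENT_REQUESTS=16.
    custom_settings = {
        'RETRY_TIMES': 5,          # requests that exhaust retries end up in the errback
        'DOWNLOAD_TIMEOUT': 300,   # seconds before a download is considered failed
        'CONCURRENT_REQUESTS': 8,  # fewer simultaneous requests eases load on one machine
    }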

Logs/Error Messages: The logs don't show any specific errors related to the dropped requests, but I can see that not all requests are logged as processed.
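To make that gap visible, here is a small sketch of a check that could be added to the spider; the stat keys used are standard Scrapy stats, though the exact keys present depend on the enabled middlewares:

    def closed(self, reason):
        # Called automatically when the spider finishes; compare what was
        # sent against what actually came back.
        stats = self.crawler.stats.get_stats()
        self.logger.info('Requests sent: %s', stats.get('downloader/request_count'))
        self.logger.info('Responses received: %s', stats.get('response_received_count'))
        self.logger.info('Errors logged: %s', stats.get('log_count/ERROR'))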

Environment:

Python version: 3.11.0

Scrapy version: 2.11.1

Operating system: Windows 11

Question: What could be causing Scrapy to skip some URLs, and how can I ensure that every URL from the CSV is processed? Could there be an issue with how Scrapy handles large sets of URLs, or with how the requests are being managed?

I've checked to ensure the CSV does not contain invalid or duplicate URLs. I've also monitored the logs to see if there's any pattern to the URLs that aren't being processed, but nothing stands out.
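For completeness, this is roughly the kind of pre-flight check I mean; filter_valid_urls is a hypothetical helper, assuming the CSV should contain absolute http(s) URLs:

from urllib.parse import urlparse

def filter_valid_urls(urls):
    seen = set()
    for u in urls:
        u = str(u).strip()
        parsed = urlparse(u)
        # Keep only absolute http(s) URLs and skip duplicates.
        if parsed.scheme in ('http', 'https') and parsed.netloc and u not in seen:
            seen.add(u)
            yield u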

mjid13 commented 2 weeks ago

Could my laptop's CPU or memory be the reason for the issue?