I am using Scrapy to scrape a large number of URLs that I read from a CSV file. However, I've noticed that not all URLs are being processed, particularly when the list is large. For example, if I include 200 URLs in the CSV, I only get results for about 150. With 100 URLs, it returns even fewer results, approximately 60.
import pandas as pd
import scrapy


class MySpider(scrapy.Spider):
    name = 'my_spider'

    file_path = 'some_csv.csv'
    df = pd.read_csv(file_path)
    start_urls = df["website"].tolist()

    def start_requests(self):
        for u in self.start_urls:
            yield scrapy.Request(
                u,
                errback=self.errback_httpbin,
                dont_filter=True,
            )

    def parse(self, response):
        yield {
            'url': response.url,
            'body': response.body,
        }

    def errback_httpbin(self, failure):
        # Log failed requests so dropped URLs show up in the output
        self.logger.error('Request failed: %s', failure.request.url)
Scrapy Settings: I'm using the default settings for Scrapy.
Logs/Error Messages: The logs don't show any specific errors related to the dropped requests, but I do see that not all requests are logged as processed.
Environment:
Python version: 3.11.0
Scrapy version: 2.11.1
Operating system: Windows 11
Question: What could be causing Scrapy to not process all URLs, and how can I ensure that every URL from the CSV is addressed? Could there be an issue with how Scrapy handles large sets of URLs or potentially with how the requests are being managed?
I've checked to ensure the CSV does not contain invalid or duplicate URLs. I've also monitored the logs to see if there's any pattern to the URLs that aren't being processed, but nothing stands out.