Open GeorgeA92 opened 6 days ago
Yeah, we don't have a direct setting for an "RPM limit", so it's kinda expected that we don't guarantee that requests are made slowly. But at least https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned mentions `DOWNLOAD_DELAY`.

> it should be increased to at least `4`

Shouldn't it depend on the concurrency settings and the desired RPM, or do you just mean it's a good default with the default concurrency settings? I wonder what the default RPM would be with an ideal server, (8 / 4) * 60 = 120?
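One sketch of the arithmetic behind that 120 guess, under an assumption of mine that is not verified against the scheduler internals: `DOWNLOAD_DELAY` gates each downloader slot independently, so the total rate only reaches this ceiling when all 8 slots stay busy (e.g. 8 different domains under the default per-domain limit):

```python
# Back-of-envelope for the "(8 / 4) * 60 = 120" guess above.
# Assumption (mine): DOWNLOAD_DELAY throttles each downloader slot
# independently, so the total rate scales with the number of busy slots.
DOWNLOAD_DELAY = 4       # seconds between requests within one slot
CONCURRENT_REQUESTS = 8  # global cap on simultaneously active requests

per_slot_rpm = 60 / DOWNLOAD_DELAY                  # 15 requests/min per slot
max_total_rpm = CONCURRENT_REQUESTS * per_slot_rpm  # ceiling if all slots busy

print(per_slot_rpm, max_total_rpm)  # -> 15.0 120.0
```

For a single-domain crawl only one slot is active, so the effective rate would be the 15 rpm figure rather than 120.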
Description

On runs with the default value of the `DOWNLOAD_DELAY` setting (0), the request sending rate is limited only by CPU capabilities until the number of sent requests reaches `CONCURRENT_REQUESTS_PER_DOMAIN` (default: 8, or `CONCURRENT_REQUESTS` if its value is lower). As far as I could tell, this isn't mentioned anywhere in the Scrapy documentation.

Steps to Reproduce
Here is a script that reproduces this:

script.py

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.signals import request_reached_downloader


class BooksToScrapeSpider(scrapy.Spider):
    name = "books"
    custom_settings = {"CONCURRENT_REQUESTS": 8}

    def request_reached_downloader(self, request, spider):
        self.logger.info(f"request to {request.url}")

    def start_requests(self):
        self.crawler.signals.connect(
            self.request_reached_downloader, signal=request_reached_downloader
        )
        yield scrapy.Request(
            url='http://books.toscrape.com/catalogue/page-1.html',
            callback=self.parse,
        )

    def parse(self, response):
        for book in response.css("h3 a::attr(href)").getall():
            yield scrapy.Request(url=response.urljoin(book), callback=self.parse_product)
        # if next_page := response.css("li.next a::attr(href)").get():
        #     yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse, priority=-10)

    def parse_product(self, response):
        response.css('li.current::text').get()


if __name__ == "__main__":
    p = CrawlerProcess(settings={"LOG_DATEFORMAT": ""})
    p.crawl(BooksToScrapeSpider)
    p.start()
```

log_output
```
2024-09-12 16:19:15,967 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-09-12 16:19:16,000 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w  11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22631-SP0
2024-09-12 16:19:16,017 [scrapy.addons] INFO: Enabled addons: []
2024-09-12 16:19:16,031 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-09-12 16:19:16,062 [scrapy.extensions.telnet] INFO: Telnet Password: b8049b496606c603
2024-09-12 16:19:16,317 [scrapy.middleware] INFO: Enabled extensions: ['scrapy.extensions.corestats.CoreStats', 'scrapy.extensions.telnet.TelnetConsole', 'scrapy.extensions.logstats.LogStats']
2024-09-12 16:19:16,317 [scrapy.crawler] INFO: Overridden settings: {'CONCURRENT_REQUESTS': 8, 'LOG_DATEFORMAT': ''}
2024-09-12 16:19:16,605 [scrapy.middleware] INFO: Enabled downloader middlewares: ['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware', 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware', 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware', 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware', 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware', 'scrapy.downloadermiddlewares.retry.RetryMiddleware', 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware', 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware', 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware', 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware', 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware', 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-09-12 16:19:16,607 [scrapy.middleware] INFO: Enabled spider middlewares: ['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware', 'scrapy.spidermiddlewares.referer.RefererMiddleware', 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware', 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-09-12 16:19:16,613 [scrapy.middleware] INFO: Enabled item pipelines: []
2024-09-12 16:19:16,613 [scrapy.core.engine] INFO: Spider opened
2024-09-12 16:19:16,640 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-12 16:19:16,640 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-12 16:19:21,130 [books] INFO: request to http://books.toscrape.com/catalogue/page-1.html
2024-09-12 16:19:21,957 [scrapy.core.engine] DEBUG: Crawled (200)
```

From the log output we see that the first 8 requests were sent between 2024-09-12 16:19:22,093 and 2024-09-12 16:19:22,098 (5 milliseconds), at a rate of 1600 rpm (on my hardware).
Any application with `DOWNLOAD_DELAY=0` and some custom code that schedules new requests on the `spider_idle` signal will produce such a "burst" of requests, at a rate of over 100 rpm, on each `spider_idle` signal event.

Reproduces how often: 100% if `DOWNLOAD_DELAY` is `0` (the default value). Every Scrapy application where the default value of `DOWNLOAD_DELAY` was not changed is affected by this.

Additional context
Despite the general recommendation about scraping politely, a lot of Scrapy users are not aware of the consequences of `DOWNLOAD_DELAY=0` on runs with the current default settings, and don't override it. This request sending pattern (8 very fast requests, with the 9th sent right after the 1st response) can easily be recognised by antibots.

The obvious way to fully prevent this (as Scrapy's default behaviour) is to make a backward-incompatible change and increase the default value of `DOWNLOAD_DELAY` to a bigger non-zero value. Taking into account the default value of `RANDOMIZE_DOWNLOAD_DELAY=True` and its logic from:

https://github.com/scrapy/scrapy/blob/e8cb5a03b382b98f2c8945355076390f708b918d/scrapy/core/downloader/__init__.py#L41-L43

it should be increased to at least `4`.
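For reference, the randomization logic at those lines boils down to roughly the following (a paraphrase, not a verbatim copy of the linked code):

```python
import random


def download_delay(delay: float, randomize_delay: bool = True) -> float:
    # With RANDOMIZE_DOWNLOAD_DELAY=True (the default), Scrapy picks a
    # uniform value between 0.5x and 1.5x of DOWNLOAD_DELAY for each request.
    if randomize_delay:
        return random.uniform(0.5 * delay, 1.5 * delay)
    return delay


# With DOWNLOAD_DELAY=4 the effective delay stays within [2.0, 6.0] seconds,
# so even the randomized minimum still leaves a 2-second gap between requests.
samples = [download_delay(4) for _ in range(1000)]
assert all(2.0 <= s <= 6.0 for s in samples)
```

That lower bound of `0.5 * delay` is why a default of at least `4` would still guarantee a 2-second minimum gap.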
If a Scrapy user still wants to reduce `DOWNLOAD_DELAY` to a lower value or to `0`, and to face the related consequences, that should require the user to set it explicitly by updating the related settings. (It should not be the default behaviour of Scrapy, as it is now.)
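Such an explicit opt-in could look like this in a project's `settings.py` (the values here are only an example of a user deliberately restating the fast defaults):

```python
# settings.py
# Explicit opt-in to the burst-prone behaviour that is currently implicit:
DOWNLOAD_DELAY = 0                  # no delay between requests, set knowingly
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # current default, restated for clarity
```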