scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Undocumented (performance) behaviour with default value of `DOWNLOAD_DELAY` setting (0) #6476

Open GeorgeA92 opened 6 days ago

GeorgeA92 commented 6 days ago

Description

On runs with the default value of the `DOWNLOAD_DELAY` setting (0), the request sending rate is limited only by CPU capabilities until the number of in-flight requests reaches `CONCURRENT_REQUESTS_PER_DOMAIN` (default: 8, or `CONCURRENT_REQUESTS` if its value is lower). As far as I can tell, this is not mentioned anywhere in the Scrapy documentation.

Steps to Reproduce

Here is a script that reproduces this:

script.py

```python
import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.signals import request_reached_downloader


class BooksToScrapeSpider(scrapy.Spider):
    name = "books"
    custom_settings = {"CONCURRENT_REQUESTS": 8}

    def request_reached_downloader(self, request, spider):
        self.logger.info(f"request to {request.url}")

    def start_requests(self):
        self.crawler.signals.connect(
            self.request_reached_downloader, signal=request_reached_downloader
        )
        yield scrapy.Request(
            url="http://books.toscrape.com/catalogue/page-1.html",
            callback=self.parse,
        )

    def parse(self, response):
        for book in response.css("h3 a::attr(href)").getall():
            yield scrapy.Request(url=response.urljoin(book), callback=self.parse_product)
        # if next_page := response.css("li.next a::attr(href)").get():
        #     yield scrapy.Request(url=response.urljoin(next_page), callback=self.parse, priority=-10)

    def parse_product(self, response):
        response.css("li.current::text").get()


if __name__ == "__main__":
    p = CrawlerProcess(settings={"LOG_DATEFORMAT": ""})
    p.crawl(BooksToScrapeSpider)
    p.start()
```
log_output

```
2024-09-12 16:19:15,967 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: scrapybot)
2024-09-12 16:19:16,000 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.6.0, w3lib 1.21.0, Twisted 22.10.0, Python 3.10.8 | packaged by conda-forge | (main, Nov 24 2022, 14:07:00) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 23.0.0 (OpenSSL 1.1.1w 11 Sep 2023), cryptography 39.0.1, Platform Windows-10-10.0.22631-SP0
2024-09-12 16:19:16,017 [scrapy.addons] INFO: Enabled addons: []
2024-09-12 16:19:16,031 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2024-09-12 16:19:16,062 [scrapy.extensions.telnet] INFO: Telnet Password: b8049b496606c603
2024-09-12 16:19:16,317 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-09-12 16:19:16,317 [scrapy.crawler] INFO: Overridden settings: {'CONCURRENT_REQUESTS': 8, 'LOG_DATEFORMAT': ''}
2024-09-12 16:19:16,605 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-09-12 16:19:16,607 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-09-12 16:19:16,613 [scrapy.middleware] INFO: Enabled item pipelines: []
2024-09-12 16:19:16,613 [scrapy.core.engine] INFO: Spider opened
2024-09-12 16:19:16,640 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-09-12 16:19:16,640 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-09-12 16:19:21,130 [books] INFO: request to http://books.toscrape.com/catalogue/page-1.html
2024-09-12 16:19:21,957 [scrapy.core.engine] DEBUG: Crawled (200) (referer: None)
2024-09-12 16:19:22,093 [books] INFO: request to http://books.toscrape.com/catalogue/scott-pilgrims-precious-little-life-scott-pilgrim-1_987/index.html
2024-09-12 16:19:22,093 [books] INFO: request to http://books.toscrape.com/catalogue/set-me-free_988/index.html
2024-09-12 16:19:22,093 [books] INFO: request to http://books.toscrape.com/catalogue/shakespeares-sonnets_989/index.html
2024-09-12 16:19:22,098 [books] INFO: request to http://books.toscrape.com/catalogue/starving-hearts-triangular-trade-trilogy-1_990/index.html
2024-09-12 16:19:22,098 [books] INFO: request to http://books.toscrape.com/catalogue/the-black-maria_991/index.html
2024-09-12 16:19:22,098 [books] INFO: request to http://books.toscrape.com/catalogue/the-boys-in-the-boat-nine-americans-and-their-epic-quest-for-gold-at-the-1936-berlin-olympics_992/index.html
2024-09-12 16:19:22,098 [books] INFO: request to http://books.toscrape.com/catalogue/the-coming-woman-a-novel-based-on-the-life-of-the-infamous-feminist-victoria-woodhull_993/index.html
2024-09-12 16:19:22,098 [books] INFO: request to http://books.toscrape.com/catalogue/the-dirty-little-secrets-of-getting-your-dream-job_994/index.html
2024-09-12 16:19:22,324 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,324 [books] INFO: request to http://books.toscrape.com/catalogue/its-only-the-himalayas_981/index.html
2024-09-12 16:19:22,539 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,540 [books] INFO: request to http://books.toscrape.com/catalogue/libertarianism-for-beginners_982/index.html
2024-09-12 16:19:22,549 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,549 [books] INFO: request to http://books.toscrape.com/catalogue/mesaerion-the-best-science-fiction-stories-1800-1849_983/index.html
2024-09-12 16:19:22,554 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,556 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,556 [books] INFO: request to http://books.toscrape.com/catalogue/olio_984/index.html
2024-09-12 16:19:22,559 [books] INFO: request to http://books.toscrape.com/catalogue/our-band-could-be-your-life-scenes-from-the-american-indie-underground-1981-1991_985/index.html
2024-09-12 16:19:22,559 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,565 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,568 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,571 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,574 [books] INFO: request to http://books.toscrape.com/catalogue/rip-it-up-and-start-again_986/index.html
2024-09-12 16:19:22,575 [books] INFO: request to http://books.toscrape.com/catalogue/the-requiem-red_995/index.html
2024-09-12 16:19:22,575 [books] INFO: request to http://books.toscrape.com/catalogue/sapiens-a-brief-history-of-humankind_996/index.html
2024-09-12 16:19:22,578 [books] INFO: request to http://books.toscrape.com/catalogue/sharp-objects_997/index.html
2024-09-12 16:19:22,761 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,761 [books] INFO: request to http://books.toscrape.com/catalogue/soumission_998/index.html
2024-09-12 16:19:22,767 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,773 [books] INFO: request to http://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html
2024-09-12 16:19:22,776 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,776 [books] INFO: request to http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
2024-09-12 16:19:22,786 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,791 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,793 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,796 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,796 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,972 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,987 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:22,987 [scrapy.core.engine] DEBUG: Crawled (200) (referer: http://books.toscrape.com/catalogue/page-1.html)
2024-09-12 16:19:23,101 [scrapy.core.engine] INFO: Closing spider (finished)
2024-09-12 16:19:23,101 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7055,
 'downloader/request_count': 21,
 'downloader/request_method_count/GET': 21,
 'downloader/response_bytes': 78821,
 'downloader/response_count': 21,
 'downloader/response_status_count/200': 21,
 'elapsed_time_seconds': 6.460656,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2024, 9, 12, 14, 19, 23, 101159, tzinfo=datetime.timezone.utc),
 'httpcompression/response_bytes': 407846,
 'httpcompression/response_count': 21,
 'items_per_minute': None,
 'log_count/DEBUG': 22,
 'log_count/INFO': 31,
 'request_depth_max': 1,
 'response_received_count': 21,
 'responses_per_minute': None,
 'scheduler/dequeued': 21,
 'scheduler/dequeued/memory': 21,
 'scheduler/enqueued': 21,
 'scheduler/enqueued/memory': 21,
 'start_time': datetime.datetime(2024, 9, 12, 14, 19, 16, 640503, tzinfo=datetime.timezone.utc)}
2024-09-12 16:19:23,101 [scrapy.core.engine] INFO: Spider closed (finished)
```

From the log output we see that the first 8 requests were sent between 2024-09-12 16:19:22,093 and 2024-09-12 16:19:22,098 (5 milliseconds), i.e. at a rate of 1600 requests per second (on my hardware).

Any application with `DOWNLOAD_DELAY=0` and some custom code that schedules new requests on the `spider_idle` signal will produce such a "burst" of requests, at a rate well over 100 rpm, on each `spider_idle` event.
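The burst can be modelled with a simplified sketch of the downloader-slot gating (an illustrative assumption, not Scrapy's actual code): with no delay, a slot keeps dequeuing requests until it hits its concurrency cap, so the first batch effectively leaves at once.

```python
def requests_sent_immediately(pending: int, concurrency: int, delay: float) -> int:
    """Simplified model of a downloader slot: with delay == 0 the slot
    pops requests until the concurrency cap is reached, so up to
    `concurrency` requests leave in one burst; with a positive delay,
    only one request is released per delay tick."""
    if delay > 0:
        return min(pending, 1)
    return min(pending, concurrency)


# With the default DOWNLOAD_DELAY = 0 and CONCURRENT_REQUESTS_PER_DOMAIN = 8,
# 20 pending requests produce an initial burst of 8:
print(requests_sent_immediately(20, 8, 0.0))  # 8
# With any positive delay, only a single request is released at a time:
print(requests_sent_immediately(20, 8, 2.0))  # 1
```

Under this model the burst size is the concurrency cap whenever the delay is zero, which matches the 8-request spike visible in the log above.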

Reproduces how often: 100% if `DOWNLOAD_DELAY` is 0 (the default value). Every Scrapy application that does not override the default `DOWNLOAD_DELAY` is affected by this.

Additional context

Despite the general recommendation to scrape politely, many Scrapy users are not aware of the consequences of `DOWNLOAD_DELAY=0` under the current default settings and do not override it. This request-sending pattern (8 requests in very quick succession, with the 9th sent right after the 1st response arrives) can easily be recognised by antibot systems.

The obvious way to prevent this completely (as Scrapy's default behaviour) is to make a backward-incompatible change and increase the default value of `DOWNLOAD_DELAY` to a larger, non-zero value. Taking into account the default value of `RANDOMIZE_DOWNLOAD_DELAY=True` and the corresponding logic in:

https://github.com/scrapy/scrapy/blob/e8cb5a03b382b98f2c8945355076390f708b918d/scrapy/core/downloader/__init__.py#L41-L43

it should be increased to at least 4.
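For reference, the randomization logic linked above boils down to drawing the effective delay uniformly from [0.5 × delay, 1.5 × delay] when `RANDOMIZE_DOWNLOAD_DELAY` is enabled (paraphrased here as a standalone sketch, not the exact Scrapy source):

```python
import random


def effective_download_delay(delay: float, randomize: bool = True) -> float:
    # Paraphrase of the slot-delay logic in scrapy/core/downloader/__init__.py:
    # with RANDOMIZE_DOWNLOAD_DELAY enabled, the effective delay is drawn
    # uniformly from [0.5 * delay, 1.5 * delay].
    if randomize:
        return random.uniform(0.5 * delay, 1.5 * delay)
    return delay
```

So with `DOWNLOAD_DELAY = 4` the effective per-request delay ranges from 2 to 6 seconds, while with `DOWNLOAD_DELAY = 0` the randomization has no effect, since the interval collapses to [0, 0].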

If a Scrapy user still wants to reduce `DOWNLOAD_DELAY`, or set it to 0 and accept the related consequences, that should require an explicit settings change by the user; it should not be Scrapy's default behaviour, as it is now.
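For example, a project that genuinely wants the fast behaviour could opt in explicitly in its settings file (the values here are illustrative, not recommendations):

```python
# settings.py: opt in to aggressive crawling explicitly instead of
# relying on the implicit default of DOWNLOAD_DELAY = 0.
DOWNLOAD_DELAY = 0                  # explicit: no delay between requests
RANDOMIZE_DOWNLOAD_DELAY = True     # has no effect when the delay is 0
CONCURRENT_REQUESTS_PER_DOMAIN = 8  # burst size with zero delay
```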

wRAR commented 6 days ago

Yeah, we don't have a direct setting for "RPM limit" so it's kiiinda expected that we don't guarantee that requests are made slowly. But at least https://docs.scrapy.org/en/latest/topics/practices.html#avoiding-getting-banned mentions DOWNLOAD_DELAY.

> it should be increased to at least 4

Shouldn't it depend on the concurrency settings and the desired RPM, or do you just mean it's a good default with the default concurrency settings? I wonder what the default RPM would be with an ideal server, (8 / 4) * 60 = 120?
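As a quick check, here is the arithmetic from the question above next to an alternative reading in which the per-slot delay serializes requests, so concurrency does not multiply throughput (which reading applies depends on how the delay gates a slot, so both are shown):

```python
delay = 4.0      # hypothetical default DOWNLOAD_DELAY, in seconds
concurrency = 8  # default CONCURRENT_REQUESTS_PER_DOMAIN

# Reading 1: each of the 8 concurrent requests independently waits 4 s,
# the arithmetic used in the comment above:
rpm_concurrent = (concurrency / delay) * 60
print(rpm_concurrent)  # 120.0

# Reading 2: the delay gates the slot as a whole, so requests to one
# domain are spaced 4 s apart regardless of concurrency:
rpm_serialized = 60 / delay
print(rpm_serialized)  # 15.0
```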