scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Allow disabling the AutoThrottle extension for a given slot #6246

Closed Gallaecio closed 2 months ago

codecov[bot] commented 2 months ago

Codecov Report

Merging #6246 (d357689) into master (2d46b4a) will increase coverage by 0.00%. The diff coverage is 100.00%.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master    #6246   +/-   ##
=======================================
  Coverage   88.90%   88.90%
=======================================
  Files         161      161
  Lines       11790    11792    +2
  Branches     1913     1913
=======================================
+ Hits        10482    10484    +2
  Misses        980      980
  Partials      328      328
```

| [Files](https://app.codecov.io/gh/scrapy/scrapy/pull/6246?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapy) | Coverage Δ | |
|---|---|---|
| [scrapy/core/downloader/\_\_init\_\_.py](https://app.codecov.io/gh/scrapy/scrapy/pull/6246?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapy#diff-c2NyYXB5L2NvcmUvZG93bmxvYWRlci9fX2luaXRfXy5weQ==) | `91.83% <100.00%> (+0.11%)` | :arrow_up: |
| [scrapy/extensions/throttle.py](https://app.codecov.io/gh/scrapy/scrapy/pull/6246?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=scrapy#diff-c2NyYXB5L2V4dGVuc2lvbnMvdGhyb3R0bGUucHk=) | `100.00% <100.00%> (ø)` | |
GeorgeA92 commented 2 months ago

@Gallaecio I didn't notice this earlier. As the developer responsible for preparing and testing the actual implementation of DOWNLOAD_SLOTS, I don't like the approach this pull request takes to implement compatibility with the AutoThrottle extension.

During preparation of https://github.com/scrapy/scrapy/pull/5328, compatibility with the AutoThrottle extension was not mentioned as a requirement. It was tested with the default settings (AutoThrottle disabled).

Updating the AutoThrottle extension to make it compatible with DOWNLOAD_SLOTS setting values makes sense only if all of its settings (AUTOTHROTTLE_TARGET_CONCURRENCY, AUTOTHROTTLE_ENABLED, AUTOTHROTTLE_START_DELAY, AUTOTHROTTLE_MAX_DELAY) can also be set on a per-download-slot basis, for example by adding the corresponding keys to the existing DOWNLOAD_SLOTS setting. From a Scrapy user's perspective, this is most likely the expected behaviour for DOWNLOAD_SLOTS with the AutoThrottle extension enabled.

Disabling throttling for a specific slot, by contrast, amounts to applying only AUTOTHROTTLE_ENABLED per slot.

And there is no need to make backward-incompatible changes to the scrapy/core downloader slot to implement this.

Technically, nothing prevents us from updating the downloader.per_slot_settings dict (used for the slots initially listed in the DOWNLOAD_SLOTS setting) at runtime from the AutoThrottle extension, as this extension already affects downloader variables directly.
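A minimal sketch of that last idea: an extension mutating the downloader's per-slot settings dict at runtime instead of patching core code. The downloader class and the shape of the per-slot entries below are stand-ins for illustration, not Scrapy's actual API.

```python
# Stand-in for the downloader object; in Scrapy, per_slot_settings is seeded
# from the DOWNLOAD_SLOTS setting at startup.
class FakeDownloader:
    def __init__(self, download_slots):
        self.per_slot_settings = dict(download_slots)


def disable_throttle_for_slot(downloader, slot_name):
    # What an extension's spider_opened handler could do at runtime: add or
    # update a per-slot entry without touching core downloader code. The
    # "throttle" key is an assumption for this sketch.
    downloader.per_slot_settings.setdefault(slot_name, {})["throttle"] = False


downloader = FakeDownloader(
    {"quotes.toscrape.com": {"concurrency": 1, "delay": 2}}
)
disable_throttle_for_slot(downloader, "quotes.toscrape.com")
```

The point is that the extension only writes to a dict the downloader already consults, so no downloader code changes are needed.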

Gallaecio commented 2 months ago

So, what you are suggesting is to implement an AutoThrottle-specific setting, e.g. AUTOTHROTTLE_SKIP_SLOTS? That sounds better to me indeed for the scenario mentioned in the docs change here.

However, my end goal here, the reason why I implemented this, is to be able to disable AutoThrottle on slots created by scrapy-zyte-api, so that we can implement latency reporting without triggering AutoThrottle. For that, I don’t think a setting would be a good choice.

I am completely open to suggestions, though. I really do not like modifying a core component to control an extension behavior; it is simply the cleanest thing I could come up with.

GeorgeA92 commented 2 months ago

@Gallaecio

> So, what you are suggesting is to implement an AutoThrottle-specific setting, e.g. AUTOTHROTTLE_SKIP_SLOTS? That sounds better to me indeed for the scenario mentioned in the docs change here.

I mean something like `"quotes.toscrape.com": {"concurrency": 1, "delay": 2, "randomize_delay": False, "AUTOTHROTTLE_START_DELAY": 5, "AUTOTHROTTLE_MAX_DELAY": 25}`, etc.
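Spelled out as a settings fragment, the proposed DOWNLOAD_SLOTS shape would look something like this. The per-slot AUTOTHROTTLE_* keys are part of this proposal only, not a released Scrapy feature.

```python
# Sketch of DOWNLOAD_SLOTS with per-slot AutoThrottle keys (proposed, not
# implemented in Scrapy at the time of this discussion).
DOWNLOAD_SLOTS = {
    "quotes.toscrape.com": {
        "concurrency": 1,
        "delay": 2,
        "randomize_delay": False,
        "AUTOTHROTTLE_START_DELAY": 5,
        "AUTOTHROTTLE_MAX_DELAY": 25,
    },
    "books.toscrape.com": {
        # Slots without AUTOTHROTTLE_* keys would fall back to the
        # project-wide AutoThrottle settings.
        "concurrency": 8,
    },
}
```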

> However, my end goal here, the reason why I implemented this, is to be able to disable AutoThrottle on slots created by scrapy-zyte-api, so that we can https://github.com/scrapy-plugins/scrapy-zyte-api/pull/59#discussion_r1020056364. For that, I don’t think a setting would be a good choice.

This would require updating the parameters of all download slots with `_slot_prefix = "zyte-api@"`, which is not something DOWNLOAD_SLOTS supports (it would require hardcoding all expected prefixed slot names).

> I am completely open to suggestions, though. I really do not like modifying a core component to control an extension behavior; it is simply the cleanest thing I could come up with.

My suggestion is to update a method of the AutoThrottle extension, `_adjust_delay` or `_response_downloaded`, so that it makes no changes when the slot has `_slot_prefix`. I mean changing this at runtime, on the active AutoThrottle instance, from `ScrapyZyteAPIDownloaderMiddleware.open_spider`, i.e. its `spider_opened` signal handler. It can be:

  1. monkeypatching the `_adjust_delay` method, or
  2. disconnecting the `AutoThrottle._response_downloaded` signal handler and connecting an updated version of it (I have done similar tricks with signal handlers before).

This approach doesn't require changing either the downloader or the AutoThrottle extension (only the zyte-api middleware code needs to be updated).
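Option 2 above can be sketched roughly as follows. The signal manager here is a stdlib stand-in for Scrapy's SignalManager, and the handler names and the `"zyte-api@"` prefix check are illustrative only.

```python
# Minimal stand-in for a signal dispatcher (scrapy.signalmanager in reality).
class FakeSignalManager:
    def __init__(self):
        self.handlers = []

    def connect(self, handler):
        self.handlers.append(handler)

    def disconnect(self, handler):
        self.handlers.remove(handler)

    def send(self, **kwargs):
        for handler in list(self.handlers):
            handler(**kwargs)


adjusted_slots = []


def autothrottle_response_downloaded(slot_key):
    # Stands in for AutoThrottle._response_downloaded / _adjust_delay.
    adjusted_slots.append(slot_key)


signals = FakeSignalManager()
signals.connect(autothrottle_response_downloaded)

# What the scrapy-zyte-api middleware could do from its spider_opened
# handler: swap the original handler for a filtered wrapper.
signals.disconnect(autothrottle_response_downloaded)


def filtered_handler(slot_key):
    if slot_key.startswith("zyte-api@"):
        return  # leave scrapy-zyte-api slots unthrottled
    autothrottle_response_downloaded(slot_key)


signals.connect(filtered_handler)

signals.send(slot_key="zyte-api@example.com")  # skipped by the wrapper
signals.send(slot_key="example.com")           # still throttled
```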

Gallaecio commented 2 months ago

What about exposing an AUTOTHROTTLE_SLOT_FILTER setting that accepts a Scrapy component for filtering, and setting that from scrapy-zyte-api to a component that filters out our slots?
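Such a filter component could be as small as the sketch below. Neither the AUTOTHROTTLE_SLOT_FILTER setting nor this interface exists in Scrapy; the class and method names are hypothetical.

```python
# Hypothetical slot-filter component that AutoThrottle would consult before
# adjusting a slot's delay.
class SkipZyteAPISlots:
    """Filter out slots created by scrapy-zyte-api (prefix is an assumption)."""

    prefix = "zyte-api@"

    def should_throttle(self, slot_key: str) -> bool:
        return not slot_key.startswith(self.prefix)


slot_filter = SkipZyteAPISlots()
```

scrapy-zyte-api would then point the hypothetical setting at this class, keeping the filtering policy in the plugin rather than in Scrapy core.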

GeorgeA92 commented 2 months ago

@Gallaecio If the end goal of this is to affect downloader slots from the zyte-api middleware/download handler, then (if possible, and I think it is) only the zyte-api extension code needs to be updated or fixed, with no additional changes to Scrapy itself. This is the logic behind the approach I proposed in https://github.com/scrapy/scrapy/pull/6246#issuecomment-1998697987

> What about exposing an AUTOTHROTTLE_SLOT_FILTER setting that accepts a Scrapy component for filtering, and setting that from scrapy-zyte-api to a component that filters out our slots?

With the approach I proposed earlier, the end user would get the expected result just by updating the version of the zyte-api middleware. This approach (adding another setting at the Scrapy code level), like the updates from this PR, also requires updating Scrapy to the latest version (the one that includes this change), which may not be possible for some projects.

Gallaecio commented 2 months ago

> only the zyte-api extension code needs to be updated

Yes, but if I understood correctly, you are suggesting to monkey-patch Scrapy code from the extension, which I would like to avoid.

> requires updating Scrapy to the latest version

I think this is OK, since the only thing we need this for is to support latency reporting for responses coming from scrapy-zyte-api, and most users will probably not care about that.

GeorgeA92 commented 1 month ago

@Gallaecio

> I think this is OK, since the only thing we need this for is to support latency reporting for responses coming from scrapy-zyte-api, and most users will probably not care about that.

As far as I know, the scrapy-zyte-api (handler) doesn't set download_latency the way regular Scrapy download handlers do. If so, then I'd prefer not to change that (or to put the download_latency values in some other meta key). AutoThrottle shouldn't make any changes to a slot if the download_latency meta value doesn't exist in the request, which means that the AutoThrottle extension shouldn't affect requests originating from scrapy-zyte-api:

https://github.com/scrapy/scrapy/blob/02b97f98e74a994ad3e4d74e7ed55207e508a576/scrapy/extensions/throttle.py#L64-L66
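The linked lines amount to an early-return guard, which can be paraphrased as below. This is a simplified sketch of the behavior being described, not the verbatim Scrapy code.

```python
# Paraphrase of the guard in AutoThrottle's response handler: if the request
# never recorded a download_latency meta value (or no slot is found), the
# extension makes no delay adjustment at all.
def adjust_delay_if_latency(meta, slot):
    latency = meta.get("download_latency")
    if latency is None or slot is None:
        return None  # AutoThrottle skips this response entirely
    return latency  # otherwise the latency feeds the delay adjustment
```

This is why requests whose handler never sets download_latency are effectively invisible to AutoThrottle.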

scrapy-zyte-api (handler) has its own logic that duplicates the logic of RetryMiddleware and the AutoThrottle extension (as mentioned in https://github.com/scrapy-plugins/scrapy-zyte-api/issues/99#issuecomment-1628946455). What if, for a single Scrapy request, the zyte-api download handler makes 4 retries and 1 200 response (5 requests) with latencies like [2.5, 6, 0.24, 1.5, 60]? The AutoThrottle extension can only accept download_latency as a single number.

In this case I propose to update scrapy-zyte-api to write (or not write) something in the download_latency request meta key as a trigger to disable/enable AutoThrottle for those requests. This approach looks more conventional, and I still don't see a reason to make backward-incompatible changes to Scrapy core parts.

Gallaecio commented 1 month ago

> As far as I know, the scrapy-zyte-api (handler) doesn't set download_latency the way regular Scrapy download handlers do

This was not done on purpose, though; it was something we accidentally forgot to implement, i.e. a bug. And by the time we realized it, we were already exploiting this bug to prevent AutoThrottle from affecting Zyte API traffic. But we do want to fix the bug, i.e. implement download_latency; we only need to first figure out a way to handle AutoThrottle that does not depend on download_latency not being defined.

> What if, for a single Scrapy request, the zyte-api download handler makes 4 retries and 1 200 response (5 requests) with latencies like [2.5, 6, 0.24, 1.5, 60]? The AutoThrottle extension can only accept download_latency as a single number.

I think this is fine. download_latency is the time the request spends in the download handler. If a request takes longer because of 429 or 521 retries, so be it, and the download_latency would in practice be the sum of the time every attempt took until success or exceeding retries.
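A quick check of this interpretation, using the retry latencies from the example above: download_latency would simply be the total time spent in the download handler across all attempts.

```python
# 4 retried attempts plus the final 200 response; values from the example
# earlier in the thread.
attempt_latencies = [2.5, 6, 0.24, 1.5, 60]

# Under this interpretation, download_latency is the total wall time the
# request spent in the download handler.
download_latency = sum(attempt_latencies)
```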

> I propose to update scrapy-zyte-api to write (or not write) something in the download_latency request meta key as a trigger to disable/enable AutoThrottle for those requests

I think putting a value in download_latency other than the time the request spent in the download handler would not be a good choice. It would break the expectations of user code that reads download_latency about what the key stands for.