scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

Undeprecate and add back to defaults the off-domain spider middleware #6364

Closed: Gallaecio closed this 2 weeks ago

Gallaecio commented 2 weeks ago

I wonder if we should keep both middlewares and have both enabled, because the spider middleware lets these off-domain requests skip the scheduler entirely.

https://github.com/scrapy/scrapy/pull/6358#issuecomment-2110774163

Gallaecio commented 2 weeks ago

@kmike I just remembered why I did not do that in the first place: I extended the request_scheduled signal to support IgnoreRequest, so the new downloader middleware also makes off-domain requests skip the scheduler:

https://github.com/scrapy/scrapy/blob/a40e7e5fade8c3696e6e6b5fe06306c738303dd5/scrapy/downloadermiddlewares/offsite.py#L17
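
As a rough illustration of the idea in the comment above, here is a minimal, Scrapy-free sketch of a "request scheduled" hook that can veto a request by raising an exception before it reaches the queue. Everything in it is hypothetical: `ToyEngine`, `ToyOffsiteFilter`, and `is_on_domain` are made-up names, and the local `IgnoreRequest` class is only a stand-in for `scrapy.exceptions.IgnoreRequest`. It does not reproduce the linked `offsite.py` code or Scrapy's real engine.

```python
from collections import deque
from urllib.parse import urlparse


class IgnoreRequest(Exception):
    """Local stand-in for scrapy.exceptions.IgnoreRequest (assumption)."""


def is_on_domain(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


class ToyOffsiteFilter:
    """Hypothetical filter that vetoes off-domain requests from a signal handler."""

    def __init__(self, allowed_domains):
        self.allowed_domains = allowed_domains

    def request_scheduled(self, url):
        # Raising here tells the toy engine not to enqueue the request.
        if not is_on_domain(url, self.allowed_domains):
            raise IgnoreRequest(url)


class ToyEngine:
    """Hypothetical engine: runs 'request scheduled' handlers before enqueueing."""

    def __init__(self):
        self.handlers = []
        self.queue = deque()   # plays the role of the scheduler queue
        self.dropped = []      # requests vetoed by a handler

    def schedule(self, url):
        try:
            for handler in self.handlers:
                handler(url)
        except IgnoreRequest:
            # A handler vetoed the request, so it never reaches the queue.
            self.dropped.append(url)
            return
        self.queue.append(url)


engine = ToyEngine()
engine.handlers.append(ToyOffsiteFilter(["example.com"]).request_scheduled)

for url in [
    "https://example.com/a",
    "https://sub.example.com/b",
    "https://other.org/c",
    "https://notexample.com/d",
]:
    engine.schedule(url)
```

In this toy model the off-domain URLs end up in `engine.dropped` and never enter `engine.queue`, which is the "skip the scheduler" effect the comment describes for the new downloader middleware.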

kmike commented 2 weeks ago

Nice solution 👍