rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

[dev] Optimize batch fetch method to boost throughput #269

Open NiuBlibing opened 1 year ago

NiuBlibing commented 1 year ago

Description

The previous start URL fetching method only runs when the spider is idle, so it does not reach full concurrency. This patch optimizes it by using the request_left_downloader signal.

A lock may be needed when calculating need_size.
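To make the idea concrete, here is a minimal, library-free sketch of the refill logic, not the code in this patch. `BatchFiller`, `max_in_flight`, `in_flight`, and the `deque` standing in for the Redis start URL list are all hypothetical names chosen for illustration; only `fill_requests_queue` and `need_size` come from the discussion above. It assumes the handler runs once each time a request leaves the downloader, and it guards the `need_size` calculation with a lock.

```python
import threading
from collections import deque


class BatchFiller:
    """Toy model: top up in-flight work whenever a download slot frees up."""

    def __init__(self, source, max_in_flight):
        self.source = source                # hypothetical stand-in for the Redis start URL list
        self.max_in_flight = max_in_flight  # hypothetical concurrency cap
        self.in_flight = 0
        self._lock = threading.Lock()

    def fill_requests_queue(self):
        """Pop just enough URLs to bring in-flight work back up to the cap."""
        with self._lock:
            # need_size is computed and consumed under the same lock, so two
            # concurrent callers cannot both claim the same free slots.
            need_size = self.max_in_flight - self.in_flight
            batch = []
            while len(batch) < need_size and self.source:
                batch.append(self.source.popleft())
            self.in_flight += len(batch)
        return batch

    def on_request_left_downloader(self):
        """Called when one request finishes: free its slot, then refill."""
        with self._lock:
            self.in_flight -= 1
        return self.fill_requests_queue()


urls = deque(f"https://example.com/{i}" for i in range(5))
filler = BatchFiller(urls, max_in_flight=3)
first = filler.fill_requests_queue()          # initial fill up to the cap
refill = filler.on_request_left_downloader()  # one slot freed, one URL fetched
print(len(first), len(refill), filler.in_flight)
```

In a real spider the handler would presumably be registered with `crawler.signals.connect(handler, signal=scrapy.signals.request_left_downloader)`. Whether a lock is actually required there depends on whether the handler can run concurrently, which this sketch does not settle.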

Fixes #119

How Has This Been Tested?

Test Configuration:

Checklist:

LuckyPigeon commented 1 year ago

@NiuBlibing Please resolve the assertion error and add a unit test for fill_requests_queue. Thanks!

LuckyPigeon commented 1 year ago

@rmax What do you think about this implementation? It disables the use of spider_idle. I wonder if we need a switch between spider_idle and fill_requests_queue.

rmax commented 1 year ago

Interesting use of the other signal. Which Scrapy version is required for the new signal?

What happens to existing users who override the spider_idle method?

Does it make sense to bump the major version? Or, on a related note, should we migrate to calendar versioning?