rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

Added maximum idle waiting time MAX_IDLE_TIME_BEFORE_CLOSE. #193

Status: Closed (nieweiming closed this 3 years ago)

nieweiming commented 3 years ago

Added a maximum idle waiting time setting, MAX_IDLE_TIME_BEFORE_CLOSE. Set MAX_IDLE_TIME_BEFORE_CLOSE in the settings to the maximum number of seconds to wait while idle. If it is unset or 0, the spider waits forever. MAX_IDLE_TIME_BEFORE_CLOSE does not affect how SCHEDULER_IDLE_BEFORE_CLOSE works.
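For illustration, the two settings might sit side by side in a project's settings module as below. This is only a sketch: the values are arbitrary examples, and MAX_IDLE_TIME_BEFORE_CLOSE exists only if the change proposed in this PR is applied.

```python
# settings.py (illustrative values only)

# Proposed in this PR: maximum number of seconds the spider may stay idle
# before it is allowed to close. Unset or 0 means wait forever.
MAX_IDLE_TIME_BEFORE_CLOSE = 60

# Existing scrapy-redis setting; per the description above it is unaffected.
SCHEDULER_IDLE_BEFORE_CLOSE = 10
```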

rmax commented 3 years ago

Thanks for taking the time to send the PR.

How is this different from using the SCHEDULER_IDLE_BEFORE_CLOSE setting? See https://github.com/rmax/scrapy-redis#usage

That feature uses a blocking redis operation to wait for the next request: https://github.com/rmax/scrapy-redis/blob/fff0d8279e021600537cc8645e63263ad99887c0/src/scrapy_redis/scheduler.py#L163-L164
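To make the idea of a blocking pop with a timeout concrete, here is a rough analogy that uses the standard library's queue.Queue as a stand-in for the Redis queue. It is not the scrapy-redis implementation; pop_with_timeout is a hypothetical helper, and treating a timeout of 0 as "do not block" is an assumption of this sketch.

```python
import queue


def pop_with_timeout(q, timeout):
    """Return the next item, or None if nothing arrives within `timeout` seconds.

    Hypothetical helper: queue.Queue stands in for the Redis-backed queue.
    """
    try:
        if timeout:
            # Block for up to `timeout` seconds waiting for an item.
            return q.get(timeout=timeout)
        # Assumption of this sketch: a timeout of 0 means a non-blocking pop.
        return q.get_nowait()
    except queue.Empty:
        return None


pending = queue.Queue()
pending.put("request-1")
first = pop_with_timeout(pending, 0.05)   # item is available immediately
second = pop_with_timeout(pending, 0.05)  # waits briefly, then gives up
```

The wait happens inside the pop itself, so the caller only learns that the queue stayed empty once the timeout has elapsed.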

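nieweiming commented 3 years ago

from scrapy import signals
from scrapy.exceptions import DontCloseSpider

from scrapy_redis import connection


class RedisMixin(object):
    def setup_redis(self, crawler=None):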
        ...
        self.server = connection.from_settings(crawler.settings)
        # The idle signal is called when the spider has no requests left,
        # that's when we will schedule new requests from redis queue
        crawler.signals.connect(self.spider_idle, signal=signals.spider_idle)

    def schedule_next_requests(self):
        """Schedules a request if available"""
        # TODO: While there is capacity, schedule a batch of redis requests.
        for req in self.next_requests():
            self.crawler.engine.crawl(req, spider=self)

    def spider_idle(self):
        """Schedules a request if available, otherwise waits."""
        # XXX: Handle a sentinel to close the spider.
        self.schedule_next_requests()
        raise DontCloseSpider

SCHEDULER_IDLE_BEFORE_CLOSE does not stop the crawler, because spider_idle always raises DontCloseSpider. I would like the spider to close on its own once the queue has been idle for a period of time. Otherwise a finished task stays in the running state and keeps occupying resources.
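The behavior requested here could be sketched roughly as follows. This is not the code from the PR: IdleTracker, its clock parameter, and the local DontCloseSpider class are hypothetical stand-ins so the example runs without Scrapy or Redis installed.

```python
import time


class DontCloseSpider(Exception):
    """Stand-in for scrapy.exceptions.DontCloseSpider."""


class IdleTracker:
    """Hypothetical sketch of an idle handler with a maximum idle time."""

    def __init__(self, max_idle_time=0, clock=time.monotonic):
        # max_idle_time mirrors MAX_IDLE_TIME_BEFORE_CLOSE; 0 means wait forever.
        self.max_idle_time = max_idle_time
        self._clock = clock
        self._idle_since = None

    def next_requests(self):
        """Placeholder for the Redis-backed request source."""
        return []

    def spider_idle(self):
        """Raise DontCloseSpider unless the idle limit has been reached."""
        if list(self.next_requests()):
            # New work arrived: reset the idle timer and keep the spider open.
            self._idle_since = None
            raise DontCloseSpider
        now = self._clock()
        if self._idle_since is None:
            self._idle_since = now
        if self.max_idle_time and now - self._idle_since >= self.max_idle_time:
            return  # Returning normally lets the spider close.
        raise DontCloseSpider


# Usage with a fake clock so no real waiting is needed.
fake_now = [0.0]
tracker = IdleTracker(max_idle_time=10, clock=lambda: fake_now[0])
try:
    tracker.spider_idle()
    kept_open = False
except DontCloseSpider:
    kept_open = True
fake_now[0] = 11.0
closed_result = tracker.spider_idle()  # idle limit exceeded, no exception
```

The injected clock is only there to make the sketch testable; in a real spider the idle signal handler would read the limit from the crawler settings.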

rmax commented 3 years ago

Oh, please update the readme too with this new setting 🚀