rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

Crawl urls not from corresponding redis key #185

Closed. nghuyong closed this issue 3 years ago.

nghuyong commented 3 years ago

I run two different spiders (spider_A and spider_B), and they have two different redis_keys (spider_A:start_urls and spider_B:start_urls).

I find that when I run spider_A first and then run spider_B, spider_A will crawl URLs from spider_B's redis_key (spider_B:start_urls). I added logging to mark the URLs read from redis; in spider_A's logs, the URLs are all read from spider_A:start_urls, and I don't find any URLs from spider_B:start_urls. But spider_A indeed crawled URLs from spider_B!
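For reference, this is roughly how the two keys get seeded with the default list type (a redis-py sketch; the host, port, and URLs are placeholders, not my actual values):

import redis

# Placeholder connection values; match your REDIS_URL.
server = redis.StrictRedis(host='localhost', port=6379)

# With the default REDIS_START_URLS_AS_SET = False, each key is a
# plain redis list, consumed by lpop in next_requests (shown below).
server.lpush('spider_A:start_urls', 'http://example.com/a')
server.lpush('spider_B:start_urls', 'http://example.com/b')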

spider_A settings:

# Ensure use this Scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Redis URL
REDIS_URL = 'redis://{}:{}'.format(REDIS_HOST, REDIS_PORT)

# Persist
SCHEDULER_PERSIST = True

REDIS_START_URLS_KEY = "spider_A:start_urls"

spider_B's settings are the same as spider_A's, except that REDIS_START_URLS_KEY = "spider_B:start_urls".
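Note: instead of setting REDIS_START_URLS_KEY in project-wide settings, scrapy-redis also lets each spider pin its own key via the redis_key class attribute on scrapy_redis.spiders.RedisSpider, which makes accidental sharing between spiders harder. A minimal sketch (class and spider names illustrative):

from scrapy_redis.spiders import RedisSpider

class SpiderA(RedisSpider):
    name = 'spider_A'
    # Pin the start-urls key on the class instead of in project
    # settings, so the two spiders cannot share it by accident.
    redis_key = 'spider_A:start_urls'

    def parse(self, response):
        # parsing logic goes here
        pass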

The logging code I added to mark URLs read from redis (this is next_requests from scrapy_redis.spiders.RedisMixin, with a debug log inserted):

def next_requests(self):
    """Returns a request to be scheduled or none."""
    use_set = self.settings.getbool('REDIS_START_URLS_AS_SET', False)
    fetch_one = self.server.spop if use_set else self.server.lpop
    # XXX: Do we need to use a timeout here?
    found = 0
    while found < self.redis_batch_size:
        data = fetch_one(self.redis_key)
        # added log: record which redis key each piece of data comes from
        self.logger.debug("read from %s: %r", self.redis_key, data)
        if not data:
            break
        req = self.make_request_from_data(data)
        if req:
            yield req
            found += 1
        else:
            self.logger.debug("Request not made from data: %r", data)
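One thing this log cannot see: next_requests only pops from redis_key (the start-urls list). Requests that were already scheduled travel through the scheduler's own queue, whose key defaults to '%(spider)s:requests' (SCHEDULER_QUEUE_KEY), and with SCHEDULER_PERSIST = True those requests survive between runs. So if the two spiders end up sharing that queue key (for example, by having the same name), spider_A could pick up spider_B's leftover requests without anything unusual appearing in this log. A quick redis-py sketch to inspect what is actually sitting in redis (connection values are placeholders, and 'spider_A:requests' assumes the spider's name is spider_A):

import redis

server = redis.StrictRedis(host='localhost', port=6379)

# List every key scrapy-redis may have left behind: start-urls lists,
# persisted request queues ('<spider>:requests', a zset for the
# default priority queue) and dupefilter sets ('<spider>:dupefilter').
for key in server.keys('*'):
    print(key, server.type(key))

# Peek at a persisted request queue without consuming it; ZRANGE
# works because the default queue is a sorted set of pickled requests.
for item in server.zrange('spider_A:requests', 0, 4):
    print(item[:120])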