I run two different spiders (spider_A and spider_B), each with its own redis_key (spider_A:start_urls and spider_B:start_urls).
When I start spider_A first and then start spider_B, spider_A ends up crawling URLs from spider_B's redis_key (spider_B:start_urls).
I added logging to record every URL read from Redis. In spider_A's logs, all URLs are read from spider_A:start_urls, and I can't find any read from spider_B:start_urls.
But spider_A did in fact crawl URLs from spider_B!
spider_A settings:
# Make sure this Scheduler is used
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Ensure all spiders share same duplicates filter through redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Redis URL
REDIS_URL = 'redis://{}:{}'.format(REDIS_HOST, REDIS_PORT)
# Persist
SCHEDULER_PERSIST = True
REDIS_START_URLS_KEY = "spider_A:start_urls"
spider_B's settings are the same as spider_A's, except that REDIS_START_URLS_KEY = "spider_B:start_urls".
Logging code that marks the URLs read from Redis:
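One thing that may be worth checking with these settings: as far as I know, scrapy-redis builds the scheduler queue and dupefilter keys from the spider's name, independently of REDIS_START_URLS_KEY. The sketch below is plain Python, not scrapy-redis code. The two templates are what I believe the library defaults to (an assumption to verify against the installed version), and the helper scheduler_keys is hypothetical. It only illustrates that two spiders with the same name would share one request queue even though their start-URL keys differ.

```python
# Assumed scrapy-redis defaults; verify against your installed version.
QUEUE_KEY_TEMPLATE = "%(spider)s:requests"
DUPEFILTER_KEY_TEMPLATE = "%(spider)s:dupefilter"


def scheduler_keys(spider_name):
    """Hypothetical helper: expand the key templates for one spider name."""
    return {
        "queue": QUEUE_KEY_TEMPLATE % {"spider": spider_name},
        "dupefilter": DUPEFILTER_KEY_TEMPLATE % {"spider": spider_name},
    }


# Distinct spider names give distinct scheduler keys.
keys_a = scheduler_keys("spider_A")
keys_b = scheduler_keys("spider_B")
print(keys_a["queue"], keys_b["queue"])

# If both spider classes carried the same name, the keys would collide.
same_name_collides = scheduler_keys("my_spider") == scheduler_keys("my_spider")
print(same_name_collides)
```

If the two spiders do have different names, this is probably not the cause, and explicitly set SCHEDULER_QUEUE_KEY or SCHEDULER_DUPEFILTER_KEY values would be the next thing to compare.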
def next_requests(self):
    """Returns a request to be scheduled or none."""
    use_set = self.settings.getbool('REDIS_START_URLS_AS_SET', False)
    fetch_one = self.server.spop if use_set else self.server.lpop
    # XXX: Do we need to use a timeout here?
    found = 0
    while found < self.redis_batch_size:
        data = fetch_one(self.redis_key)
        # here is the log code
        self.logger.debug(f'read from {self.redis_key}, {data}')
        if not data:
            break
        req = self.make_request_from_data(data)
        if req:
            yield req
            found += 1
        else:
            self.logger.debug("Request not made from data: %r", data)
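To check the loop above in isolation, without a Redis server, here is a minimal sketch of the same pop-and-log pattern. FakeServer and next_urls are hypothetical names, the URLs are placeholders, and the in-memory lists only imitate lpop. It is not the scrapy-redis implementation.

```python
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("spider_A")


class FakeServer:
    """Hypothetical in-memory stand-in for the Redis client (lists only)."""

    def __init__(self, lists):
        self.lists = {key: list(values) for key, values in lists.items()}

    def lpop(self, key):
        values = self.lists.get(key)
        return values.pop(0) if values else None


def next_urls(server, redis_key, batch_size):
    """Pop up to batch_size entries from redis_key, logging each read."""
    found = 0
    while found < batch_size:
        data = server.lpop(redis_key)
        logger.debug("read from %s, %r", redis_key, data)
        if not data:
            break  # queue is empty
        yield data
        found += 1


server = FakeServer({
    "spider_A:start_urls": ["http://a.example/1", "http://a.example/2"],
    "spider_B:start_urls": ["http://b.example/1"],
})
urls_for_a = list(next_urls(server, "spider_A:start_urls", batch_size=16))
```

In this sketch the loop only ever touches the key it is given, which matches what the logs show. That would suggest the spider_B URLs reach spider_A by some path other than the start-URL key, though I have not confirmed this.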