rmax / scrapy-redis

Redis-based components for Scrapy.
http://scrapy-redis.readthedocs.io
MIT License

Read one entry from redis queue #136

Open t75bernd opened 5 years ago

t75bernd commented 5 years ago

Hello,

first of all: great project, and easy to use.

I have a suggestion for solving a problem I ran into; let me explain:

  1. Add two URLs to the queue for one spider.
  2. The spider reads both entries and yields each of them as a request.
  3. Sometimes the following happens in my case: both requests are yielded, and each one receives a session key. The first request finishes and returns a bunch of new requests (yielding a new request is not possible), while the second yielded request waits until the first one has crawled an item (which can take several minutes). In the meantime, the session key of my second request expires, and the request is rejected when it is finally made.

My idea is to add a spider attribute that lets me specify that only one item is read from the queue and yielded as a request at a time.

class MySpider(RedisSpider):
    yield_1_request = True

And next_requests in spiders.py has to be changed to something like:

if req:
    yield req
    found += 1
    # Stop after the first request when the spider opts in.
    if getattr(self, 'yield_1_request', False) and not use_set:
        break
else:
    self.logger.debug("Request not made from data: %r", data)
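
For context, a minimal sketch of how the whole next_requests loop could look with this change; the surrounding structure follows the current spiders.py implementation (names like fetch_one, redis_batch_size, and make_request_from_data come from there), and yield_1_request is the attribute proposed above:

def next_requests(self):
    """Return requests to be scheduled, stopping after one if configured."""
    use_set = self.settings.getbool('REDIS_START_URLS_AS_SET')
    fetch_one = self.server.spop if use_set else self.server.lpop
    found = 0
    while found < self.redis_batch_size:
        data = fetch_one(self.redis_key)
        if not data:
            # Queue is empty.
            break
        req = self.make_request_from_data(data)
        if req:
            yield req
            found += 1
            # Proposed change: stop after a single request when the
            # spider opts in via yield_1_request.
            if getattr(self, 'yield_1_request', False) and not use_set:
                break
        else:
            self.logger.debug("Request not made from data: %r", data)

The not use_set guard keeps the early break limited to list-backed queues, matching the snippet above.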

This way you can still decide per spider whether this behavior is needed, and the change has little impact on the code. What do you think about the idea/implementation? Is there anything I could do better? I would also prepare a PR if this is accepted as a feature.

Thanks for your help in advance :)

LuckyPigeon commented 1 year ago

Thanks for your feedback, it's a practical idea. I'll look into it and put it on our upcoming roadmap.