scrapy / scrapy

Scrapy, a fast high-level web crawling & scraping framework for Python.
https://scrapy.org
BSD 3-Clause "New" or "Revised" License

General Message Queues as Storage for Requests #4326

Open whalebot-helmsman opened 4 years ago

whalebot-helmsman commented 4 years ago

It is a common request to use external message queues as storage for Scrapy requests (e.g. https://github.com/scrapy/scrapy/issues/3723). Several implementations already exist:

There is nothing special about these; they just came up as the first results of a GitHub search for scrapy <message_queue_name>.

Not so long ago, integrating non-disk and non-memory queues into Scrapy required a separate scheduler, so improvements to the Scrapy scheduler weren't shared across these implementations.

After the merge of https://github.com/scrapy/scrapy/pull/3884, adding different types of queues is much easier, and supporting them requires less maintenance.

There is no predefined list of message queues. Redis is a must-have; others are optional.

never2average commented 4 years ago

@whalebot-helmsman @Gallaecio, can I be assigned to work on this issue?

akshaysharmajs commented 4 years ago

I want to contribute to this issue for GSoC 2020 at Scrapinghub. I just have some questions in mind:

  1. Do we have to implement it using scrapy-redis, or build a new Python library for Scrapy to generate Redis-based message queues shareable across Scrapy?
  2. Which priority handling will be used for requests? @Gallaecio @whalebot-helmsman

Gallaecio commented 4 years ago

@never2average: This issue is up for the taking by any Google Summer of Code student, but we don't assign issues to specific students.

Students must write their own proposal through the Google platform, and multiple students may write a proposal for the same idea, just as students can submit multiple proposals for different ideas, even their own ideas.

If you are interested in working on this as a Google Summer of Code student, please carefully read http://gsoc2020.scrapinghub.com/participate

@AKSHAYSHARMAJS: From what @whalebot-helmsman wrote, I believe the goal is to create a single, brand-new library that offers integration with Scrapy for multiple message queuing technologies as scheduling queues.

@whalebot-helmsman is the expert here, so don’t take my word for it, but I think the goal is to implement such support by defining disk queue classes for the Scrapy scheduler. So you could combine the built-in scheduler priority queue classes of Scrapy with these new disk queue classes that would work with message queues. That’s my take after a careful read of https://github.com/scrapy/scrapy/pull/3884 (not an easy read!)
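
To make that concrete, here is a sketch of how such a combination might be wired in settings. SCHEDULER_PRIORITY_QUEUE and SCHEDULER_DISK_QUEUE are real Scrapy settings; the myqueues module and its queue class are hypothetical placeholders for the new library:

```python
# settings.py -- hypothetical wiring of a message-queue-backed disk queue
SCHEDULER_PRIORITY_QUEUE = "scrapy.pqueues.ScrapyPriorityQueue"  # built-in priority queue
SCHEDULER_DISK_QUEUE = "myqueues.PickleFifoRedisQueue"  # placeholder for the new queue class
```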

whalebot-helmsman commented 4 years ago

Let's describe the idea in more detail. We have a scheduling flow:

In https://github.com/scrapy/scrapy/pull/3884 a clear separation between them was introduced. The resulting interface for storing/retrieving queues:

def __init__(self, crawler, key): ...

@classmethod
def from_crawler(cls, crawler, key, *args, **kwargs): ...

def push(self, request): ...

def pop(self): ...

More details in https://github.com/scrapy/scrapy/blob/master/scrapy/squeues.py.
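
For illustration, a minimal sketch of what a Redis-backed queue conforming to this interface might look like. The class name and key scheme are assumptions, redis-py (mentioned later in this thread) is used as the interface library, and a real implementation would also convert requests to serializable dicts the way scrapy/squeues.py does:

```python
import pickle

import redis  # redis-py, an "interface library" in the terms used below


class FifoRedisQueue:
    """Hypothetical FIFO queue storing serialized requests in a Redis list."""

    def __init__(self, crawler, key):
        # Connection parameters would realistically come from crawler.settings.
        self.server = redis.Redis()
        self.key = key

    @classmethod
    def from_crawler(cls, crawler, key, *args, **kwargs):
        return cls(crawler, key)

    def push(self, request):
        # A real implementation would first convert the Request to a plain
        # dict, as the serialization helpers in scrapy/squeues.py do.
        self.server.rpush(self.key, pickle.dumps(request))

    def pop(self):
        data = self.server.lpop(self.key)
        if data is not None:
            return pickle.loads(data)

    def __len__(self):
        # The scheduler also needs to know how many requests are queued.
        return self.server.llen(self.key)
```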

The result of this GSoC project would be implementations of storing/retrieving queues, conforming to this interface, for different common message queues.

There is no predefined list of message queues. Redis is a must-have; others are optional.

There is no requirement to use the existing libraries, but looking at their code should help.

akshaysharmajs commented 4 years ago

@whalebot-helmsman Thanks for the explanation. My doubts are somewhat cleared now, and I will send a proper proposal once the application period starts.

faizan2700 commented 4 years ago

The result of this GSoC project would be implementations of storing/retrieving queues, conforming to this interface, for different common message queues.

There is no predefined list of message queues. Redis is a must-have; others are optional.

There is no requirement to use the existing libraries, but looking at their code should help.

@whalebot-helmsman When we are implementing other queues, can we use their interface libraries, like the redis-py library for Redis? (Which libraries did you mean in the last line, where you said there is no requirement to use existing libraries?)

whalebot-helmsman commented 4 years ago

@faizan2700 We are talking about two different types of message queue libraries: interface and scrapy-connection. E.g. redis-py is an interface library, while https://github.com/rmax/scrapy-redis is a scrapy-connection library. We should use interface libraries and not write our own. It is not required to use the scrapy-connection libraries.

Lukas0907 commented 4 years ago

I'm very interested in working on this issue as a GSoC project. I have prepared a draft of my application on Google Docs: https://docs.google.com/document/d/1ejSo4aOlcnZvaEiPjHTB7oGC_QGZ7HuVrWrHWF_qCaQ/edit?usp=sharing

I'd be very happy about feedback, especially about the scope of the project and the timeline.

Gallaecio commented 4 years ago

@Lukas0907 Looks really good to me!

It surprised me when I read about creating a new type of queue, but then you suggested merging disk queues and the new queue type together, so :+1:.

I do wonder what kind of changes to disk queues you have in mind, and if it would not be better to start by making those changes, and then start implementing the Redis-based queue (switching weeks 1 and 2).

akshaysharmajs commented 4 years ago

Is it necessary to use only pickle for serializing, or would it be better to let users decide which serializer to use, as per the following choices: scrapy.squeues.PickleFifoRedisQueue, scrapy.squeues.PickleLifoRedisQueue, scrapy.squeues.MarshalFifoRedisQueue, scrapy.squeues.MarshalLifoRedisQueue?

whalebot-helmsman commented 4 years ago

@Lukas0907

Thanks for the detailed proposal.

You propose the following way [1] of implementing external message queues:

DiskQueues[done]----------------------------------|
                                                  |
                ExternalQueues[todo]--------------|-----> UnitedQueues[destination]

I thought about ExternalQueues as a special case of DiskQueues [2]. What are the advantages of going way [1] compared to way [2]?

In your document you mention ZeroMQ. ZeroMQ is not a message queue; it is a protocol and concurrency framework. It isn't backed by storage, so I don't think we can use it for storing requests.

whalebot-helmsman commented 4 years ago

Is it necessary to use only pickle for serializing, or would it be better to let users decide which serializer to use, as per the following choices:

It isn't necessary. What are the advantages of using other serializers?
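
For context on how serializer variants could stay cheap to add, here is a sketch in the spirit of the wrapper helpers in scrapy/squeues.py: the base queue stores bytes, and thin wrappers bind a serializer. The Redis queue classes named here are hypothetical:

```python
import marshal
import pickle


def serializable_queue(queue_class, serialize, deserialize):
    """Wrap a bytes-oriented queue class with a (de)serializer pair."""

    class SerializableQueue(queue_class):
        def push(self, obj):
            super().push(serialize(obj))

        def pop(self):
            data = super().pop()
            if data is not None:
                return deserialize(data)

    return SerializableQueue


# Hypothetical variants, mirroring the disk-queue naming convention:
# PickleFifoRedisQueue = serializable_queue(FifoRedisQueue, pickle.dumps, pickle.loads)
# MarshalFifoRedisQueue = serializable_queue(FifoRedisQueue, marshal.dumps, marshal.loads)
```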

Lukas0907 commented 4 years ago

@Gallaecio

Thank you! :)

I do wonder what kind of changes to disk queues you have in mind, and if it would not be better to start by making those changes, and then start implementing the Redis-based queue (switching weeks 1 and 2).

You are right, that might be better and would make more sense. I will change it in the proposal. Thanks for the feedback.

@whalebot-helmsman

I thought about ExternalQueues as a special case of DiskQueues [2]. What are the advantages of going way [1] compared to way [2]?

I think the main problem is that a disk queue leaks the fact that it's backed by the file system in its "interface": to enable a disk queue, one has to use the JOBDIR setting, and the directory that JOBDIR points to is created. This happens regardless of what the actual implementation of the disk queue class looks like. However, for an external queue like Redis, the job directory is not needed, and hence using JOBDIR as a flag does not make much sense.

That's why I was thinking that the existing disk queue interface is inappropriate for Redis, and that's why we have to change it slightly so that it's more generic (hence disk-based queues being a special case of external queues).

In your document you mention ZeroMQ. ZeroMQ is not a message queue; it is a protocol and concurrency framework. It isn't backed by storage, so I don't think we can use it for storing requests.

Ah, thank you, ZeroMQ is probably not suitable then. I have not really researched how suitable other message queues are, to be honest. Since Redis is the number one requirement, I think it might be a good idea to postpone the research until the Redis implementation is finished to see what the actual requirements for an external data store are.

akshaysharmajs commented 4 years ago

I have shared a link to my GSoC application via email with the respective mentors (@whalebot-helmsman, @Gallaecio). Please review it and suggest changes. I would be happy to get feedback :)

whalebot-helmsman commented 4 years ago

To enable a disk queue, one has to use the JOBDIR setting, and the directory that JOBDIR points to is created. This happens regardless of what the actual implementation of the disk queue class looks like. However, for an external queue like Redis, the job directory is not needed, and hence using JOBDIR as a flag does not make much sense.

Yes, we don't store requests on disk in the case of an external message queue. But we still need JOBDIR for storing the state of all involved message queues: https://github.com/scrapy/scrapy/blob/0ee04e1e91f42d7fdd69f20b00a06e7856cdc919/scrapy/core/scheduler.py#L83-L87 . In the simplest case for Redis, we want to store the list of all topics storing requests.
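
Paraphrased, the linked scheduler code persists queue state along these lines (a sketch, not the exact source; the active.json file name follows the scheduler's convention):

```python
import json
from os.path import exists, join


def read_queue_state(jobdir):
    # Restore the list of active queues (e.g. Redis topics) on resume.
    path = join(jobdir, "active.json")
    if exists(path):
        with open(path) as f:
            return json.load(f)


def write_queue_state(jobdir, state):
    # Persist the list of active queues on close, so the crawl can resume.
    with open(join(jobdir, "active.json"), "w") as f:
        json.dump(state, f)
```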

Lukas0907 commented 4 years ago

Ah, I see what you mean now. In the case of Redis, we could also use Redis to store that information, right? But for other message queues that might not work...

whalebot-helmsman commented 4 years ago

we could also use Redis to store that information, right?

A Redis instance may store all kinds of key/value pairs. Some pairs may be unrelated to any spider, or several spiders may share the same instance. A spider needs some kind of token to filter out unrelated key/value pairs. Storing this token on disk looks reasonable even in the Redis case.
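
A sketch of that idea, with hypothetical names: generate a per-job token once, store it in JOBDIR, and derive every Redis key from it so unrelated key/value pairs are filtered out:

```python
import os
import uuid


def load_job_token(jobdir):
    # Reuse the token on resume; create it on the first run.
    path = os.path.join(jobdir, "queue_token.txt")
    if os.path.exists(path):
        with open(path) as f:
            return f.read().strip()
    os.makedirs(jobdir, exist_ok=True)
    token = uuid.uuid4().hex
    with open(path, "w") as f:
        f.write(token)
    return token


# All queue keys would then be namespaced by the token,
# e.g. f"{token}:requests:{priority}".
```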

Lukas0907 commented 4 years ago

That's true. I will update my proposal! Thanks for the feedback.

akshaysharmajs commented 4 years ago

@whalebot-helmsman As per your mail, let's take the discussion further here. The approach to implementing message queues mentioned in my proposal is a basic idea of how I thought it could be implemented. Is that the right approach, or are any changes needed?

Gallaecio commented 4 years ago

The approach to implementing message queues mentioned in my proposal is a basic idea of how I thought it could be implemented. Is that the right approach, or are any changes needed?

Could you link your draft proposal or explain what you mention here?

whalebot-helmsman commented 4 years ago

@AKSHAYSHARMAJS What is the advantage of creating a new type of message queue setting, SCHEDULER_MESSAGE_QUEUE, instead of implementing message queues on top of SCHEDULER_DISK_QUEUE?

There is a mention of Kombu in your proposal. Is it https://kombu.readthedocs.io/en/latest/introduction.html ? Kombu already supports Redis, AMQP and many others. Do we still need non-Kombu support for other message queues?

akshaysharmajs commented 4 years ago

Kombu already supports Redis, AMQP and many others. Do we still need non-Kombu support for other message queues?

Yes, Kombu supports Redis and AMQP. That's why I thought that after implementing Redis, and possibly other message queues, we could add Kombu support for automatic encoding, serialization and compression of messages. But currently we need non-Kombu support for message queues.
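
For reference, a minimal sketch of how Kombu abstracts the transport; only the connection URL is Redis-specific, and the queue name is a made-up example:

```python
from kombu import Connection

# The transport is chosen by the URL: "redis://", "amqp://", ...
with Connection("redis://localhost:6379/") as conn:
    queue = conn.SimpleQueue("scrapy_requests")  # hypothetical queue name
    queue.put({"url": "https://example.com"})  # Kombu handles serialization
    message = queue.get(block=True, timeout=1)
    print(message.payload)
    message.ack()
    queue.close()
```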

akshaysharmajs commented 4 years ago

What is the advantage of creating a new type of message queue setting, SCHEDULER_MESSAGE_QUEUE, instead of implementing message queues on top of SCHEDULER_DISK_QUEUE?

Thanks, @whalebot-helmsman! Earlier I thought maybe we could add a SCHEDULER_MESSAGE_QUEUE setting to avoid creating the job directory, but we need the states of the message queues for pausing and resuming crawls, as mentioned here. So the message queue should be implemented on top of SCHEDULER_DISK_QUEUE. I will change that in my proposal :)

faizan2700 commented 4 years ago

Hello @whalebot-helmsman, I just wanted to ask if there is some feature which did not get many applications. I have started writing my GSoC application for the Message Queues feature (this one), but I wanted to apply for something else in Scrapy too, just in case. I have been reading the Scrapy codebase for a while and I am familiar with it.

whalebot-helmsman commented 4 years ago

@faizan2700 I asked in our internal chat, but I wouldn't put a lot of hope in it. Having 2-3 proposals for a project is quite normal. If I wanted to select a second project for a proposal, I would just pick one that interests me from the list http://gsoc2020.scrapinghub.com/ideas#scrapy .

whalebot-helmsman commented 4 years ago

Thanks @AKSHAYSHARMAJS

faizan2700 commented 4 years ago

I have sent a mail to the respective mentors (@whalebot-helmsman, @Gallaecio) with my GSoC application attached. Please review it. Thank you!

whalebot-helmsman commented 4 years ago

@faizan2700 In the sections about Redis and Kafka, you write about error handling on the queue side. Scrapy has different mechanisms for error handling; they are more scraping-specific and, I think, better suit the case. Do you think Scrapy would benefit from queue-level error handling?

If processing is completed properly then the object will be removed from the processing queue and added in the finish queue.

I see creating a finish queue as an improvement for introspection and debugging. Which component or module of a running spider should handle this? Due to performance constraints, it may be better to use a Redis hash structure instead of a finish queue for duplicate filtering.
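
A sketch of what that could look like, using a Redis set (the closest built-in membership structure to what is described) with a hypothetical key; SADD reports whether the fingerprint was already seen, so no queue scan is needed:

```python
import redis

server = redis.Redis()


def is_duplicate(fingerprint, key="myspider:seen"):
    # SADD returns 0 if the member was already in the set.
    return server.sadd(key, fingerprint) == 0
```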

Could you describe the section about object tagging in more detail? Right now it looks like you are going in the opposite direction: from an existing solution to a possible problem.

P.S. Today is the last day of the student application period.

faizan2700 commented 4 years ago

I thought that when we are handling errors (like a page not being crawled properly, or a server error), it is more suitable to handle them at the queue or scheduler level, because it is the scheduler's job to decide which object should be processed next. More importantly, when we write support for distributed crawls, I think it will be easier to maintain if the queue storage and the error handling mechanism are close to each other. If the current mechanisms for error handling can be used in distributed crawls without any problem, then we can keep them. (I just wanted to point out that we have a considerable option at the queue level too.)

I think the scheduler (or another component close to the scheduler) should handle this, because if an object is not processed successfully, it will be the scheduler that adds it to the queue again.

I also think that a Redis hash structure is better than a finish queue, and Bloom filters are better than both; that's why I mentioned first the finish queue, then Redis hashes, and then Bloom filters.

The object tagging system is a little complicated: a user may need to crawl several types of pages, one type of page can have links to other types too, and for every type a different spider is necessary. In that case, every request object is associated with tags and is only allowed to be crawled by specific spiders.

faizan2700 commented 4 years ago

I think I will discard the object tagging system from my application; it doesn't seem necessary to support it.

faizan2700 commented 4 years ago

@whalebot-helmsman Would you like to see the revised design of the queues, given that error handling is not suitable at that level? (After eliminating that, I think the queue design will be very simple and straightforward.)

faizan2700 commented 4 years ago

@Gallaecio Could you please give me your review? (I am really sorry for being late.)

whalebot-helmsman commented 4 years ago

I think the scheduler (or another component close to the scheduler) should handle this, because if an object is not processed successfully, it will be the scheduler that adds it to the queue again.

Item extraction takes place in a spider's parse method, and the success of a request is also decided there; other modules don't hold the success criteria. In case of failure, it is possible to yield the same request a second time with a flag to bypass filtering.
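
As an illustration of that mechanism (spider name, URL and success criterion are made up), a spider can re-yield a failed request with dont_filter=True so the duplicate filter does not drop it:

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["https://example.com"]  # hypothetical

    def parse(self, response):
        # Success criteria live here, in the spider, not in the queue.
        if not response.css("div.product"):  # hypothetical criterion
            # Yield the same request again; dont_filter=True bypasses
            # the duplicate filter so it can be rescheduled.
            yield response.request.replace(dont_filter=True)
            return
        for product in response.css("div.product"):
            yield {"name": product.css("h2::text").get()}
```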

whalebot-helmsman commented 4 years ago

I also think that a Redis hash structure is better than a finish queue, and Bloom filters are better than both; that's why I mentioned first the finish queue, then Redis hashes, and then Bloom filters.

I think it is better to scope out Bloom filters. They are directly related neither to request queues nor to Redis (or other message queues).

faizan2700 commented 4 years ago

OK, the scheduler cannot be a good choice then. I think I will study the Scrapy codebase more to understand where we should place the finish queue (or Redis hashes).

faizan2700 commented 4 years ago

I think it is better to scope out Bloom filters. They are directly related neither to request queues nor to Redis (or other message queues).

Ok