istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License
1.18k stars 324 forks source link

Improve request serialization handling in scheduler #132

Closed gas1121 closed 7 years ago

gas1121 commented 7 years ago

Use scrapy's request_to_dict,request_from_dict to serialize requests so we can handle post request with parameters in request body.

coveralls commented 7 years ago

Coverage Status

Coverage increased (+0.1%) to 65.908% when pulling b6824bc9e86d18ebfee56be573944d9d76233e5b on gas1121:scheduler-fix into d42bcb53905fc14c6c9a68392befa78dfd4c66ba on istresearch:dev.

madisonb commented 7 years ago

Can you elaborate on your use case for this? There originally was a reason we used a custom request_to_dict() set of methods. In comparing this with the changes above I don't see anything out of the ordinary at this point. I would like to add a new unit test to show that priority crawl submission is still respected, and may have to do some minor testing to make sure everything is still working correctly.

gas1121 commented 7 years ago

I did this change because when I use scrapy.FormRequest.from_response to generate a post request, sometimes it will contain parameters in request's body, which will be lost when being deserialized from redis.

madisonb commented 7 years ago

Ok, do you think that this supports all styles of Scrapy requests? For example I know there have been issues in the past with Splash requests which has been a long sought after feature, but prior discussion like on https://github.com/istresearch/scrapy-cluster/issues/108 moved towards continuing to customize the encoding/decoding rather than using the default scrapy setup.

gas1121 commented 7 years ago

As request_from_dict, request_from_dict will handle meta data in request and load right class when its type is not scrapy.http.Request, these function should work well with SplashRequest. Also I did some tests to scrapy-splash with a simple scrapy project and SplashRequest works fine with request_to_dict and request_from_dict, so I think it should also works fine in scrapy-cluster.

madisonb commented 7 years ago

Cool, I will have to compare it with the work I did over on this commit https://github.com/istresearch/scrapy-cluster/commit/60e1b9d201df94a9b3f2e06a5b870b6918c58c75 but I bet the combination of these two items will get full support for splash within scrapy cluster.

madisonb commented 7 years ago

I can't find anything wrong with the updates for this PR. All tests pass and this seems like an improved implementation of the cloned Scrapy function request_to_dict. Merged.