istresearch / scrapy-cluster

This Scrapy project uses Redis and Kafka to create a distributed on-demand scraping cluster.
http://scrapy-cluster.readthedocs.io/
MIT License

Two new middlewares #80

Closed madisonb closed 8 years ago

madisonb commented 8 years ago

This PR addresses the issue raised in #78: adding new spiders is difficult when each one must manually pass through meta fields and increment the stats collectors.

Meta Passthrough Middleware - passes response meta fields to new requests generated by the spider. Instead of manually copying every meta field, the spider relies on the middleware to pass through all meta fields that have not already been set on the new request. The spider can still set meta fields or override things in the Request without worrying about the middleware stomping on them.
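For readers unfamiliar with Scrapy's spider middleware hooks, a minimal sketch of this kind of pass-through might look like the following (a simplified illustration, not the PR's actual code; the class name is hypothetical):

```python
from scrapy import Request


class MetaPassthroughMiddleware(object):
    """Copies response meta fields onto new requests yielded by the
    spider, without overwriting keys the spider set itself."""

    def process_spider_output(self, response, result, spider):
        for item in result:
            if isinstance(item, Request):
                for key, value in response.meta.items():
                    # setdefault leaves spider-set values untouched,
                    # so the middleware never stomps on explicit meta
                    item.meta.setdefault(key, value)
            yield item
```

The key design point is `setdefault`: the middleware only fills in meta keys the spider did not set, which is what lets spiders override values safely.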

Redis Stats Middleware - moves stats collection out of the spider and into the spider middleware stack. It analyzes responses as they come in and updates the stats collection within Redis. This removes another unnecessary step when processing responses in your spider.
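As a rough sketch of the shape of such a middleware (simplified and hypothetical: the key layout below is invented for illustration, and the real PR hooks into the project's existing stats collectors rather than bare `incr` calls):

```python
import redis


class RedisStatsMiddleware(object):
    """Increments per-spider response counters in Redis as responses
    flow through the spider middleware stack."""

    def __init__(self, redis_conn):
        self.redis_conn = redis_conn

    @classmethod
    def from_crawler(cls, crawler):
        # REDIS_HOST / REDIS_PORT mirror the project's settings names
        conn = redis.Redis(
            host=crawler.settings.get('REDIS_HOST', 'localhost'),
            port=crawler.settings.getint('REDIS_PORT', 6379))
        return cls(conn)

    def process_spider_input(self, response, spider):
        # hypothetical key layout, keyed by spider and status code
        key = 'stats:crawler:{s}:{c}'.format(s=spider.name,
                                             c=response.status)
        self.redis_conn.incr(key)
        return None
```

Because `process_spider_input` sees every response before the spider does, the spider callback no longer needs any stats bookkeeping of its own.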

Both of these reduce spider complexity and move logic into a more portable area. I added tests and documentation as well.

coveralls commented 8 years ago


Coverage decreased (-0.09%) to 65.202% when pulling bfde8ec7d053515ab9a7cd59362514090857fb24 on middlewares-78 into 3a08b4d86b32a34764ff56fa4522f60fec334b80 on dev.