scrapinghub / scrapinghub-entrypoint-scrapy

Scrapy entrypoint for Scrapinghub job runner
BSD 3-Clause "New" or "Revised" License

Possibility of storing more request info to hubstorage. #41

starrify closed this issue 7 years ago

starrify commented 7 years ago

Currently, request info is stored to hubstorage by sh_scrapy.extension.HubstorageMiddleware.
As a spider middleware, it sees only the responses that are passed to a spider.

Thus it misses responses that are:

  1. Consumed by a downloader middleware (e.g. RobotsTxtMiddleware, RetryMiddleware, MetaRefreshMiddleware, or RedirectMiddleware).
  2. Consumed by an item pipeline (e.g. ImagesPipeline).
  3. Otherwise not passed through the spider middlewares.

Is there a specific reason we need to exclude these responses?

It's possible to use a signal handler for scrapy.signals.response_downloaded to capture these additional requests. Even with this approach, we may still need a spider middleware to set the "_hsparent" field.
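
A rough sketch of the signal-handler idea, for discussion only. `RequestLogExtension` and its `write_request` callable are hypothetical names standing in for the real hubstorage writer, the record fields are illustrative, and reading "_hsparent" from `request.meta` assumes a spider middleware has already set it there:

```python
class RequestLogExtension:
    """Hypothetical extension that records every downloaded response."""

    def __init__(self, write_request):
        # write_request: any callable accepting a dict. It is a placeholder
        # for the actual hubstorage request writer (an assumption).
        self.write_request = write_request

    @classmethod
    def from_crawler(cls, crawler):
        # Imported here so the class can be exercised without Scrapy installed.
        from scrapy import signals

        ext = cls(write_request=print)  # placeholder sink for the sketch
        crawler.signals.connect(
            ext.response_downloaded, signal=signals.response_downloaded
        )
        return ext

    def response_downloaded(self, response, request, spider):
        # Called for each downloaded response, including ones that a
        # downloader middleware later consumes (redirects, robots.txt, ...).
        self.write_request({
            "url": response.url,
            "status": response.status,
            "method": request.method,
            "rs": len(response.body),  # response size in bytes
            # Assumed to be set by a spider middleware; None otherwise.
            "parent": request.meta.get("_hsparent"),
        })
```

Such an extension would be enabled through the EXTENSIONS setting. Requests issued outside a spider callback would have no "_hsparent" value, which is why the spider middleware may still be needed.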