Currently the task of storing request info to hubstorage is performed by sh_scrapy.extension.HubstorageMiddleware.
As a spider middleware, it catches only responses that are being passed to a spider.
Thus it misses responses that are:
- consumed by a downloader middleware (e.g. RobotsTxtMiddleware, RetryMiddleware, MetaRefreshMiddleware, or RedirectMiddleware);
- consumed by an item pipeline (e.g. ImagesPipeline);
- otherwise not passed through the spider middlewares.
Is there any specific reason that we need to exclude these responses?
It's possible to use a signal handler for scrapy.signals.response_downloaded to gather more requests. Even with this approach, we may still need a spider middleware for setting the "_hsparent" field.
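To make the signal-handler idea concrete, here is a minimal sketch of what such an extension might look like. The class name ResponseDownloadedRecorder, the write callable, and the record keys are all hypothetical stand-ins, not the actual sh_scrapy or hubstorage API. The sketch also does not verify which of the cases listed above the signal actually covers, and it leaves the "_hsparent" handling to a spider middleware.

```python
class ResponseDownloadedRecorder:
    """Hypothetical extension that records each response reported by
    scrapy.signals.response_downloaded, instead of relying only on a
    spider middleware."""

    def __init__(self, write):
        # `write` is an assumed callable that stores one request record;
        # in practice it would wrap the hubstorage writer.
        self.write = write

    @classmethod
    def from_crawler(cls, crawler):
        # Imported here so the handler itself can be tried without Scrapy.
        from scrapy import signals

        ext = cls(write=print)  # placeholder sink, for illustration only
        crawler.signals.connect(
            ext.response_downloaded, signal=signals.response_downloaded
        )
        return ext

    def response_downloaded(self, response, request, spider):
        # The keys below are illustrative, not the real hubstorage schema.
        self.write(
            {
                "url": response.url,
                "status": response.status,
                "method": request.method,
                "size": len(response.body),
            }
        )
```

An extension like this would be enabled through the EXTENSIONS setting. Whether it should replace the request logging in HubstorageMiddleware or only complement it is part of the question raised here.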