q-m / scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
http://developers.thequestionmark.org/scrapy-webarchive/

Support fetching live resources in downloader middleware #9

Open leewesleyv opened 1 month ago

leewesleyv commented 1 month ago

When using the downloader middleware and the request is not found in the archive, request the live resource instead. Add a setting (or something similar) that we can use to control this behaviour.
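A minimal sketch of what this could look like, independent of Scrapy itself. All names here (`ArchiveFallbackMiddleware`, the `SW_FALLBACK_TO_LIVE` setting, `NotInArchiveError`) are illustrative assumptions, not part of the plugin. The key idea maps onto Scrapy's downloader-middleware contract: `process_request` returning a response serves it from the archive, while returning `None` lets the request continue to the live site.

```python
class NotInArchiveError(Exception):
    """Raised when a resource is missing from the archive and live fetching is disabled."""


class ArchiveFallbackMiddleware:
    """Hypothetical sketch: serve from the archive, optionally fall back to live."""

    def __init__(self, archive, fallback_to_live):
        self.archive = archive                # mapping of URL -> archived body
        self.fallback_to_live = fallback_to_live

    @classmethod
    def from_settings(cls, archive, settings):
        # e.g. controlled by a boolean setting such as SW_FALLBACK_TO_LIVE
        # (setting name is an assumption for this sketch)
        return cls(archive, settings.get("SW_FALLBACK_TO_LIVE", False))

    def process_request(self, url):
        body = self.archive.get(url)
        if body is not None:
            return body                       # serve the archived response
        if self.fallback_to_live:
            return None                       # let the downloader fetch it live
        raise NotInArchiveError(url)


# Usage: with fallback enabled, an unknown URL passes through to the live site.
mw = ArchiveFallbackMiddleware({"https://example.com/": b"<html>archived</html>"},
                               fallback_to_live=True)
mw.process_request("https://example.com/")      # archived body
mw.process_request("https://example.com/new")   # None -> fetch live
```

In a real implementation the archive lookup would hit the WARC/WACZ index, and the returned value would be a `scrapy.http.Response` rather than raw bytes.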

wvengen commented 1 month ago

Great idea. I would say it is fine to leave this for after the package has been published.

When you want to crawl the resulting WACZ (containing new resources), you probably want to crawl it together with the other WACZ (containing older resources). And if the old WACZ was itself crawled as an 'update' to a previous one, you need to specify all of them when crawling.
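To make the chaining concrete, a Scrapy settings fragment could accept several archive sources at once, newest first. The setting name `SW_WACZ_SOURCE_URI` and the list form are assumptions for this sketch; the actual name and shape depend on how the plugin exposes its source configuration.

```python
# settings.py (illustrative only; setting name and list support are assumptions)
SW_WACZ_SOURCE_URI = [
    "s3://archives/crawl-update.wacz",    # newest resources, checked first
    "s3://archives/crawl-original.wacz",  # older resources the update was based on
]
```

A lookup would then try each archive in order and only fall through to the next (or to the live site) when a resource is missing.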

I think creating a WACZ manifest could help with this, so you only need to reference one file to re-crawl. Its specification is a work in progress, but a tool like replayweb.page already supports it afaik; see https://github.com/webrecorder/specs/issues/112 for the spec in progress.