q-m / scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
http://developers.thequestionmark.org/scrapy-webarchive/

Support fetching live resources in downloader middleware #9

Open leewesleyv opened 1 month ago

leewesleyv commented 1 month ago

When using the downloader middleware and the request is not found in the archive, request the live resource instead. Add a setting (or something similar) that we can use to control this behaviour.
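A minimal sketch of what this could look like, independent of Scrapy itself. All names here (`ArchiveFallbackMiddleware`, the `SW_FALLBACK_TO_LIVE` setting, `NotInArchiveError`) are illustrative assumptions, not part of the plugin. The key idea maps onto Scrapy's downloader-middleware contract: `process_request` returning a response serves it from the archive, while returning `None` lets the request continue to the live site.

```python
class NotInArchiveError(Exception):
    """Raised when a resource is missing from the archive and live fetching is disabled."""


class ArchiveFallbackMiddleware:
    """Hypothetical sketch: serve from the archive, optionally fall back to live."""

    def __init__(self, archive, fallback_to_live):
        self.archive = archive                # mapping of URL -> archived body
        self.fallback_to_live = fallback_to_live

    @classmethod
    def from_settings(cls, archive, settings):
        # e.g. controlled by a boolean setting such as SW_FALLBACK_TO_LIVE
        # (setting name is an assumption for this sketch)
        return cls(archive, settings.get("SW_FALLBACK_TO_LIVE", False))

    def process_request(self, url):
        body = self.archive.get(url)
        if body is not None:
            return body                       # serve the archived response
        if self.fallback_to_live:
            return None                       # let the downloader fetch it live
        raise NotInArchiveError(url)


# Usage: with fallback enabled, an unknown URL passes through to the live site.
mw = ArchiveFallbackMiddleware({"https://example.com/": b"<html>archived</html>"},
                               fallback_to_live=True)
mw.process_request("https://example.com/")      # archived body
mw.process_request("https://example.com/new")   # None -> fetch live
```

In a real implementation the archive lookup would hit the WARC/WACZ index, and the returned value would be a `scrapy.http.Response` rather than raw bytes.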

wvengen commented 1 month ago

Great idea. I would say it is fine to leave this for after the package has been published.

When you want to crawl the resulting WACZ (containing new resources), you probably want to crawl it together with the other WACZ (containing older resources). And if the old WACZ was itself crawled as an 'update' to a previous one, you need to specify all of them when crawling.
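To make the chaining concrete, a Scrapy settings fragment could accept several archive sources at once, newest first. The setting name `SW_WACZ_SOURCE_URI` and the list form are assumptions for this sketch; the actual name and shape depend on how the plugin exposes its source configuration.

```python
# settings.py (illustrative only; setting name and list support are assumptions)
SW_WACZ_SOURCE_URI = [
    "s3://archives/crawl-update.wacz",    # newest resources, checked first
    "s3://archives/crawl-original.wacz",  # older resources the update was based on
]
```

A lookup would then try each archive in order and only fall through to the next (or to the live site) when a resource is missing.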

I think creating a WACZ manifest could help with this, so you only need to reference one file to re-crawl. Its specification is a work in progress, but a tool like replayweb.page already supports it afaik; see https://github.com/webrecorder/specs/issues/112 for the spec in progress.