I've added a proposal in #20 (WIP). To summarize the issues with the current situation:
- A `Request` object is returned for the redirect (https://docs.scrapy.org/en/2.11/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_response).
- `WaczExporter`'s `response_received` method is not called for the first request/response, and therefore the redirect request/response are not written to the WARC.
- It is `WaczExporter`'s `response_received` where we write the successful request/response to the WARC.
- `WaczMiddleware` will start with a lookup of https://www.example.com/ in the CDXJ index. This URL is not present in the index, so a 404 is returned.

The main problem is that when using the `WaczMiddleware` downloader middleware, the requests are generated through the spider, but the crawl process does not have access to the live resource and does not know that a redirect exists for this URL. We also do not want to rely on the live data to be able to crawl the archive, since that would defeat the purpose of crawling against the archive. This means we do not only need the 200 responses in the archive, but also all other responses (or only the redirects, 307/308?).
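To make the failure mode concrete, here is a rough sketch of what an archive-backed downloader middleware does on lookup. The class name, `cdxj_index`, and `record_to_response` are illustrative placeholders, not the extension's actual code:

```python
from scrapy.http import Response


class ArchiveLookupMiddleware:
    """Illustrative sketch of an archive-backed downloader middleware (not the
    actual WaczMiddleware): responses come from a CDXJ-indexed WACZ instead of
    the live site."""

    def __init__(self, cdxj_index, record_to_response):
        # Assumed helpers: a dict-like index mapping URL -> archived record,
        # and a callable that turns a WARC record into a scrapy Response.
        self.cdxj_index = cdxj_index
        self.record_to_response = record_to_response

    def process_request(self, request, spider):
        record = self.cdxj_index.get(request.url)
        if record is None:
            # Only the final 200 for /other_page/ was written to the WARC, so
            # the start URL https://www.example.com/ misses the index and the
            # crawl dead-ends on a synthetic 404 instead of following a redirect.
            return Response(url=request.url, status=404, request=request)
        return self.record_to_response(record, request)
```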
Questions
- What are some things in the redirect middleware that we need to take into account while implementing this in our archive extension/downloader middleware/spider middleware?
- What happens when you change the order of the middlewares, so that this one runs before the redirect middleware (see the settings sketch after this list)?
- Should we, or do we need to, write any other status codes to the archive that we currently do not write yet?
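For the ordering question, middleware order in Scrapy is controlled by the priority numbers in `DOWNLOADER_MIDDLEWARES`; the built-in `RedirectMiddleware` sits at 600 in `DOWNLOADER_MIDDLEWARES_BASE`. The import path below is a placeholder:

```python
# settings.py sketch: lower numbers sit closer to the engine, so their
# process_request runs earlier and their process_response runs later than
# higher-numbered middlewares.
DOWNLOADER_MIDDLEWARES = {
    # Placeholder import path for the archive middleware discussed here.
    "myproject.middlewares.WaczMiddleware": 550,   # before RedirectMiddleware (600)
    # "myproject.middlewares.WaczMiddleware": 650, # after RedirectMiddleware instead
}
```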
Hmmm, good thing to think about. At first glance, I would think that ideally all network interactions would (be able to) end up in the archive, including redirects, and perhaps even things middlewares do, like login or CDN evasion. But there are also cases where it is more convenient to have just the 'initial' requests and 'final' responses. When you configure e.g. the retry middleware, you already configure response codes. So it would also make sense to say: retry is handled there, we don't need to register the failed responses.
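For reference, the retry middleware's response codes are configured through a setting like this (values shown are the defaults in recent Scrapy versions); by analogy, the archive extension could expose a similar setting for which status codes get written:

```python
# settings.py: Scrapy's RetryMiddleware already takes a list of response
# codes it should act on. A similar knob could decide which codes the
# archive records.
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]
RETRY_TIMES = 2  # retries per request, on top of the first attempt
```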
To make the spider work when crawling from the archive just as it does online, it needs to find the response corresponding to a request, even when that request was redirected. I think storing the redirects is actually the most straightforward approach. There could be alternatives, like extension-specific fields or so, but I'd rather stick to what is standard.
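As a sketch of what storing a redirect could look like, assuming warcio is used for writing WARC records (the helper below is illustrative, not the extension's code): the 30x response gets its own response record carrying the `Location` header, so a later lookup of the original URL in the index can succeed.

```python
from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter


def write_redirect_record(fileobj, url, location, status="301 Moved Permanently"):
    """Sketch: append a response record for a redirect to an open WARC file,
    so the original (pre-redirect) URL is present in the archive and index."""
    writer = WARCWriter(fileobj, gzip=True)
    http_headers = StatusAndHeaders(
        status,
        [("Location", location), ("Content-Length", "0")],
        protocol="HTTP/1.1",
    )
    record = writer.create_warc_record(
        url,
        "response",
        payload=BytesIO(b""),
        http_headers=http_headers,
    )
    writer.write_record(record)


# e.g. write_redirect_record(warc_file, "https://www.example.com/",
#                            "https://www.example.com/other_page/")
```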
If a request fails, we might want to store the failure too - but that would be a different issue. We don't need to address that here, I think.
How do we prevent duplicate requests/responses from being written to the archive?
For WACZs generated elsewhere, there can be duplicate requests/responses. When iterating over all responses, you will get all of them. When using the index, one may want to use the last one found (probably multiple entries in the index).
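A small sketch of the "use the last one" strategy, assuming plain CDXJ lines of the form `<url key> <timestamp> <json>` (the helper name is made up):

```python
import json


def latest_per_url(cdxj_lines):
    """Collapse duplicate CDXJ entries, keeping only the most recent capture
    per URL key. Timestamps are fixed-width 14-digit strings (YYYYMMDDhhmmss),
    so plain string comparison orders them correctly."""
    latest = {}
    for line in cdxj_lines:
        key, timestamp, payload = line.strip().split(" ", 2)
        if key not in latest or timestamp > latest[key][0]:
            latest[key] = (timestamp, json.loads(payload))
    return {key: record for key, (_, record) in latest.items()}
```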
Super!
When a request is redirected to a new URL, the downloader middleware cannot resolve the redirect and will always return a 404 status code. An example of this:
If the start URL redirects to https://www.example.com/other_page/ and you run the example spider with a previously generated WACZ (using the WACZ extension), the request will return a 404 (because https://www.example.com/ is not in the CDXJ index) and the crawl stops.
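For reference, a minimal spider along the lines of the example above, assuming the WACZ downloader middleware is enabled in the project settings and pointed at the previously generated archive (configuration details omitted):

```python
import scrapy


class ExampleSpider(scrapy.Spider):
    """Minimal spider for the scenario above: the start URL redirects to
    /other_page/ on the live site, so only the redirect target ends up in
    the CDXJ index when the WACZ is generated."""

    name = "example"
    start_urls = ["https://www.example.com/"]  # redirects to /other_page/ live

    def parse(self, response):
        # When replaying from the WACZ, this callback is never reached for the
        # start URL: the lookup of https://www.example.com/ misses the CDXJ
        # index and a synthetic 404 comes back instead of the redirect.
        yield {"url": response.url, "status": response.status}
```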