q-m / scrapy-webarchive

A plugin for Scrapy that allows users to capture and export web archives in the WARC and WACZ formats during crawling.
http://developers.thequestionmark.org/scrapy-webarchive/

Redirect URLs not resolved correctly #19

Closed: leewesleyv closed this issue 2 days ago

leewesleyv commented 1 week ago

When a request is redirected to a new URL, the downloader middleware cannot resolve the redirect and will always return a 404 status code. An example of this:

from scrapy import Spider

class ExampleSpider(Spider):
    name = "example"
    start_urls = ["https://www.example.com/"]

If the start URL redirects to https://www.example.com/other_page/ and you run the example spider against a WACZ previously generated with the WACZ extension, the request will return a 404 (because https://www.example.com/ is not in the CDXJ index) and the spider will stop scraping.
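
For illustration, the CDXJ index of such an archive only contains the redirect target, so the lookup for the start URL finds nothing. A hypothetical pywb-style CDXJ entry (field values made up):

com,example)/other_page 20240101000000 {"url": "https://www.example.com/other_page/", "status": "200", "mime": "text/html", "filename": "example.warc.gz", "offset": "1024", "length": "2048"}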

leewesleyv commented 1 week ago

I've added a proposal in #20 (WIP). To summarize the issues with the current situation:

  1. The start URLs contain https://www.example.com/; this is where the scraping process starts.
  2. The request goes through the downloader middleware, and a new Request object is returned for the redirect (see the sketch after this list; https://docs.scrapy.org/en/2.11/topics/downloader-middleware.html#scrapy.downloadermiddlewares.DownloaderMiddleware.process_response).
  3. The WaczExporter's response_received method is not called for the first request/response, and therefore the redirect request/response is not written to the WARC.
  4. The redirected request is scheduled and yielded, eventually returning a 200 and ending up in the WaczExporter's response_received, where the successful request/response is written to the WARC.
  5. At the end of the crawl, the WACZ is created and exported.
  6. Starting a new crawl based on the WACZ URI with the WaczMiddleware will start with a lookup of https://www.example.com/ in the CDXJ index. This URL is not present in the index, so a 404 is returned.
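
For reference on step 2, a very condensed sketch of what Scrapy's built-in RedirectMiddleware does in process_response (the real implementation in scrapy.downloadermiddlewares.redirect also handles meta refresh, the dont_redirect meta key, REDIRECT_MAX_TIMES, and request-method handling):

from urllib.parse import urljoin


class CondensedRedirectSketch:
    """Condensed sketch of scrapy.downloadermiddlewares.redirect.RedirectMiddleware."""

    REDIRECT_CODES = (301, 302, 303, 307, 308)

    def process_response(self, request, response, spider):
        if response.status in self.REDIRECT_CODES and b"Location" in response.headers:
            location = response.headers[b"Location"].decode("latin1")
            # A new Request is returned instead of the 3xx Response, so the
            # response_received signal never fires for the redirect itself.
            return request.replace(url=urljoin(request.url, location))
        return response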

The main problem is that when using the WaczMiddleware downloader middleware, the requests are generated through the spider, but the crawl process does not have access to the live resource and does not know that a redirect exists for this URL. We also do not want to rely on the live data to be able to crawl the archive; that would defeat the purpose of crawling against the archive. This means that we not only need the 200 responses in the archive, but also all other responses (or only the redirects, e.g. 307, 308?).
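
As a rough sketch (not the actual proposal in #20) of what resolving the redirect purely from the archive could look like in the downloader middleware, assuming redirect responses are stored and indexed; lookup_record and the record object used here are hypothetical placeholders for the plugin's real CDXJ/WARC access:

from scrapy.http import Response


class ArchiveRedirectResolver:
    """Hypothetical sketch: resolve redirects using only what is in the archive."""

    MAX_HOPS = 10  # guard against redirect loops inside the archive

    def __init__(self, lookup_record):
        # lookup_record(url) -> archived record with .status, .location and
        # .to_scrapy_response(request), or None; a stand-in for the real index code.
        self.lookup_record = lookup_record

    def process_request(self, request, spider):
        url = request.url
        for _ in range(self.MAX_HOPS):
            record = self.lookup_record(url)
            if record is None:
                # Mirrors the current behaviour described above: not in the index -> 404.
                return Response(url=url, status=404)
            if 300 <= record.status < 400 and record.location:
                # Follow the archived redirect instead of asking the live site.
                url = record.location
                continue
            return record.to_scrapy_response(request)
        return Response(url=request.url, status=404)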

Questions

wvengen commented 4 days ago

> What are some things in the redirect middleware that we need to take into account while implementing this in our archive extension/downloader middleware/spider middleware?

> What happens when you change the order of the middlewares, so that this one comes before the redirect middleware?

> Should we/do we need to write any other status codes to the archive that we currently do not write yet?

Hmmm, good thing to think about. At first glance I would think that, ideally, all network interactions would (be able to) end up in the archive, including redirects, and perhaps even middlewares doing things like login or CDN evasion. But there are also cases where it is more convenient to have just the 'initial' requests and 'final' responses. When you configure e.g. the retry middleware, you already configure response codes. So it would also make sense to say: retry is handled there, we don't need to register the failed responses.
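
For context, that kind of configuration lives in the standard Scrapy settings; the values below are Scrapy's defaults, shown purely for illustration and not specific to this plugin:

# settings.py -- standard Scrapy settings (defaults, for illustration only)

# The retry middleware already defines which response codes get retried ...
RETRY_ENABLED = True
RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]

# ... and the redirect middleware defines whether and how far 3xx responses are followed.
REDIRECT_ENABLED = True
REDIRECT_MAX_TIMES = 20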

To make the spider work when crawling from the archive, just as it does online, it needs to find the response corresponding to the request, also when it was redirected. I think storing the redirects is actually the most straightforward approach. There could be alternatives, like extension-specific fields or so, but I'd rather stick to what is standard.
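
As an illustration of sticking to what is standard: an archived redirect can simply be a regular WARC response record whose HTTP payload is the 3xx answer. A minimal sketch using the warcio library (which may or may not be the writer this plugin uses internally), with made-up URLs:

from io import BytesIO

from warcio.statusandheaders import StatusAndHeaders
from warcio.warcwriter import WARCWriter

# Sketch: write the redirect as an ordinary WARC "response" record, so any
# standard indexer/replay tool (and the CDXJ index) can pick it up.
with open("example.warc.gz", "wb") as fh:
    writer = WARCWriter(fh, gzip=True)
    http_headers = StatusAndHeaders(
        "301 Moved Permanently",
        [("Location", "https://www.example.com/other_page/"), ("Content-Length", "0")],
        protocol="HTTP/1.1",
    )
    record = writer.create_warc_record(
        "https://www.example.com/",
        "response",
        payload=BytesIO(b""),
        http_headers=http_headers,
    )
    writer.write_record(record)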

If a request fails, we might want to store the failure too - but that would be a different issue. We don't need to address that here, I think.

> How do we prevent duplicate requests/responses from being written to the archive?

For WACZs generated elsewhere, there can be duplicate requests/responses. When iterating over all responses, you will get all of them; when using the index, one may want to use the last entry found (there will probably be multiple entries in the index for the same URL).
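
A tiny sketch of that index-side choice, assuming the CDXJ index has already been parsed into (urlkey, entry) pairs in file order (the names are illustrative, not the plugin's API):

def latest_entries(cdxj_entries):
    """Keep only the last CDXJ entry seen for each URL key.

    cdxj_entries: iterable of (urlkey, entry) pairs in file order, where
    entry is the parsed JSON part of a CDXJ line.
    """
    latest = {}
    for urlkey, entry in cdxj_entries:
        latest[urlkey] = entry  # later entries overwrite earlier duplicates
    return latest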

wvengen commented 1 day ago

Super!