anjackson opened 2 years ago
Ah, so the example I was looking at had a chain of over 300 `warc/revisit` entries before hitting the most recent copy of the GOV.UK robots.txt file. This is over the hard-coded limit of 100, which is why it didn't resolve. But even after raising that limit, resolution is really slow.
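The failure mode above can be sketched as follows. This is not pywb's actual implementation, and the `refers_to_ts` field name is illustrative; the point is that each revisit record only points at an earlier capture, so resolving the payload means walking the chain one lookup at a time, which a hard-coded cap cuts short:

```python
CLOSEST_LIMIT = 100  # analogous to the hard-coded cap discussed above

def resolve_revisit(cdx_lookup, url, timestamp, limit=CLOSEST_LIMIT):
    """Follow revisit records back to the capture that holds the payload.

    cdx_lookup(url, timestamp) returns a dict with at least a 'mime'
    field and, for revisits, a (hypothetical) 'refers_to_ts' field
    giving the timestamp of the capture it refers to.
    """
    record = cdx_lookup(url, timestamp)
    for _ in range(limit):
        if record['mime'] != 'warc/revisit':
            return record  # found the actual payload
        # step back to the capture this revisit refers to
        record = cdx_lookup(url, record['refers_to_ts'])
    raise RuntimeError(f'revisit chain longer than {limit} entries')
```

A chain of 300 revisits needs 300 lookups, so any limit below the chain length fails outright, and even a generous limit still costs one index query per hop.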
Hm. Is this caused specifically by revisit lookups, by bouncing between 3xx redirects, or by a combination of both? The main optimization is probably to include the redirect target URL in the CDXJ, especially in the redirect case. If it is a chain of revisits that eventually ends in a 200, there is probably not much that can be done? Perhaps something to discuss in the context of reindexing as well?
It's the latter, I think. The only option would be to make `closest_limit` configurable so I can easily set it to some high value.
But it's not urgent, and arguably not needed in the playback service.
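If `closest_limit` were exposed in `config.yaml`, raising it would be a one-line change per deployment. Purely hypothetical: pywb does not read such an option as of this discussion, and the key name is illustrative:

```yaml
# Hypothetical config.yaml fragment; 'closest_limit' is NOT a real
# pywb option as of this discussion.
collections:
  my-coll: ./collections/my-coll

closest_limit: 1000   # raise the revisit-resolution cap well above 100
```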
Current status is that I'm filtering out all revisit records at playback time. This is sub-optimal, as you can't see when pages were seen unchanged, but it can't be fixed until this issue is resolved.
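The filtering described above can be sketched as a post-processing step over CDXJ query results. This is a sketch, not pywb's internals; it assumes standard `mime` and `status` fields in the CDXJ JSON block:

```python
import json

def filter_playback_records(cdxj_lines, skip_revisits=True, skip_redirects=True):
    """Drop revisit and/or 3xx-redirect entries from CDXJ query results.

    Each CDXJ line is 'SURT timestamp {json}', where the JSON block
    carries fields such as 'mime' and 'status'.
    """
    for line in cdxj_lines:
        surt, ts, blob = line.split(' ', 2)
        fields = json.loads(blob)
        if skip_revisits and fields.get('mime') == 'warc/revisit':
            continue  # hide captures whose payload is stored elsewhere
        if skip_redirects and str(fields.get('status', '')).startswith('3'):
            continue  # hide 3xx redirects
        yield line
```

The trade-off is exactly the one noted above: revisit entries record *when* a page was seen unchanged, so dropping them loses that information from the calendar view.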
Ideally, we would include revisit records at playback time, as they indicate when we visited a page even if the content did not change. As of PyWB 2.6.2, large chains of redirects still seem to cause problems, and it is not clear that `closest_limit` is working as expected. See https://github.com/webrecorder/pywb/pull/606. Not sure how to handle this; for now, I am skipping redirects from CDX queries.
The `redirect_to_exact` setting doesn't seem to be working now either.