Long chains of revisit records still causing problems #73

ukwa / ukwa-pywb

GNU General Public License v3.0

Open anjackson opened 2 years ago

anjackson commented 2 years ago

Ideally, we would include revisit records at playback time, as they indicate when we visited a page even if the content did not change. As of PyWB 2.6.2, large chains of redirects still seem to cause problems, and it is not clear that the closest_limit is working as expected. See https://github.com/webrecorder/pywb/pull/606

Not sure how to handle this. For now, skipping redirects from CDX queries.

The redirect_to_exact setting doesn't seem to be working now either.

anjackson commented 2 years ago

Ah, so the example I was looking at had a chain of over 300 warc/revisit entries before hitting the most recent copy of the GOV.UK robots.txt file. This is over the hard-coded 100, so this is why it didn't resolve. But even upping that, it's really slow.
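
To illustrate the failure mode described above, here is a minimal sketch of a bounded lookup. It is a simplified model and not pywb's actual code: the function `resolve_payload`, the helper `build_chain`, and the record layout are hypothetical. It only assumes that each index record carries a `mime` field and that revisits are marked `warc/revisit`.

```python
from typing import Iterable, Optional

REVISIT_MIME = "warc/revisit"


def resolve_payload(records: Iterable[dict], limit: int = 100) -> Optional[dict]:
    """Return the first non-revisit record, examining at most `limit` records.

    `records` is assumed to be ordered by closeness to the requested
    timestamp. This is a hypothetical model of a bounded lookup, not
    the pywb implementation.
    """
    for examined, record in enumerate(records):
        if examined >= limit:
            # Gave up before reaching a record that holds a payload.
            return None
        if record.get("mime") != REVISIT_MIME:
            return record
    return None


def build_chain(n_revisits: int) -> list:
    """Build `n_revisits` revisit records followed by one real capture."""
    chain = [{"index": i, "mime": REVISIT_MIME} for i in range(n_revisits)]
    chain.append({"index": n_revisits, "mime": "text/plain"})
    return chain


chain = build_chain(300)
unresolved = resolve_payload(chain, limit=100)  # limit too low for this chain
resolved = resolve_payload(chain, limit=400)    # limit high enough to reach the payload
```

With 300 revisits ahead of the real capture, a limit of 100 gives up before reaching it, while a higher limit succeeds at the cost of scanning the whole chain, which is consistent with the slowness noted above.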

ikreymer commented 2 years ago

Hm. This is caused specifically by revisit lookups or bouncing between 3xx redirects, or a combination of both? Probably the main optimization is just to include the redirect URL in the CDXJ, especially in case of redirects. If it is a chain of revisits that ends up just being a 200, then probably not much that can be done? Perhaps something to discuss also in the context of reindexing?

anjackson commented 2 years ago

It's the latter, I think. The only option would be to make closest_limit configurable so I can easily set it to some high value.

But it's not urgent, and arguably not needed in the playback service.

anjackson commented 1 year ago

Current status is that I'm filtering out all revisit records at playback time. This is sub-optimal, as you can't see when pages were seen unchanged, but it can't be fixed until this issue is resolved.
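
For reference, a rough sketch of what filtering revisits out of CDXJ results could look like. This is not the ukwa-pywb configuration: the function `drop_revisits` is hypothetical, and it assumes each line has the form `<urlkey> <timestamp> <json>`, with revisits marked by `"mime": "warc/revisit"` in the JSON block.

```python
import json
from typing import Iterable, Iterator


def drop_revisits(cdxj_lines: Iterable[str]) -> Iterator[str]:
    """Yield CDXJ lines whose JSON block is not marked as a revisit.

    Hypothetical helper: assumes "<urlkey> <timestamp> <json>" lines.
    """
    for line in cdxj_lines:
        line = line.strip()
        if not line:
            continue
        # Split into at most three parts so spaces inside the JSON survive.
        parts = line.split(" ", 2)
        if len(parts) < 3:
            # Not in the assumed shape; pass it through untouched.
            yield line
            continue
        fields = json.loads(parts[2])
        if fields.get("mime") == "warc/revisit":
            continue
        yield line


sample = [
    'uk,gov)/robots.txt 20200101000000 {"url": "https://www.gov.uk/robots.txt", "mime": "text/plain", "status": "200"}',
    'uk,gov)/robots.txt 20200102000000 {"url": "https://www.gov.uk/robots.txt", "mime": "warc/revisit", "status": "200"}',
    'uk,gov)/robots.txt 20200103000000 {"url": "https://www.gov.uk/robots.txt", "mime": "warc/revisit", "status": "200"}',
]
kept = list(drop_revisits(sample))
```

Only the first capture survives the filter, which shows the trade-off: the two later visits, where the content was unchanged, disappear from the results.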