internetarchive / warcprox

WARC writing MITM HTTP/S proxy

draft: skip duplicate revisits, per ait-job-id #184

Closed: galgeek closed this 1 year ago

ollie-iterators commented 1 year ago

This seems like an important PR to get merged.

anjackson commented 1 year ago

If anyone has a chance, as someone who uses warcprox, I'd be very grateful for any information about the problems you've found that this change resolves.

ollie-iterators commented 1 year ago

> If anyone has a chance, as someone who uses warcprox, I'd be very grateful for any information about the problems you've found that this change resolves.

I think this could help ensure that pages that don't need to be resaved are not resaved, so that the Internet Archive can move on to other pages whose saves may need updating.

galgeek commented 1 year ago

> If anyone has a chance, as someone who uses warcprox, I'd be very grateful for any information about the problems you've found that this change resolves.

The current warcprox code captures many revisit records for some sites and URLs, which adversely affects capture in some cases and replay in more.

note: this PR is likely to be replaced soon by a PR for warcprox/dedup.py
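
For readers unfamiliar with the idea in the PR title, here is a minimal sketch of one plausible reading of "skip duplicate revisits, per ait-job-id". It is not the code from this PR or from warcprox/dedup.py. The class `RevisitFilter`, its method `should_write`, and the choice of (job id, URL, payload digest) as the key are all hypothetical names and assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RevisitFilter:
    """Hypothetical helper: remembers which revisits were already written per job."""

    # Keys are (job_id, url, payload_digest) tuples. This key choice is an assumption.
    seen: set = field(default_factory=set)

    def should_write(self, job_id, url, payload_digest, is_revisit):
        """Return True if a record should be written, False if it can be skipped."""
        # Full response records are always written; only revisits are filtered.
        if not is_revisit:
            return True
        key = (job_id, url, payload_digest)
        if key in self.seen:
            # An identical revisit was already written for this job: skip it.
            return False
        self.seen.add(key)
        return True


flt = RevisitFilter()
first = flt.should_write("job-1", "http://example.com/", "sha1:abc", True)
second = flt.should_write("job-1", "http://example.com/", "sha1:abc", True)
other_job = flt.should_write("job-2", "http://example.com/", "sha1:abc", True)
```

Under these assumptions, the first revisit for a given job is kept and later identical revisits in the same job are skipped, while a different job id starts fresh. An in-memory set is used only to keep the sketch short; a real deployment would presumably need persistent, shared state.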