internetarchive / warcprox

WARC writing MITM HTTP/S proxy

New --blackout-period option to skip writing redundant revisits to WARC #99

Closed · vbanos closed this 6 years ago

vbanos commented 6 years ago

Add option --blackout-period (default=0)

When set, and if the record is a duplicate (revisit record), check the datetime of dedup_info; if it's inside the blackout period, skip writing the record to the WARC.
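
Roughly, the check works like this (a simplified sketch; the exact warcprox field names and internals may differ, and dedup_info['date'] is assumed to be a WARC-style UTC timestamp):

```python
import datetime

def in_blackout(recorded_url, blackout_period):
    """Return True if this revisit's original capture is recent enough
    that the record should be skipped entirely."""
    dedup_info = getattr(recorded_url, 'dedup_info', None)
    if not blackout_period or not dedup_info or not dedup_info.get('date'):
        return False
    date = dedup_info['date']
    if isinstance(date, bytes):
        date = date.decode('ascii')
    # parse a timestamp like '2016-01-01T12:00:00Z'
    orig_capture = datetime.datetime.strptime(date, '%Y-%m-%dT%H:%M:%SZ')
    age = datetime.datetime.utcnow() - orig_capture
    return age <= datetime.timedelta(seconds=blackout_period)
```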

Add some unit tests.

This is an improved implementation based on @nlevitt comments here: https://github.com/internetarchive/warcprox/pull/92

nlevitt commented 6 years ago

Thanks for the updated PR @vbanos.

An important consideration occurs to me: we don't want the blackout to apply when the url is different. Can you add a check that the urls match? It doesn't affect you guys, because cdx dedup is not url-agnostic, but it matters for other dedup backends.
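
A sketch of the extra guard being asked for here (assuming the dedup backend stores the original capture url under dedup_info['url'], which not every backend does):

```python
def urls_match(recorded_url):
    """Only let the blackout apply when the revisit is for the same url
    as the original capture."""
    dedup_url = recorded_url.dedup_info.get('url')
    if isinstance(dedup_url, bytes):
        dedup_url = dedup_url.decode('utf-8')
    url = recorded_url.url
    if isinstance(url, bytes):
        url = url.decode('utf-8')
    return dedup_url == url
```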

Also, just to add some fear, uncertainty and doubt: what if the same url is captured twice with the same payload, but different headers? Wayback replays the response headers from the revisit record. I can't think of a realistic case where that would matter, but it might be worth seeing if anybody else can.

vbanos commented 6 years ago

NP, I updated my PR.

ato commented 6 years ago

> What if the same url is captured twice with the same payload, but different headers? I can't think of a realistic case where that would matter

Might be worth comparing the Location header and still writing a revisit record if it's different? That seems the most obvious one that might meaningfully change while the payload remains identical.
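
This suggestion wasn't adopted in the PR, but a hypothetical version of the check might look like the following (the header access and the 'location' field in dedup_info are illustrative names, not actual warcprox code):

```python
def location_changed(response_headers, dedup_info):
    """Return True if the redirect target differs from the original
    capture's, in which case a revisit record should still be written."""
    # 'location' is a hypothetical dedup_info field; no dedup backend
    # discussed in this thread actually stores it.
    return response_headers.get('Location') != dedup_info.get('location')
```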

nlevitt commented 6 years ago

Thanks @vbanos the pull request is looking good.

I'm a little nervous about this wrt differing http response headers and status codes. Dedup backends don't generally know about the status code or the headers, so checking for a match would mean storing additional fields in the dedup db, which I'm not at all keen to do.

We could maybe hard-code a rule to only dedup 200s. But is it worth doing that for this feature?

Or maybe this feature should be specific to the cdx dedup backend, since for that one we can check that the status code matches? (Even so, we should probably not dedup 3xx responses, even with a matching status code, because the redirect destination could differ.)
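
Neither rule made it into the PR as-is; as a sketch, the hard-coded version would amount to something like:

```python
def blackout_eligible(status):
    """Only allow blackout-skipping for plain 200 responses. 3xx is
    excluded because the redirect destination (Location header) could
    differ between captures even when the payload digest matches."""
    return status == 200
```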

vbanos commented 6 years ago

@nlevitt I'm sorry but I don't understand your comment. I must be missing something.

We don't deal at all with the dedup process. If recorded_url.dedup_info is there (I don't care how), that means it's a duplicate, and we compare the datetimes to decide whether it's inside the blackout period or not.

Why does this change how dedup works, and why would it mean storing additional fields in the dedup db? Also, what do you mean by "differing http response headers and status codes"? I don't understand where we need these extra fields.

About the idea that this feature should be specific to cdx dedup: I think I could make that work easily by checking Options for the CDX cli options and setting a flag ._using_cdx_dedup, then using that in the ._in_blackout method.
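
Something like this, roughly (the option name cdxserver_dedup is an assumption about how the CDX cli flag lands in Options, and the class layout is simplified):

```python
class WarcWriterThread:  # illustrative placement, not the real class layout
    def __init__(self, options):
        # treat the presence of the cdx dedup cli option as "cdx dedup in use"
        self._using_cdx_dedup = bool(getattr(options, 'cdxserver_dedup', None))
        self._blackout_period = getattr(options, 'blackout_period', 0)

    def _in_blackout(self, recorded_url):
        # only skip records when cdx dedup is in use, since only then do
        # we know the status code of the original capture matched
        if not self._using_cdx_dedup:
            return False
        # datetime comparison as in the sketch in the PR description above
        return in_blackout(recorded_url, self._blackout_period)
```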

Thank you!

nlevitt commented 6 years ago

Ok, to explain my concern about http status codes and response headers. As you know, deduplication decisions are made based solely on the payload digest (sha1). That means that if two captures have the same payload, a revisit record will be written, even if the response headers differ. This is no problem because the revisit record includes the response headers, and wayback plays back the headers from the revisit record (along with the payload from the referenced response record). But with this blackout feature, we don't record the headers. So if they differ in any significant way between the original capture and the blacked-out capture, it could screw up playback.

Now I'm thinking again that a problem is unlikely. After all, it's the same url. It seems highly unlikely that the same url would return the same payload but with headers that differ in a significant way.

I'll go ahead and merge.

vbanos commented 6 years ago

Thank you for the explanation!