internetarchive / warcprox

WARC writing MITM HTTP/S proxy

Dedup only urls with Warcprox-Meta and warc-prefix #88

Closed vbanos closed 6 years ago

vbanos commented 6 years ago

Warcprox gives special handling to requests whose Warcprox-Meta HTTP header contains warc-prefix: it writes them to distinct WARC files. In production, we care only about the content in these WARCs and discard the other WARCs, which use the default WARCPROX-* filename.

Thus, I suggest deduplicating only responses whose requests carry Warcprox-Meta with warc-prefix.

A related idea is to add an extra flag to enable this behavior (e.g. --dedup-only-prefixed). I'm not sure whether this makes sense.
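To make the proposal concrete, here is a rough sketch of the check it implies, assuming the Warcprox-Meta header value is a JSON object and that the request headers are available as a plain dict. The helper names `make_warcprox_meta` and `has_warc_prefix` are hypothetical and are not part of warcprox.

```python
import json


def make_warcprox_meta(warc_prefix):
    """Build a JSON value for a Warcprox-Meta request header (hypothetical helper)."""
    return json.dumps({"warc-prefix": warc_prefix})


def has_warc_prefix(headers):
    """Return True if the headers carry Warcprox-Meta with a non-empty warc-prefix.

    `headers` is assumed to be a plain dict keyed by the exact header name;
    a real proxy would need case-insensitive header lookup.
    """
    raw = headers.get("Warcprox-Meta")
    if not raw:
        return False
    try:
        meta = json.loads(raw)
    except ValueError:
        # Malformed JSON: treat the request as not prefixed.
        return False
    return isinstance(meta, dict) and bool(meta.get("warc-prefix"))
```

With the suggested flag enabled, only requests for which `has_warc_prefix` returns True would be candidates for dedup; everything else would be written without a dedup lookup.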

vbanos commented 6 years ago

Well, I know that the unit tests would fail. Please let me know if this idea makes sense, and I'll work on this further to fix the unit tests. Thank you!

nlevitt commented 6 years ago

I don't think we should do it this way. There's a Warcprox-Meta parameter called captures-bucket. It would be better to have an option that disables dedup unless captures-bucket is set in Warcprox-Meta.
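A minimal sketch of that decision, for illustration only: the function name `should_dedup` and the option name `dedup_only_with_bucket` are assumptions, not warcprox's actual code or flag, and the header value is assumed to be a JSON object.

```python
import json


def should_dedup(warcprox_meta_header, dedup_only_with_bucket=False):
    """Decide whether a response is a candidate for dedup.

    warcprox_meta_header: raw Warcprox-Meta header value, or None if absent.
    dedup_only_with_bucket: hypothetical option; when False, dedup applies to
    every response, as it does without the option.
    """
    if not dedup_only_with_bucket:
        return True
    if not warcprox_meta_header:
        return False
    try:
        meta = json.loads(warcprox_meta_header)
    except ValueError:
        # Unparseable header: no bucket can be read, so skip dedup.
        return False
    return isinstance(meta, dict) and bool(meta.get("captures-bucket"))
```

Keying the decision on captures-bucket rather than warc-prefix keeps the dedup behavior tied to the parameter that already scopes dedup, instead of to the WARC filename.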

nlevitt commented 6 years ago

Also, feel free in your PR to rename that parameter to dedup-bucket, which makes more sense now.

vbanos commented 6 years ago

OK, I understand what I need to do. This task is related to https://github.com/internetarchive/warcprox/pull/86. After it's done, I will proceed.

vbanos commented 6 years ago

I have implemented this in https://github.com/internetarchive/warcprox/pull/90 and I'm closing this PR.