Closed vbanos closed 6 years ago
Well, I know that the unit tests would fail... please let me know if this idea makes sense and I'll work more on this to address unit test issues. Thank you!
I don't think we should do it this way. There's a warcprox-meta parameter called captures-bucket. Better would be to have an option that disables dedup unless captures-bucket is set in warcprox-meta.
Also feel free in your PR to rename that parameter to dedup-bucket, which makes more sense now.
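A minimal sketch of the rule suggested above, assuming the parsed warcprox-meta is available as a plain dict. The function names and the require_bucket option are hypothetical and are not taken from the warcprox code base. Accepting both dedup-bucket and the older captures-bucket is also an assumption, made only to illustrate the rename.

```python
def get_dedup_bucket(warcprox_meta):
    """Return the bucket named in warcprox-meta, or None if none is set.

    Assumption: warcprox_meta is the already-parsed header as a dict
    (or None when the request carried no Warcprox-Meta header).
    """
    if not warcprox_meta:
        return None
    # Prefer the proposed new name, fall back to the existing one.
    return warcprox_meta.get("dedup-bucket") or warcprox_meta.get("captures-bucket")


def dedup_enabled(warcprox_meta, require_bucket):
    """Decide whether a response should go through deduplication.

    require_bucket is a hypothetical option: when it is True, dedup is
    skipped unless the request names a bucket in warcprox-meta.
    """
    if not require_bucket:
        return True
    return get_dedup_bucket(warcprox_meta) is not None
```

With require_bucket turned off the behavior stays as it is today (everything is deduplicated), so the option would be opt-in.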
OK, I understand what I need to do. This task is related to https://github.com/internetarchive/warcprox/pull/86. After it's done, I will proceed.
I have implemented this in https://github.com/internetarchive/warcprox/pull/90 and I'm closing this PR.
Warcprox handles requests with the HTTP header Warcprox-Meta containing warc-prefix in a special way, creating distinct WARC files for them. In production, we care only about the content in these WARCs and discard other WARCs that use the default WARCPROX-* filename.
Thus, I suggest deduplicating only responses which have Warcprox-Meta and warc-prefix. A related idea is to have an extra flag to enable this behavior (e.g. --dedup-only-prefixed). I'm not sure if this makes sense.
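To make the proposal concrete, here is a rough sketch of the check, assuming the Warcprox-Meta header value is a JSON object and that request headers are available as a dict. The helper names are hypothetical, and dedup_only_prefixed simply stands in for the suggested --dedup-only-prefixed flag. None of this is existing warcprox code.

```python
import json


def build_warcprox_meta_header(warc_prefix):
    """Build request headers asking for a distinct WARC prefix."""
    return {"Warcprox-Meta": json.dumps({"warc-prefix": warc_prefix})}


def has_warc_prefix(headers):
    """Return True if the request carries Warcprox-Meta with a warc-prefix."""
    raw = headers.get("Warcprox-Meta")
    if raw is None:
        return False
    try:
        meta = json.loads(raw)
    except ValueError:
        # A malformed header is treated like a missing one in this sketch.
        return False
    return isinstance(meta, dict) and bool(meta.get("warc-prefix"))


def should_dedup(headers, dedup_only_prefixed):
    """Apply the proposed rule: with the flag on, dedup only prefixed requests."""
    if not dedup_only_prefixed:
        return True
    return has_warc_prefix(headers)
```

For example, should_dedup(build_warcprox_meta_header("my-crawl"), True) is True, while a request without the header is skipped when the flag is on.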