internetarchive / warcprox

WARC writing MITM HTTP/S proxy

in-batch dedup #165

Closed galgeek closed 2 years ago

galgeek commented 2 years ago

Many duplicate captures in brozzler crawls occur because captures within the same dedup batch are not deduplicated against each other. This MR adds in-batch dedup for trough deduplication and increases warcprox's dedup batch window.
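For readers unfamiliar with the idea, here is a minimal sketch of in-batch dedup. It is not the code from this MR: `Capture` and `split_batch` are hypothetical names, and it assumes captures are compared by payload digest alone.

```python
from collections import namedtuple

# Hypothetical stand-in for a captured response; not warcprox's real type.
Capture = namedtuple("Capture", ["url", "digest"])


def split_batch(batch):
    """Partition a batch into first-seen captures and in-batch duplicates.

    Returns (unique, duplicates). `unique` keeps the first capture for each
    digest, in order. `duplicates` is a list of (capture, original) pairs,
    where `original` is the earlier capture in the same batch with the
    same digest.
    """
    first_seen = {}  # digest -> first capture in this batch with that digest
    unique, duplicates = [], []
    for capture in batch:
        original = first_seen.get(capture.digest)
        if original is None:
            first_seen[capture.digest] = capture
            unique.append(capture)
        else:
            duplicates.append((capture, original))
    return unique, duplicates
```

Only the `unique` captures would then need a lookup in the backing dedup store; the in-batch duplicates can be resolved against their earlier sibling.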

vbanos commented 2 years ago

I have a more general comment: with this MR, you embed the in-memory dedup logic in BatchTroughLoader. Wouldn't it be better to make a mixin, like DedupableMixin, so that it can be used in more places and the code/logic separation is clearer? Thanks.
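A rough sketch of what such a mixin might look like, to make the suggestion concrete. The names `InBatchDedupMixin`, `start_batch`, `first_in_batch` and `ExampleBatchLoader` are all hypothetical and do not reflect warcprox's actual classes or the existing DedupableMixin interface.

```python
class InBatchDedupMixin:
    """Hypothetical mixin that remembers keys seen in the current batch."""

    def start_batch(self):
        # Reset the per-batch memory; call this whenever a new batch begins.
        self._seen_in_batch = {}

    def first_in_batch(self, key, item):
        """Record item under key if key is new.

        Returns the earlier item stored for key, or None if key was new.
        """
        earlier = self._seen_in_batch.get(key)
        if earlier is None:
            self._seen_in_batch[key] = item
        return earlier


class ExampleBatchLoader(InBatchDedupMixin):
    """Hypothetical loader that reuses the mixin; not BatchTroughLoader."""

    def __init__(self):
        self.start_batch()
        self.to_lookup = []   # items that still need a backing-store lookup
        self.duplicates = []  # (item, earlier_item) pairs found in-batch

    def add(self, digest, url):
        earlier = self.first_in_batch(digest, url)
        if earlier is None:
            self.to_lookup.append(url)
        else:
            self.duplicates.append((url, earlier))
```

Keeping the per-batch bookkeeping in a mixin would let other batch loaders opt in without copying the logic, at the cost of one more class to maintain.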

galgeek commented 2 years ago

Thank you @vbanos — your comments have been very helpful!

I think we're pretty eager to fix this issue for BatchTroughLoader for now, and to make it more general down the road.