Closed galgeek closed 2 years ago
I have a more general comment:
With this MR, you embed the in-memory dedup logic in BatchTroughLoader.
Wouldn't it be better to make a mixin like DedupableMixin, so that it can be used in more places and the code/logic separation is clearer? Thanks.
Thank you @vbanos — your comments have been very helpful!
I think we're pretty eager to fix this issue for BatchTroughLoader for now, and make it more general down the road.
Many duplicate captures in brozzler crawls happen because captures within the same dedup batch are not deduplicated against each other. This MR adds in-batch dedup for trough deduplication and increases warcprox's dedup batch window.
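For readers unfamiliar with the idea, here is a minimal sketch of in-batch dedup. It is not the code in this MR and does not use warcprox's or trough's actual API: the function name `split_batch_duplicates`, the dict-shaped records, and the `digest` key are all hypothetical, chosen only to illustrate the technique.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def split_batch_duplicates(
    batch: Iterable[T], key: Callable[[T], Hashable]
) -> Tuple[List[T], List[Tuple[T, T]]]:
    """Split a batch into first occurrences and in-batch duplicates.

    Hypothetical helper, not part of warcprox. Returns (unique, duplicates),
    where each duplicate is paired with the earlier item it repeats.
    """
    first_seen: Dict[Hashable, T] = {}
    unique: List[T] = []
    duplicates: List[Tuple[T, T]] = []
    for item in batch:
        k = key(item)
        if k in first_seen:
            # Same key already appeared earlier in this batch.
            duplicates.append((item, first_seen[k]))
        else:
            first_seen[k] = item
            unique.append(item)
    return unique, duplicates


# Illustrative records; the field names are assumptions.
batch = [
    {"url": "http://example.com/a", "digest": "sha1:AAA"},
    {"url": "http://example.com/b", "digest": "sha1:BBB"},
    {"url": "http://example.com/a?x=1", "digest": "sha1:AAA"},
]
unique, duplicates = split_batch_duplicates(batch, key=lambda r: r["digest"])
```

In this sketch only the `unique` items would still need a lookup against the external dedup store, while each entry in `duplicates` can be resolved against the earlier capture from the same batch.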