internetarchive / warcprox

WARC writing MITM HTTP/S proxy

DedupableMixin.should_dedup() improvement #153

Closed vbanos closed 3 years ago

vbanos commented 3 years ago

When a recorded URL has recorded_url.do_not_archive = True, it is not written to WARC; this is checked in WarcWriterProcessor._should_archive. DedupableMixin.should_dedup() should skip such URLs, so that we don't waste time deduping something that is never going to be written to WARC anyway.
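A minimal sketch of the proposed early exit, to make the idea concrete. This is not the actual warcprox code: the RecordedUrl stand-in, the payload_size field, and the min_size threshold are assumptions made for illustration; only do_not_archive and should_dedup() come from the description above.

```python
from dataclasses import dataclass


@dataclass
class RecordedUrl:
    """Hypothetical stand-in for warcprox's recorded URL object."""
    url: str
    payload_size: int
    do_not_archive: bool = False


class DedupableMixin:
    """Simplified sketch, not the real warcprox implementation."""

    def __init__(self, min_size: int = 0):
        # min_size is an assumed size threshold, for illustration only.
        self.min_size = min_size

    def should_dedup(self, recorded_url: RecordedUrl) -> bool:
        # Proposed change: bail out early, because a record flagged
        # do_not_archive will never be written to WARC.
        if recorded_url.do_not_archive:
            return False
        # Otherwise fall through to the usual (here simplified) size check.
        return recorded_url.payload_size > self.min_size


mixin = DedupableMixin(min_size=0)
skipped = mixin.should_dedup(
    RecordedUrl("http://example.com/a", 1024, do_not_archive=True))
deduped = mixin.should_dedup(RecordedUrl("http://example.com/b", 1024))
```

Putting the flag check first means the dedup lookup is never attempted for records that the writer would discard anyway.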

vbanos commented 3 years ago

I found out about this because I'm writing a plugin that is going to be loaded earlier and set do_not_archive = True for some URLs that are captured too often. There is no point in deduping these.
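A rough sketch of what such a plugin could look like. Everything here is hypothetical: the class name FrequentUrlFilter, the process() method, and the max_captures threshold are invented for illustration and do not reflect warcprox's actual plugin interface; the only thing taken from this issue is that the plugin sets do_not_archive = True.

```python
from collections import Counter
from types import SimpleNamespace


class FrequentUrlFilter:
    """Hypothetical plugin: flags URLs captured more than max_captures times."""

    def __init__(self, max_captures: int = 3):
        self.max_captures = max_captures  # assumed threshold
        self.seen = Counter()             # capture count per URL

    def process(self, recorded_url) -> None:
        # Count this capture, then flag the record once the URL
        # has been seen more often than the threshold allows.
        self.seen[recorded_url.url] += 1
        if self.seen[recorded_url.url] > self.max_captures:
            recorded_url.do_not_archive = True


plugin = FrequentUrlFilter(max_captures=2)
flags = []
for _ in range(3):
    # SimpleNamespace stands in for a recorded URL object.
    record = SimpleNamespace(url="http://example.com/", do_not_archive=False)
    plugin.process(record)
    flags.append(record.do_not_archive)
```

With a threshold of 2, the first two captures keep do_not_archive = False and the third is flagged, which is the kind of record a later dedup step has no reason to look at.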