vbanos closed 3 years ago
I noticed this while writing a plugin that is loaded earlier and sets `do_not_archive = True`
for some URLs that are captured too often. It then occurred to me that there is no point in deduplicating these.
When a recorded URL has `recorded_url.do_not_archive = True`, it is not written to WARC. This is checked in `WarcWriterProcessor._should_archive`. We shouldn't waste time deduplicating something that is not going to be written to WARC anyway.
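
To make the idea concrete, here is a minimal, self-contained sketch of the early exit. It is not the actual warcprox code: `RecordedUrl`, `DedupSketch`, `should_dedup`, and the in-memory `seen` dict are hypothetical stand-ins, and only the `do_not_archive` attribute name comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RecordedUrl:
    """Hypothetical stand-in for a recorded URL; not the warcprox class."""
    url: str
    payload_digest: str
    do_not_archive: bool = False
    dedup_info: Optional[dict] = None


class DedupSketch:
    """Hypothetical dedup step backed by an in-memory dict."""

    def __init__(self):
        self.seen = {}      # payload digest -> info about the first capture
        self.lookups = 0    # counts how many dedup lookups were performed

    def should_dedup(self, recorded_url):
        # Anything flagged do_not_archive will never reach the WARC,
        # so a dedup lookup for it would be wasted work.
        return not recorded_url.do_not_archive

    def process(self, recorded_url):
        if not self.should_dedup(recorded_url):
            return recorded_url  # skip the lookup entirely
        self.lookups += 1
        prior = self.seen.get(recorded_url.payload_digest)
        if prior is not None:
            recorded_url.dedup_info = prior
        else:
            self.seen[recorded_url.payload_digest] = {"url": recorded_url.url}
        return recorded_url
```

In this sketch, a URL flagged by an earlier plugin passes through `process` without touching the dedup store, which is the behavior proposed here.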