ukwa / ukwa-heritrix

The UKWA Heritrix3 custom modules and Docker builder.

Ensure we keep crawl log files #20

Closed anjackson closed 5 years ago

anjackson commented 5 years ago

To my surprise and dismay, telling Heritrix to keep only the latest checkpoint also makes it delete the log files from previous checkpoints! This causes no problem when we're promptly syncing to HDFS, but it is undesirable on other systems, or if we hit a bottleneck.

anjackson commented 5 years ago

We already allow this to be set via CHECKPOINT_FORGET_ALL_BUT_LATEST, but it's best to switch the default to false.
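As a rough illustration of what "default to false" means for an environment-driven setting, here is a minimal Python sketch. It is not the project's actual code: the helper name `forget_all_but_latest` and the rule that only the string "true" enables the option are assumptions made for the example. Only the variable name CHECKPOINT_FORGET_ALL_BUT_LATEST comes from this thread.

```python
import os

# Variable name taken from the issue thread.
ENV_VAR = "CHECKPOINT_FORGET_ALL_BUT_LATEST"


def forget_all_but_latest(env=None):
    """Return True only if the variable is explicitly set to "true".

    Hypothetical helper: when the variable is unset, the value falls
    back to "false", so older checkpoints (and their log files) are
    kept unless the operator opts in to discarding them.
    """
    if env is None:
        env = os.environ
    raw = env.get(ENV_VAR, "false")  # assumed new default: false
    return raw.strip().lower() == "true"


if __name__ == "__main__":
    # Prints the effective setting for the current environment.
    print(forget_all_but_latest())
```

With a default like this, a deployment that syncs promptly to HDFS can still opt in by setting the variable to true, while other systems keep their earlier checkpoint logs unless told otherwise.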