internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/
Other
2.83k stars 763 forks source link

Ensure valid checkpoints can be created when recovering from errors #392

Closed anjackson closed 3 years ago

anjackson commented 3 years ago

We hit occasional problems during checkpoint writing. These are mostly due to a non-checkpoint log file not being present when attempting to make a checkpoint, due to some earlier issue (not 100% clear what). This means, when you try to checkpoint, his happens:

SEVERE: org.archive.crawler.framework.CheckpointService checkpointFailed  Checkpoint failed [Fri May 28 10:28:03 GMT 2021]
java.io.IOException: Unable to move /heritrix/output/frequent-npld/20210519154706/logs/crawl.log to /heritrix/output/frequent-npld/20210519154706/logs/crawl.log.cp00032-20210528102802
        at org.archive.io.GenerationFileHandler.rotate(GenerationFileHandler.java:127)
        at org.archive.crawler.reporting.BufferedCrawlerLoggerModule.rotateLogFiles(BufferedCrawlerLoggerModule.java:331)
        at org.archive.crawler.reporting.BufferedCrawlerLoggerModule.doCheckpoint(BufferedCrawlerLoggerModule.java:393)
        at org.archive.crawler.framework.CheckpointService.requestCrawlCheckpoint(CheckpointService.java:285)
...

But the checkpoint partially completes, so if you fix the problem (e.g. by adding an empty log file), and try to re-checkpoint, you get:

INFO: org.archive.crawler.framework.CheckpointService requestCrawlCheckpoint no progress since last checkpoint; ignoring [Fri May 28 10:29:13 GMT 2021]
no progress since last checkpoint; ignoring

Which is all well and good, except that no coherent/consistent checkpoint has been written, so it's impossible to resume the crawl.

That check is implemented here:

https://github.com/internetarchive/heritrix3/blob/37ce8d694590b0cf8cbe0a38a58c5f8ee719c4f0/engine/src/main/java/org/archive/crawler/framework/CheckpointService.java#L252-L259

This could be addressed by making it possible to ignore the lastCheckpointSnapshot and force a checkpoint. Or, we could ensure that the lastCheckpointSnapshot field only gets updated after the checkpoint is completely successfully executed.

It is plausible that you might want to force a checkpoint even if no progress has been made, e.g. because it involved resolving an issue with the crawl state itself. But this seems like a rare exception.

anjackson commented 3 years ago

Looks like this line sets the laskCheckpointSnapshot and it's in the finally so it always gets called:

https://github.com/internetarchive/heritrix3/blob/37ce8d694590b0cf8cbe0a38a58c5f8ee719c4f0/engine/src/main/java/org/archive/crawler/framework/CheckpointService.java#L316

It should be the last thing in the try clause.

anjackson commented 3 years ago

The stats are now only written if the checkpoint executes without throwing an exception, which is as good as it's likely to get for now.