Closed anjackson closed 3 years ago
Looks like this line sets the laskCheckpointSnapshot
and it's in the finally
so it always gets called:
It should be the last thing in the try
clause.
The stats are now only written if the checkpoint executes without throwing an exception, which is as good as it's likely to get for now.
We hit occasional problems during checkpoint writing. These are mostly due to a non-checkpoint log file not being present when attempting to make a checkpoint, due to some earlier issue (not 100% clear what). This means, when you try to checkpoint, his happens:
But the checkpoint partially completes, so if you fix the problem (e.g. by adding an empty log file), and try to re-checkpoint, you get:
Which is all well and good, except that no coherent/consistent checkpoint has been written, so it's impossible to resume the crawl.
That check is implemented here:
https://github.com/internetarchive/heritrix3/blob/37ce8d694590b0cf8cbe0a38a58c5f8ee719c4f0/engine/src/main/java/org/archive/crawler/framework/CheckpointService.java#L252-L259
This could be addressed by making it possible to ignore the
lastCheckpointSnapshot
and force a checkpoint. Or, we could ensure that thelastCheckpointSnapshot
field only gets updated after the checkpoint is completely successfully executed.It is plausible that you might want to force a checkpoint even if no progress has been made, e.g. because it involved resolving an issue with the crawl state itself. But this seems like a rare exception.