internetarchive / heritrix3

Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
https://heritrix.readthedocs.io/
Other
2.83k stars 763 forks source link

Checkpoints 'spoiled' when used to resume crawls #277

Closed anjackson closed 4 years ago

anjackson commented 5 years ago

We've seen occasional mysterious issues when resuming crawls from checkpoints multiple times. Close inspection of the behaviour when resuming the crawl indicates that checkpoints can only reliably be used once.

The code uses the DbBackup helper to manage backups. This depends on calling startBackup, the docs for which note:

Following this method call, all new data will be written to other, new log files. In other words, the last file in the backup set will not be modified after this method returns.

i.e. the BDB-JE makes sure the backup is consistent and after that those files should no longer be altered (BDB is an append-only DB system). The backup documentation implies this flushed, sync'd consistency is necessary for the backup to work.

(Note that a H3 checkpoint is not same thing as a BDB-JE checkpoint - the former is a point in time backup, but the latter is a flush/sync operation).

However, when resuming from a crawl, rather than copying the checkpoint files (as recommended by the documentation), Heritrix3 uses hard links (and cleans out any other state files not part of the checkpoint).

This causes an issues because, having resumed a crawl, I noticed that the last .jdb file in the checkpoint was being changed! From the backup behaviour, we might expect that existing files would not be changed, but in fact when resuming a crawl, the system proceeds by appending data to the last .jdb file.

As this file is a hard link, this activity also changes the contents of the checkpoint. Furthermore, if we are resuming the crawl from one older checkpoint among many, all subsequent checkpoints are also modified.

As an example, we recently attempted to resume from a checkpoint, hit some difficulties and had to force-kill the crawler. After this, we attempted to re-start from that checkpoint, and hit some very strange behaviour that hung the crawler. (Fortunately, we happen to have a backup of those checkpoints!)

To avoid this, we could actually copy the last log file rather than make a hard link back to the checkpoint. Alternatively, it may be possible to call start/stop backup immediately upon restoring the DB, which should prevent existing files being appended to (assuming no race conditions).

anjackson commented 5 years ago

@ato points out that enabling copy-on-write via cp --reflink=auto might present a safe and space-efficient alternative on file systems that support it.

But after discussion we agreed we should implement the simplest possible fix first, and check it works, before looking into that as an optimisation.

anjackson commented 5 years ago

Example results. From the original checkpoint:

...
-rw-r--r--.  1 root root  9798691 Jul 26 17:37 00094541.jdb,9798691
-rw-r--r--.  1 root root  8962851 Jul 26 17:37 00094542.jdb,8962851
-rw-r--r--.  1 root root  2129511 Jul 26 18:16 jdbfiles.manifest

(yeah yeah it's running as root! sue me! ;-) )

From the same folder after using the checkpoint (but then killing the crawl):

-rw-r--r--.  2 root root  9798691 Jul 26 17:37 00094541.jdb,9798691
-rw-r--r--.  1 root root  9999584 Aug  7 11:13 00094542.jdb,8962851
-rw-r--r--.  1 root root  2129511 Jul 26 18:16 jdbfiles.manifest

i.e. the last file has grown by 1036733 bytes, i.e. about a MB.

anjackson commented 5 years ago

@ato points out that the setup() code is taking steps to ensure the checkpoint is preserved:

https://github.com/internetarchive/heritrix3/blob/aa705bef2eb8fbbb9aeb56995e713a7b6ba0ed00/commons/src/main/java/org/archive/bdb/BdbModule.java#L291-L296

So, perhaps this doesn't work, or perhaps some other activity is managing to interact with the frontier while this is going on?

https://github.com/internetarchive/heritrix3/blob/aa705bef2eb8fbbb9aeb56995e713a7b6ba0ed00/commons/src/main/java/org/archive/bdb/BdbModule.java#L233-L238

anjackson commented 5 years ago

There was no later 00094543 file in the above example, which indicates the start/stop backup logic isn't happening at all, or doesn't work as expected.

If these exceptions were getting silently swallowed, that might allow the system to silently fail to get to the freeze bit.

https://github.com/internetarchive/heritrix3/blob/aa705bef2eb8fbbb9aeb56995e713a7b6ba0ed00/commons/src/main/java/org/archive/bdb/BdbModule.java#L239-L243

But Spring is calling start() and would surely die on such an exception rather than swallow it!?

anjackson commented 5 years ago

Ah, I think I've got something. Right, looking at the BDB-JE code, I noticed that a lastFileInBackup property was being set to the last but one log file. When taking a checkpoint, H3 uses getLogFilesInBackupSet and this drops the last log file and does not include it in the checkpoint.

i.e. BDB-JE does not 'roll' the files so the current set will no longer change, it just omits the latest file and gives you the set that can no longer change.

Hence, when we call dbBackup.startBackup(); on recovery, it does nothing at all.

n.b. This doesn't actually contradict the startBackup documentation, but it's easy to misinterpret as doing something it doesn't.

UPDATE

Hmm, the startBackup code should call forceLogFileFlip which is supposed to do what we expect.

@ato points out ENV_RECOVERY_FORCE_NEW_FILE which is available in more recent DBD-JE versions. Perhaps this indicates issues with flipping the files on recovery?

anjackson commented 5 years ago

Related docs:

Restoring from a backup

As described in the sections above, the restore procedure is to copy the files from a backup set into an empty directory. Depending on the type of backup that was performed (see above), either all files from the backup set are copied, or only the files on a list that was created during the backup.

There is one additional consideration when performing a restore, under the following condition:

  • Incremental backups are used, AND
    • the backup was created using DbBackup with JE 6.2 or earlier, OR
    • the backup was created in a read-only JE environment.

If the above condition holds, after copying the files an additional step is needed. To enable the creation of future incremental backups using the restored files, the EnvironmentConfig.ENV_RECOVERY_FORCE_NEW_FILE parameter should be set to true when opening the JE Environment for the first time after the restore. When this parameter is set to true, the last .jdb file restored will not be modified when opening the Environment, and the next .jdb file will be created and will become the end-of-log file.

WARNING: When the above special condition is true and this property is not set to true when opening the environment for the first time after a restore, then the backup set that was restored may not be used as the basis for future incremental backups. If a future incremental backup were performed based on this backup set, it would be incomplete and data would be lost if that incremental backup were restored.

When JE 6.3 or later is used to create the backup, and the backup is created in a read-write environment (the usual case), this extra step is unnecessary. In this case, startBackup() will have added an "immutable file" marker to the last file in the backup and this will prevent that file from being modified, just as if the ENV_RECOVERY_FORCE_NEW_FILE parameter were set to true.

It's hard to tell what's going on with our old version, and whether this is the actual cause.

But, it does indicate that a sensible alternative option would be to upgrade BDB-JE to >= 6.3, which is something we want to do anyway.

anjackson commented 5 years ago

Well, superficially at least, the upgrade seems fairly straightforward: https://github.com/ukwa/heritrix3/tree/upgrade-bdb-je

One test was using Thread.interrupt() which later versions of BDB-JE don't like, but apart from that it compiles and tests fine. I'll need to do some work to make full-scale testing more straightforward first though I think.