richfelker / bakelite

Incremental backup with strong cryptographic confidentiality baked into the data model.
GNU General Public License v2.0

Difficulty backing up localindex #3

Open richfelker opened 2 years ago

richfelker commented 2 years ago

To continue using an incremental backup after restoring from it, you need the localindex corresponding to it. This can be achieved by making sure the index is included in the backup, but that has two problems:

  1. It's a large file (possibly hundreds of MB or even several GB) that's regenerated each time and cannot itself be backed up incrementally, so including it adds substantial storage and bandwidth cost to every backup, and
  2. The index backed up would reflect the previous incremental backup state, not the new one being generated; that's okay if both are kept, but it could point to blobs that no longer exist if the previous backup has already been pruned.

The second problem is solvable by keeping backups of indices in a separate backup store (note: they should still be encrypted, so this would mean another bakelite backup store, not just rsync or something), but the first problem remains.

I think the most elegant solution would be not to back up the index at all (exclude it, either manually in the exclude file or automatically by matching its inode) and instead add functionality in the restore operation to regenerate the index. A block-only index can be created simply by decrypting the blobs and mapping the sha3 of their decrypted content to the sha3 of the encrypted blob. The inode part of the index can only be recreated when the files are actually restored into a real filesystem and assigned inode numbers. This may be problematic if the restore is taking place onto a transport medium that's different from the final filesystem the restored data will live on.
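As a concrete illustration (sketched in C++, not bakelite's actual code), regenerating the block half of the index amounts to decrypting every blob in the store and recording the plaintext-hash-to-ciphertext-hash mapping. `store_list_blobs()`, `fetch_blob()`, and `decrypt_blob()` below are hypothetical stand-ins for the store access and decryption internals, and the hashing uses OpenSSL's SHA3-256 purely for illustration:

```cpp
#include <openssl/evp.h>
#include <map>
#include <string>
#include <vector>

// SHA3-256 of a buffer (OpenSSL chosen only for illustration).
static std::string sha3_256(const std::vector<unsigned char> &data) {
    unsigned char md[EVP_MAX_MD_SIZE];
    unsigned int len = 0;
    EVP_Digest(data.data(), data.size(), md, &len, EVP_sha3_256(), nullptr);
    return std::string(reinterpret_cast<char *>(md), len);
}

// Hypothetical stand-ins for bakelite's store access and crypto internals.
std::vector<std::string> store_list_blobs();  // blob ids = sha3 of ciphertext
std::vector<unsigned char> fetch_blob(const std::string &id);
std::vector<unsigned char> decrypt_blob(const std::vector<unsigned char> &ct);

// The block half of the localindex is a map from the sha3 of each block's
// decrypted content to the sha3 of the encrypted blob storing it, so it
// can be rebuilt by walking and decrypting every blob in the store.
std::map<std::string, std::string> rebuild_block_index() {
    std::map<std::string, std::string> index;
    for (const auto &blob_id : store_list_blobs()) {
        std::vector<unsigned char> plaintext = decrypt_blob(fetch_blob(blob_id));
        index[sha3_256(plaintext)] = blob_id;
    }
    return index;
}
```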

Many users may be happy with just the block index being restored, as that covers the bulk of data in a backup consisting mostly of files larger than 4k; without the inode index, new inode records would just be created for everything on the next incremental backup, but all the block data would be reusable. However, we could also dump an intermediate file for regenerating the index, mapping pathnames to inode records in the backup, which could be programmatically converted to an inode-based index once the files are in their final place.
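The conversion step could plausibly be as simple as stat()-ing each restored pathname and re-keying the dumped records by the inode numbers the final filesystem assigned. A minimal sketch, where `inode_record`, `load_path_dump()`, and `write_inode_index()` are hypothetical placeholders rather than bakelite's real interfaces:

```cpp
#include <sys/stat.h>
#include <map>
#include <string>

struct inode_record {
    // hypothetical: whatever per-inode metadata the backup format stores
    std::string serialized;
};

// Assumed to parse the intermediate pathname -> inode-record dump.
std::map<std::string, inode_record> load_path_dump(const char *dumpfile);
// Assumed to emit the inode half of the localindex.
void write_inode_index(const std::map<ino_t, inode_record> &by_inode);

// Once restored files sit on their final filesystem, re-key the saved
// records by the inode numbers that filesystem actually assigned.
void finalize_inode_index(const char *dumpfile) {
    std::map<ino_t, inode_record> by_inode;
    for (const auto &[path, rec] : load_path_dump(dumpfile)) {
        struct stat st;
        if (stat(path.c_str(), &st) == 0)
            by_inode[st.st_ino] = rec;
    }
    write_inode_index(by_inode);
}
```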

richfelker commented 2 years ago

I've written and tested a proof of concept for regenerating the localindex as part of the restore operation, and it worked for restoring and continuing incremental backups from a test repository. I think this is an acceptable solution, so I'll try to polish it up and commit it. Current limitations that need to be overcome: