restic / restic

Fast, secure, efficient backup program
https://restic.net
BSD 2-Clause "Simplified" License

replace lock files with two-phase compaction #1141

Open mappu opened 7 years ago

mappu commented 7 years ago

Restic currently uses lock files to prevent dangerous race conditions when performing non-additive operations (basically, prune).

The lock file system works quite well across all supported backends, although there are some downsides. A forgotten / crashed machine will not clear the exclusive lock file; you can clear it manually, but maybe the other machine was a laptop that had only gone to sleep, and when it wakes up again it will wreak havoc.

There's an alternative: two-phase fossil compaction.

The project "duplicacy" is shared-source (not FOSS) but has an interesting design document available on this topic:

Can these ideas be applied to restic at the pack level? Can we fossilize a pack by only making changes to the index files?

mappu commented 7 years ago

At the time of writing, the duplicacy project has this to say about restic on its README.md file:

restic is a more recent addition. It is worth mentioning here because, like Duplicacy, it is written in Go. It uses a format similar to the git packfile format. Multiple clients backing up to the same storage are still guarded by locks. A prune operation will therefore completely block all other clients connected to the storage from doing their regular backups. Moreover, since most cloud storage services do not provide a locking service, the best effort is to use some basic file operations to simulate a lock, but distributed locking is known to be a hard problem and it is unclear how reliable restic's lock implementation is. A faulty implementation may cause a prune operation to accidentally delete data still in use, resulting in unrecoverable data loss. This is the exact problem that we avoided by taking the lock-free approach.

fd0 commented 7 years ago

I've seen what duplicacy does, but I have a hard time understanding how to apply this to restic without changing the repository format completely (and basically doing exactly what duplicacy does).

mappu commented 7 years ago

Here is a proposal:

During a repository prune, the prune binary should:

  1. Find currently-running jobs ("X,Y,Z")
  2. Determine which packs can be removed ("A,B,C")
  3. Find currently-running jobs again. If they do not match those from step 1, return to step 1 (or abandon).
  4. Upload a new index that still contains all known packs, but carries two additional messages: "packs A,B,C can be deleted after jobs X,Y,Z complete without using them" and "this index is provisional until confirmed".
  5. Find currently-running jobs once more. If they match the jobs from steps 1 and 3, upload an extra index file containing nothing except "confirming step 4". If they do not match, delete the index from step 4 and return to step 1 (or abandon).

During a backup job: Any new backup job will load the indexes first. The backup job should not use any of the packs A,B,C at all, regardless of whether the mark is provisional. The only jobs that may use packs A,B,C are the jobs X,Y,Z that were already running. The backup job should additionally embed its lock ID into the snapshot JSON, so that a snapshot can be identified as "belonging to X/Y/Z".

During a subsequent prune:

Comments:

  1. Since it requires cooperation from all restic clients, this should be part of a breaking repository version number bump.
  2. This proposal excludes repacking, and simplifies prune to only delete entire packs. Honestly, this is more realistic when cloud storage is involved.
  3. Lock files still exist, but none of them need to be exclusive.
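
To illustrate the proposal, here is a minimal Go sketch of what such a provisional deprecation marker could carry and how a new backup job might consult it. The type and field names are hypothetical and are not part of restic's actual index format.

```go
package main

import "fmt"

// DeprecationMarker is a hypothetical record a provisional index could
// carry, as proposed above. None of these types exist in restic today.
type DeprecationMarker struct {
	Provisional bool     // true until a follow-up index confirms it
	DeadPacks   []string // packs A,B,C scheduled for deletion
	GracedJobs  []string // lock IDs of jobs X,Y,Z that may still read them
}

// usableByNewBackup reports whether a new backup job may reference blobs
// from the given pack. Per the proposal, new jobs must avoid deprecated
// packs regardless of whether the marker is still provisional.
func (m DeprecationMarker) usableByNewBackup(packID string) bool {
	for _, dead := range m.DeadPacks {
		if dead == packID {
			return false
		}
	}
	return true
}

func main() {
	m := DeprecationMarker{
		Provisional: true,
		DeadPacks:   []string{"packA", "packB", "packC"},
		GracedJobs:  []string{"jobX", "jobY", "jobZ"},
	}
	fmt.Println(m.usableByNewBackup("packA")) // false: must not be reused
	fmt.Println(m.usableByNewBackup("packD")) // true
}
```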

I am interested in discussing this proposal, and in implementing a PoC to test performance and correctness, but since this work is related to my $DAYJOB I cannot legally submit nontrivial changes.

fd0 commented 7 years ago

I'll need to think about this, but I like the general direction. We can certainly add options so that an index file is able to mark pack files as deprecated.

The lock file ID, by the way, changes regularly: the lock is refreshed (recreated) every five minutes.

ifedorenko commented 6 years ago

I believe many storage providers do not guarantee consistent file reads, and at least some do not guarantee consistent directory listings. For example, here is a quote from the Amazon S3 documentation: "A process writes a new object to Amazon S3 and immediately lists keys within its bucket. Until the change is fully propagated, the object might not appear in the list".

This means that both the current lock-based and the proposed lock-free implementations allow prune to discard blobs used by live snapshots. At least in theory.

I have an alternative lock-free idea, which basically boils down to an explicit wait time between when a pack is marked for deletion and when the pack is actually deleted. This approach does not completely eliminate the possibility of data loss, but it makes the expected time to consistency explicit, so we can decide (or let users configure) the timeout before we give up waiting for files to appear. This approach does require the ability to track the time between the first and second compaction phases, but all other repository operations do not need to know about time nor require any additional coordination. I can elaborate this idea in more detail if there is interest.
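
As a rough sketch of the timing check this boils down to, assuming a made-up fossil record that stores when the pack was marked:

```go
package main

import (
	"fmt"
	"time"
)

// fossil is a hypothetical record written when a pack is marked for
// deletion; the actual on-disk layout would be up to the implementation.
type fossil struct {
	PackID   string
	MarkedAt time.Time
}

// readyForDeletion implements the explicit-wait idea: a fossil may only be
// removed once the configured consistency window has elapsed, giving
// eventually consistent backends (and in-flight backups) time to catch up.
func readyForDeletion(f fossil, wait time.Duration, now time.Time) bool {
	return now.Sub(f.MarkedAt) >= wait
}

func main() {
	f := fossil{PackID: "abc123", MarkedAt: time.Now().Add(-2 * time.Hour)}
	fmt.Println(readyForDeletion(f, 24*time.Hour, time.Now())) // false: keep waiting
	fmt.Println(readyForDeletion(f, time.Hour, time.Now()))    // true
}
```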

alphapapa commented 6 years ago

This means that both the current lock-based and the proposed lock-free implementations allow prune to discard blobs used by live snapshots. At least in theory.

Thank you for pointing that out. I guess this means that, to be as safe as possible, Restic's backups and prunes should only be used with local repos--and for remote repos, something like rclone should be used to sync local repos to remote storage. If this is indeed the case, it should probably be documented; even though it makes Restic sound bad, and even though it's unlikely, it would be less than full disclosure to not mention it in the docs.

fd0 commented 6 years ago

That's why restic does not simply check for locks, create a new lock, and call it done; the process is a bit more complicated, starting here: https://github.com/restic/restic/blob/c40002246d3804555073d7c6679a487acf8012fd/internal/restic/lock.go#L81

It checks for other locks, creates a lock, waits a bit, checks for other locks again. That works astonishingly well in practice.

Any other lock-free process would need an atomic Rename() or Move() operation, am I right here? Do all backends provide that?
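
For readers who don't want to dig through the linked code, the check / create / wait / re-check sequence roughly amounts to the following Go sketch; the helper functions are placeholders, not restic's real internal API:

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errRepoLocked = errors.New("repository is already locked")

// Placeholder backend operations standing in for restic's lock handling.
func listExclusiveLocks() ([]string, error) { return nil, nil }
func createSharedLock() (string, error)     { return "lock-id", nil }
func removeLock(id string) error            { return nil }

// acquireSharedLock sketches the sequence: check for exclusive locks,
// create a lock, wait a bit, then check again and back off on conflict.
func acquireSharedLock(wait time.Duration) (string, error) {
	locks, err := listExclusiveLocks()
	if err != nil {
		return "", err
	}
	if len(locks) > 0 {
		return "", errRepoLocked
	}
	id, err := createSharedLock()
	if err != nil {
		return "", err
	}
	time.Sleep(wait) // give concurrently created locks time to become visible
	locks, err = listExclusiveLocks()
	if err != nil || len(locks) > 0 {
		_ = removeLock(id) // back off instead of proceeding
		return "", errRepoLocked
	}
	return id, nil
}

func main() {
	id, err := acquireSharedLock(100 * time.Millisecond)
	fmt.Println(id, err)
}
```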

ifedorenko commented 6 years ago

It varies from backend to backend, and unfortunately not all backends document their concurrency models. I do not believe S3 alone can be used to achieve a bullet-proof reliable prune, but it can be augmented with other AWS services like SQS or DynamoDB (and probably a few others). Google Cloud Storage provides strong bucket list consistency, so both the current lock-based and the proposed lock-free approaches should work reliably. I could not find anything about B2, and Microsoft really needs a TL;DR version of their Azure blob storage documentation.

ifedorenko commented 6 years ago

I have yet another proposal, which boils down to

This approach does not require a new repository format and only needs a strongly consistent repository list operation (which can be approximated with multiple eventually consistent lists). I have a Google doc with a little more detail (just request access if you want to comment).

I am happy to work on the implementation but obviously won't do anything until there is agreement about the approach.

gebi commented 6 years ago

I made a suggestion in #1544 which would remove the need for a lock file altogether, without the need to change the backend storage format or to make individual calls to the backend for additional locking checks.

Summary:

Deferred prune: As for parallel prunes, why not use a two-phase "commit" for prunes? restic prune only writes the objects it would delete out to a file; on the next run it would run normally, meaning it first creates a list of objects to prune from the whole repository, and after that only deletes files found in the list written on the previous run. All objects not in the list are written out into a new list. This would remove the need for any locking, with the safety guarantee of not deleting objects from running backups that were shorter than the prune interval (which, if prune only runs every month, is quite a good safety margin). This idea would have the additional advantage of not needing a new repository format either. It might be possible to store the list in the backend store itself, though I'd prefer not to, as the client should not assume that it can delete files from the storage.

I'd prefer not to have any finer-grained locking, as this might affect performance pretty negatively, especially for cloud storage with high latency.
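
A minimal Go sketch of the deferred-prune bookkeeping described above, assuming a local JSON file as the candidate list (the file name and format are illustrative only, not an actual restic feature):

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// loadCandidates reads the pack IDs the previous prune run queued for deletion.
func loadCandidates(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, nil // first run: nothing queued yet
	}
	if err != nil {
		return nil, err
	}
	var ids []string
	if err := json.Unmarshal(data, &ids); err != nil {
		return nil, err
	}
	return ids, nil
}

// saveCandidates queues the current run's unreferenced packs for the next run.
func saveCandidates(path string, ids []string) error {
	data, err := json.Marshal(ids)
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

func main() {
	const queue = "prune-candidates.json"

	previous, err := loadCandidates(queue)
	if err != nil {
		panic(err)
	}

	// Pretend these are the packs the current repository scan found unreferenced.
	current := map[string]bool{"packA": true, "packC": true}

	// Only delete what was already queued last run AND is still unreferenced now.
	for _, id := range previous {
		if current[id] {
			fmt.Println("would delete", id)
			delete(current, id)
		}
	}

	// Everything still unreferenced gets queued for the next run.
	var next []string
	for id := range current {
		next = append(next, id)
	}
	if err := saveCandidates(queue, next); err != nil {
		panic(err)
	}
}
```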

ifedorenko commented 6 years ago

purge just writes the objects it wants to delete to a file

This file tells backup, check, etc, what objects to avoid. I called this file "object lock file", but I think otherwise we are talking about the same approach.

fd0 commented 6 years ago

Thanks for all the suggestions! I feel this is a productive discussion which will (eventually) result in an implementation :)

I made a suggestion which would remove the need for a lockfile altogether in #1544 without the need to change the backend storage format

I'm not sure which suggestion you're talking about, would you mind repeating it here?

Just "not create lock files" does not work for all users with the current implementation, and since the current repo format definition states that there will be lock files, not creating them is also a "change" in this sense. While this would allow us to keep data in the repo (we don't have to rewrite everything), it would still require incrementing the repo version number in the config. Which is a breaking change for older clients (for good reasons).

Also, adding new files to the repo is also a change.

gebi commented 6 years ago

@fd0 I've already repeated the suggestion here in my post from 15.Feb; sorry if I was not clear enough.

Basically it's a two-phase commit, but with different tradeoffs, e.g. the two-phase commit is done "offline" and is the sole responsibility of the client doing the prune (not of the client doing the backup/write of the objects).

The core idea is that the client locally stores a list of objects it wants to prune, and only on the next prune run actually deletes them. This gives a safety margin against corruption for every backup that did not take longer than the interval between the two prune calls.

Though this idea changes the contract in the repo format about lock files.

Thus there is an additional idea: just write the lock files, but never delete them. This would change the handling in such a way that restic writes a lock file for the current instance at least once. All "dumb" clients doing prune without two-phase commit for deletion of objects would then error out and be safe in that regard. All clients with two-phase commit for prune would just ignore the lock files, as they don't need them. For creating backups with restic, the code would also just ignore the lock files, as they're not needed on the backup side; the only thing that can happen, as far as I can see, is that objects could possibly be duplicated in the backend store, but no data would be corrupted.

ifedorenko commented 6 years ago

I think there are many ways to skin this cat, but in the end we will need to prove that the new implementation guarantees snapshots can't reference deleted blobs under all concurrency and failure scenarios. The secondary goal is to guarantee that all unused pack and index files are eventually deleted. Right now I don't see how to do that without backup knowing the list of blobs being deleted by a concurrent prune, which I think in turn requires a strongly consistent datastore to share the list of blobs between backup and prune. I still believe we are thinking about the same basic idea; I just want to emphasize the guarantees the new implementation should provide.

gebi commented 6 years ago

@ifedorenko "Right now I don't see how to do that without backup knowing the list of blobs being deleted by concurrent prune"... please read my suggestion from above again and feel free to ask questions.

But right now I feel that you have neither read nor understood the suggestions I've given (now even a second time, with additional information).

ifedorenko commented 6 years ago

I did read them, multiple times, honestly ;-)

When prune selects blobs it wants to delete, backup must not reuse to-be-deleted blobs. Likewise, when backup selects blobs it wants to reuse, prune must not delete to-be-reused blobs. Do you agree these are required properties of any prune/backup implementation?

I think you propose to:

  1. remove to-be-deleted blobs from the index, so they are not reused by subsequent backups.
  2. wait long enough for all concurrently running jobs to finish (and, I assume, for eventually consistent backends like S3 to resolve all inconsistencies).
  3. validate that the blobs selected in phase 1 are not used by any snapshot
  4. delete blobs selected in phase 1

Did I get your proposal more or less right, or did I completely misunderstand your idea?

fd0 commented 6 years ago

@gebi Thanks for the additional information. I think I understood what you've proposed, I need to think about this for a bit. Would you mind confirming that @ifedorenko's summary (thanks!) is accurate?

gebi commented 6 years ago

Hi,

thx for going through all my ideas about a lockless prune operation!

@ifedorenko, yes, you are on the right track. Ad 4: you understood correctly. Ad 3: exactly correct too. Ad 2: yes, this is done with eventual consistency, but the time intervals I have in mind are way above any eventual-consistency delay known for common cloud storage (e.g. I thought about a delay of a month or more, but the safety margin depends on the difference to the time it takes for your longest backup job to run). Ad 1: this is indeed a point that needs a second thought and might make it necessary to use a three-phase commit (idea below).

With your point 1, there is an additional step needed in the prune operation:

  1. Prune checks which backups it wants/is instructed to remove; this results in a list of objects to remove from the shared datastore.
  2. Prune stores this list locally on the machine running the prune; let's call this the list of objects intended to be deleted.
  3. Prune takes the list of objects intended to be deleted and computes the set difference of this list with all objects referenced by backups made in the timeframe between steps 1 and 3 (see the sketch after this list). This creates a new list of objects to be deleted, but this time prune "proposes" this list into the datastore as a new index. (This is the additional step that handles your point 1.)
  4. On the third invocation, prune can now finally delete the difference between the "old" index from before step 3 and the one created in step 3, minus any objects that are referenced by backups and that the initial prune should not have deleted (those are objects from backup runs that were executed in parallel to the prune operations; they are valid objects again and need to be added back to the index).
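
A minimal Go sketch of the set-difference computation from step 3, using plain string slices as stand-ins for restic IDs:

```go
package main

import "fmt"

// setDifference returns the members of intended that are not referenced
// anymore; packs re-used by backups made between the prune runs are
// dropped from the deletion list, as in step 3 above.
func setDifference(intended []string, referenced map[string]bool) []string {
	var out []string
	for _, id := range intended {
		if !referenced[id] {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	intended := []string{"packA", "packB", "packC"} // from steps 1 and 2
	referenced := map[string]bool{"packB": true}    // re-used by a backup run in the meantime
	fmt.Println(setDifference(intended, referenced)) // [packA packC]: safe to propose for deletion
}
```
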
stewartadam commented 6 years ago

For context, what operations are we wanting to support in parallel? Are we scoping to support only one client per repo (but we want that client to be able to issue backup and prune operations at once), or could one machine be creating a snapshot while another with the same source data be issuing a prune?

From @ifedorenko's earlier comment:

I think there are many ways to skin this cat, but in the end we will need to prove the new implementation guarantees snapshots can't reference deleted blobs under all concurrency and failure scenarios

I agree, especially if this is going to be a breaking change to backwards compatibility (after all, if there's a breaking change anyway, why not implement it in a way that avoids locks altogether, without compromises?).

from @gebi's comment:

which gives a safety margin against corruption for every backup that has not taken longer than the interval of the two prune calls.

If I am understanding your proposals above, this is correct, but it still would not handle the case where a new backup command is issued just after the second prune operation (the one that does the actual deletion) has started. Is that correct?

ifedorenko commented 6 years ago

I want to run backup on one computer and prune on another concurrently. I want to run backup on multiple systems concurrently too. I also want to be able to start prune on multiple systems concurrently, but I would not mind if only one instance did any work while other instances waited.

fd0 commented 6 years ago

I've tried to outline and describe what I can imagine doing here: https://gist.github.com/fd0/d24abf8301724988919c80e65aba9b6d

There are a few open questions at the end of the document, and it feels to me I forgot something. So, please have a look and let me know what you think :)

cdhowie commented 5 years ago

I can imagine a simple two-step prune that has an exclusive lock for a shorter amount of time. It would still disallow some other operations (prune, rebuild-index, check) while running.

It would require the introduction of a new "upgradeable lock" type. This kind of lock prevents other concurrent upgradeable locks and exclusive locks, but allows concurrent shared locks. It can be upgraded to an exclusive lock.

  1. Prune acquires an upgradeable lock on the repository.
  2. All of the standard prune operations are completed, up through the object-marking garbage collection, but the process stops short of rewriting or deleting any packs. (We do the "mark" but not the "sweep.")
  3. After this completes, prune upgrades its lock to an exclusive lock. This should fail if there are any other locks in the repository. (If there are, prune should wait until there are no other locks.)
  4. Once the exclusive lock is acquired, prune will look for any new snapshot objects. If any are found, it reloads the indexes and does object-marking against those snapshots.
  5. The process continues as normal, deleting/rewriting packs, and rebuilding indexes.

This approach allows other clients to write to the repository while prune is busy marking objects in memory, which in my experience is the longest part of prune's operation.

I think the new lock type can be added in a backwards-compatible way. The lock would otherwise look like a shared lock but with a new "upgradeable" flag. This will allow other shared-lock operations (like backup) to operate correctly with the new prune operation even if those clients do not understand the flag. Existing exclusive-lock operations will also work correctly since they will refuse to operate in the presence of any lock.
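
A rough Go sketch of the upgrade check in step 3, with made-up field names rather than restic's actual lock format:

```go
package main

import (
	"errors"
	"fmt"
)

// Lock mirrors the idea of an extra flag on an otherwise ordinary shared
// lock; the fields here are illustrative only.
type Lock struct {
	ID          string
	Exclusive   bool
	Upgradeable bool
}

var errOtherLocks = errors.New("other locks present, cannot upgrade yet")

// upgrade turns an upgradeable lock into an exclusive one, but only if no
// other locks exist in the repository; otherwise the caller should wait
// and retry, as described in step 3 of the proposal above.
func upgrade(mine Lock, all []Lock) (Lock, error) {
	for _, l := range all {
		if l.ID != mine.ID {
			return mine, errOtherLocks
		}
	}
	mine.Exclusive = true
	mine.Upgradeable = false
	return mine, nil
}

func main() {
	mine := Lock{ID: "prune-1", Upgradeable: true}
	others := []Lock{mine, {ID: "backup-7"}} // a backup still holds a shared lock

	if _, err := upgrade(mine, others); err != nil {
		fmt.Println("upgrade deferred:", err)
	}

	upgraded, _ := upgrade(mine, []Lock{mine}) // no other locks left
	fmt.Println("exclusive now:", upgraded.Exclusive)
}
```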

egalanos commented 4 years ago

I've seen what duplicacy does, but I have a hard time understanding how to apply this to restic without changing the repository format completely (and basically doing exactly what duplicacy does).

That may actually be the best way to go to make it work better long term, but would obviously be a massive rewrite of so much of the code :-/

I personally had to switch my backups to duplicacy as the restic design impacts its memory use and has left me with some old backups that I can't easily restore from :-(

aawsome commented 2 years ago

Just want to note that I implemented a two-phase compaction in rustic.

The idea is the following:

Feel free to try it out at www.github.com/rustic-rs/rustic.

jjYBdx4IL commented 2 days ago

"The lock file system works quite well across all supported backends"

I think that's a wrong assumption. Lock files are not as trivial as many think. There is a reason why NFS needs special lock servers and why distributed systems' bottlenecks are usually the coordinators (~lock servers). Just putting a lock file on an arbitrary backend does NOT prevent race conditions as that backend usually does NOT guarantee synchronized visibility to different remotes (especially not cloud based systems because they use relaxed consistency requirements to be able to distribute stuff efficiently).

IMHO the easiest way to fix this is to use a tiered locking system: primarily use a local lock file (or better, a system-wide semaphore) and, if you want, put an additional lock file on the remote side which indicates the client currently locking the repository. However, you can also just omit the remote file, since it will never work reliably. If you use more than one client, there is only one way: make sure to use separate repositories (i.e. put the client name/address into the repository path/ID, etc.).
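
A minimal Go sketch of the local tier, assuming an arbitrary lock path; O_CREATE|O_EXCL is atomic on a local filesystem, so only one process on the machine can hold the lock:

```go
package main

import (
	"fmt"
	"os"
)

// acquireLocalLock takes the machine-local lock suggested above as the
// primary tier. The path is illustrative; a remote lock file would be
// purely informational in this scheme.
func acquireLocalLock(path string) (release func(), err error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_EXCL|os.O_WRONLY, 0o600)
	if err != nil {
		return nil, fmt.Errorf("another local process holds the lock: %w", err)
	}
	f.Close()
	return func() { os.Remove(path) }, nil
}

func main() {
	release, err := acquireLocalLock("/tmp/restic-repo.lock")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer release()
	fmt.Println("local lock held")
}
```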

aawsome commented 2 days ago

@jjYBdx4IL You are right - eventual consistency is what most cloud storages guarantee, and that is actually a bad fit for locking. Now, restic operations like prune usually run for a pretty long time, and the cloud storages in use are usually able to synchronize the (lock) file information pretty fast - so the probability of getting an error due to this is actually quite low. But note I wrote "usually" - we don't know what happens if the cloud storage comes under stress for some reason...

There is the alternative of using a delayed check for files. This is how a two-phase prune actually works, but it could also be a way to make locks really reliable. Of course, this depends on good and valid assumptions about sync times - under all circumstances.