restic / restic

Fast, secure, efficient backup program
https://restic.net
BSD 2-Clause "Simplified" License

Data from incomplete snapshot. #1966

Closed: pauletg closed this issue 6 years ago

pauletg commented 6 years ago

restic 0.9.1 compiled with go1.10.3 on darwin/amd64

I was using restic to back up a large volume of data to Backblaze. Unfortunately, there was a hardware failure on the volume being backed up before the initial snapshot could complete. Is there any way to get any of my data back out of the repository now? Both restic list snapshots and restic mount seem to hang indefinitely when I try; I am not even prompted for the password to the repo. The backup had been gracefully paused before the hardware failure, if that helps.

fd0 commented 6 years ago

Ah, hm. Do you require a specific file, or just "all the data there is"? The data is there, and restic has all the means to pull it out, but that would mean either scripting around restic (which will be very slow) or adding custom code for restic.

Honest question: how important is the data for you? I could spend some time today to hack something together for you, which should give you access to almost all data that has been uploaded to the repo, but it's probably not as "production ready" as most of the code. :)

pauletg commented 6 years ago

The data is very important to me and unfortunately there is no other copy. I know that isn't ideal, but it was a problem I was attempting to solve. I am pursuing a hardware fix to try to bring the volume back online, but that is not looking too hopeful at the moment. If there was a way for me to specify a directory and be able to access and download everything inside of that directory it would be a serious life saver. It is a lot of data (much of it is video projects), so the faster option would be preferable. That being said, I don't really know how much work it would take and I appreciate that this is free software, but I would be extremely grateful if this could be made possible.

fd0 commented 6 years ago

ok, I'll see what I can do.

pauletg commented 6 years ago

Thank you so much.

fd0 commented 6 years ago

You can start by running restic rebuild-index, so we have a fresh index covering all packs in the repo.

pauletg commented 6 years ago

Getting that started now.

mholt commented 6 years ago

Wow, that's super generous of you @fd0.

If I can help with testing anything related to this, let me know.

pauletg commented 6 years ago

Should restic rebuild-index ask me for the repo password? It has not yet. I do suspect that it may take quite a long time due to the amount of data in the repo. I am perfectly content to let it run all weekend, or into next week if necessary. I just want to be sure it doesn't need a password from me before I leave it unattended for a long period of time.

fd0 commented 6 years ago

Hm, it should ask for a password early on. It needs to decrypt all the headers of all files in the repo. Do you maybe have the environment variable RESTIC_PASSWORD exported, so it does not need a password from you?
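A quick way to check from the same shell session (the variable name RESTIC_PASSWORD is restic's real one; everything else here is just an illustrative sketch):

```shell
# Check whether RESTIC_PASSWORD is exported. restic reads this variable
# and skips the interactive password prompt when it is present.
if [ -n "${RESTIC_PASSWORD+set}" ]; then
  msg="RESTIC_PASSWORD is set: restic will not prompt"
else
  msg="RESTIC_PASSWORD is not set: restic should prompt for the password"
fi
echo "$msg"
```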

fd0 commented 6 years ago

It should print something like this early on:

repository ed6136ad opened successfully, password is correct

At least when run interactively (no redirection of stdout to a log file).

fd0 commented 6 years ago

You can also skip the rebuild-index if the last 15 minutes of the uploaded data aren't that important, we can always do that later on.

pauletg commented 6 years ago

I do not have the RESTIC_PASSWORD environment variable set, but I will set it and just let the command run. It did not return anything for about 10 minutes, so I gave it a ctrl-c and tried again. My syntax is correct, right? (restic -r b2:MY_BUCKET_NAME:/ rebuild-index) In any case, the last 15 minutes of data should be very small compared to the total uploaded data, so I would be perfectly happy coming back to this later.

fd0 commented 6 years ago

ok, then don't run rebuild-index just yet, so we can try out the recover code :)

fd0 commented 6 years ago

I've pushed a commit to the branch recover-data, just build restic (go run build.go) and call it like this:

$ restic -r b2:MY_BUCKET_NAME:/ recover

It should then list all trees in the repo, find the root trees, and create a new snapshot referencing all root trees:

repository abe002d6 opened successfully, password is correct
load index files
load 543 trees
tree (543/543)
done
found 2 roots
save tree with 2 nodes
saved new snapshot 26f25bf1

Then you have a snapshot (26f25bf1 in this case) that you can restore to, or just use restic mount to browse around in it. You can also just list it:

$ restic ls -l 26f25bf1 /
repository abe002d6 opened successfully, password is correct
snapshot aac6d0ed of [/recover] filtered by [/] at 2018-08-23 22:23:56.903268714 +0200 CEST):
drwxr-xr-x     0     0      0 2018-08-23 22:23:56 /0b9e25fb
drwxr-xr-x     0     0      0 2018-08-23 22:23:56 /d0d9386a

The top-level directories are named after the tree IDs, so they are a bit cryptic, but the next level down has normal names.

Let me know if that helps you!

fd0 commented 6 years ago

So, to add a bit of background story: I read the github issue, then went for a shower, and thought to myself, "hm, that's not so hard to do". Turns out, I was right, and it wasn't. If this functionality is helpful for others, we can turn it into a proper command later, but for now I hope it works for you and you can access the data that's already uploaded to B2.

How much data was it? How much did you recover?

Good luck!

pauletg commented 6 years ago

Wow! That was fast! Thank you so much! I just cloned the repo and am going to attempt to build it now. I also figured out why rebuild-index wasn't working. It was a DNS issue on the network that our server is on. I fixed that and got Fatal: unable to create lock in backend: repository is already locked by PID 41208 so apparently my upload did not get stopped gracefully after all. The unlock command seems to have cleared that up and rebuild-index is currently running.

I will get as much done on this as I can today, but I am leaving for northern Michigan for the weekend in about 15 minutes and I don't think my internet access will be very good there. This will get my full attention on Monday when I am back and I will get you more details :-)

Thank you so much for this! Sorry to leave you hanging, but I will be in touch on Monday.

mholt commented 6 years ago

If this functionality is helpful for others, we can turn it into a proper command later

I would love to help with this in any way that I can. My upload speed here is 1 Mbps, so my initial backups can take 3-6 months. Having a way to restore before it completes would be an excellent feature, especially if it's not too difficult, as you say. Let me know how I can be of service! Thank you so much for working on this! :D

Also, your solution is quite brilliant I think. Elegant and fairly simple.

fd0 commented 6 years ago

Sorry to leave you hanging, but I will be in touch on Monday.

Don't worry about that, I'm just curious if it works :sunglasses:

The data is safe at B2 and won't go away. Even the recover command won't change any data, it will just read it, add another file and a snapshot, and that's it.

So, to give you a bit of background (maybe I'll expand this into a blog post later): Under the hood, a restic repository contains different types of files, for example snapshot and data files:

When a file is saved with restic, it is cut into data blobs, which are collected and saved together in one or more files in the repository. The name of the file together with the list of references (IDs) to the data blobs is then saved in a tree. When restic is done archiving the directory, the list of files (names and references for data blobs) is saved as a tree blob into another data file.

For sub-directories, restic stores the name of the sub-directory together with the reference of the tree blob describing the contents in another tree.

At the end of the restic backup run we have a root tree that isn't referenced by any other tree, but contains all references to all top-level trees and therefore (indirectly) to all files and sub-dirs in the backup. As the last step, restic creates a new snapshot file which references the root tree.
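The chain of references described above can be pictured like this (the IDs and file names are made up for illustration; this is not restic's exact on-disk schema):

```shell
# Illustrative reference chain from a snapshot down to data blobs.
diagram=$(cat <<'EOF'
snapshot 26f25bf1
  -> tree 0b9e25fb            (root tree: top-level dir)
       file "clip.mov"        -> data blobs 91ac..., 3c0f...
       subdir "renders/"      -> tree d0d9386a
            file "final.mov"  -> data blobs 7f3e..., a210...
EOF
)
printf '%s\n' "$diagram"
```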

If you tell restic to forget a particular snapshot, the root tree is not referenced any more. restic prune detects this and removes the tree and all other unneeded tree and data blobs.

In general, a tree is only saved in the repository when all files and subdirs in it have been successfully saved. So as soon as a tree blob is there, we can assume that the data it references (including subdirs) is also there.

When restic is aborted during backup, there will be a bunch of tree blobs in the repo, together with the data in the files they reference. So for recovering the data, restic only needs to do the following:

First, download and decode all tree blobs in the repo, building a list of all trees and recording which other trees each one references.

Next, go through the list of trees again and throw away all the ones we have seen references for. The remaining trees are the root trees, which means either trees that are (or have been) directly referenced by a snapshot, or which are "dangling" as the result of an aborted run of restic backup.

As the last step, create a new tree which lists all the root trees, save it to the repo, then create a new snapshot which references this new tree.

You can then just use this new snapshot normally, except for the cryptic names of the directories (which are just the short tree IDs of the root trees we found).
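The root-finding step can be sketched with ordinary shell tools. Given a toy file of "parent child" edges between tree IDs (the IDs below are made up), the roots are exactly the IDs that never appear in the child column:

```shell
# Toy model of recover's root detection: list "parent child" tree
# references, then keep every tree ID never referenced as a child.
cat > /tmp/tree-refs.txt <<'EOF'
t1 t2
t1 t3
t3 t6
t4 t5
EOF
awk '{print $1; print $2}' /tmp/tree-refs.txt | sort -u > /tmp/all.txt
awk '{print $2}' /tmp/tree-refs.txt | sort -u > /tmp/referenced.txt
# comm -23 keeps lines only present in the first (sorted) file
roots=$(comm -23 /tmp/all.txt /tmp/referenced.txt)
echo "roots:" $roots
```

Here t1 and t4 come out as roots, matching the "found 2 roots" style of output above.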

Before merging this to master, I think we should do the following:

pauletg commented 6 years ago

Small update on this: I started a rebuild-index before I left last Thursday. That died before I got back with a read: connection reset by peer. I restarted it yesterday with a higher number of parallel connections to B2 and it seems to be running fine. It is only at 5% now, but I expect it to take a while. The B2 bucket has about 90TB in it and the directories I was backing up probably had about 110-120TB in them.

I am honestly very impressed that restic stayed so stable during the upload. I tried CloudBerry for Mac before trying restic and I wasn't able to get it to work with that much data. I use restic to back up my laptop at home and I love it, so I thought I would give it a shot for this. Since I haven't even finished my initial upload, I have no idea how something like a prune will go, but I will be happy to keep you updated if you need data on how restic behaves with large volumes of data. If I can get it to complete all of the operations required to maintain a weekly backup in under a week, I think it will be a great candidate for handling these backups.

For the time being I have a few questions: Should I let this rebuild-index run to completion before I try a recover? Will I lose anything if I don't? I have been thinking about this, and I would like to recover as much as possible on the first go since things take a while to run on this much data, but if it is better to kill this, run recover first, and rebuild-index later, I can do that. Will running rebuild-index or recover with a --quiet flag speed things up like it does with the backup command?

fd0 commented 6 years ago

Ok, cool! I would recommend doing the following:

If you'd like to try, you can then run rebuild-index again and scrape the remaining megabytes of data from the repo. Probably that'll be less than a few hundred megabytes, and it's likely that this won't reveal any new data not yet contained in the snapshot. But you can try it :)

While rebuild-index is running, you can't access the repository.

Will running rebuild-index or recover with a --quiet flag speed things up like it does with the backup command?

Nope.

pauletg commented 6 years ago

I let things run overnight and it appears to have filled the hard drive and failed:

found 755 roots
Fatal: unable to save new tree to the repo: fs.TempFile: open /var/folders/tq/67qp8py137n_5nzf563qlylr0000gn/T/restic-temp-pack-913168611: no space left on device

Is there an easy way to tell how much data this operation will have to download or a way to reduce how much it downloads?

pauletg commented 6 years ago

Also, if I want to free up this disk space is restic cache --cleanup the way to do it?

fd0 commented 6 years ago

No, that only removes cache directories which haven't been used for 30 days. Just remove the cache directory, which should be somewhere in your home directory.
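For reference, restic's default cache location depends on the OS; something like this finds and sizes it (the paths are restic's documented defaults, and --cache-dir would override both; the actual rm is left commented out on purpose):

```shell
# Locate restic's metadata cache using the documented default paths.
cache="${XDG_CACHE_HOME:-$HOME/.cache}/restic"   # Linux default
if [ "$(uname)" = "Darwin" ]; then
  cache="$HOME/Library/Caches/restic"            # macOS default
fi
du -sh "$cache" 2>/dev/null || echo "no cache directory at $cache"
# rm -rf "$cache"   # delete it once nothing else is using the repo
```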

Which command did you run exactly? Both rebuild-index and recover should not save much data on the hard disk, except for the metadata cache...

pauletg commented 6 years ago

Not at my desk at the moment, but it was something like: ./restic -o b2.connections=x -r b2:mybucket:/ recover I think I had x set to something huge; that may have been part of the problem. I can restart it without the -o b2.connections=x bit. I found the cache directory and deleted that.

Nican commented 6 years ago

First of all, awesome tool. I also need to back up terabytes of data over a slow-ish connection, with a real chance of the backup failing before it completes. Is there a recommended way to back up only a few files at a time?

fd0 commented 6 years ago

Is there a recommended way to backup only a few files at a time?

What usually works (I've heard) is saving individual parts of the source data (e.g. single directories) and when that is complete saving all directories together. When the source data hasn't changed, restic should upload almost nothing due to the builtin deduplication.
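That staged approach can be sketched like this (a toy helper; the repo name and paths are placeholders, and restic itself is only printed, never invoked, so the plan can be reviewed before committing hours of upload time):

```shell
# plan_backup prints one restic backup command per directory, then a final
# pass over the parent directory; dedup makes that last run nearly free.
plan_backup() {
  repo="$1"; shift
  for dir in "$@"; do
    printf 'restic -r %s backup %s\n' "$repo" "$dir"
  done
  printf 'restic -r %s backup %s\n' "$repo" "$(dirname "$1")"
}
plan_backup "b2:MY_BUCKET_NAME:/" /data/video /data/photos
```

Piping the output to sh would actually run the plan.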

A better place for such questions would be the forum at https://forum.restic.net; the questions (and answers!) are much more discoverable there.

fd0 commented 6 years ago

@pauletg so, how did it go?

fd0 commented 6 years ago

I've proposed the new command recover in #2056.

pauletg commented 6 years ago

I haven't made much headway on the restic recover since my last post. The good news is that we managed to revive the server and the data wasn't harmed in the crash, so I have my data. The recover command kept filling the HD of my machine before it could complete. This could have been caused by several factors: my backup was huge, I was using a large number of connections to B2 for the upload, and the HD on the machine I was using for the recover was relatively small. I am sure it would work great if my backup were a more reasonable size. Let me know if there is any more information that would be helpful to you. I really appreciate you working on this, and having this feature available for my laptop backups is really nice.

fd0 commented 6 years ago

Thanks for the feedback! If you like (and have a lot of time), you could retry this with --no-cache, but that'll take even longer. I'll close this issue when #2056 is merged.

Please let us know if you have additional feedback! :)