restic / restic

Fast, secure, efficient backup program
https://restic.net
BSD 2-Clause "Simplified" License

restic prune: `Fatal: index is not complete` #3999

Closed · lbausch closed this issue 1 year ago

lbausch commented 1 year ago

Output of restic version

restic 0.14.0-dev (compiled manually) compiled with go1.18 on linux/amd64

How did you run restic exactly?

$ restic --cacert /path/to/file.pem -v prune --max-unused 0
repository f72c6108 opened (version 2, compression level auto)
loading indexes...
loading all snapshots...
finding data that is still in use for 18 snapshots
[11:26] 100.00%  18 / 18 snapshots
searching used packs...
{<data/[null]> <data/00000000>} not found in the index

Integrity check failed: Data seems to be missing.
Will not start prune to prevent (additional) data loss!
Please report this error (along with the output of the 'prune' run) at
https://github.com/restic/restic/issues/new/choose
Fatal: index is not complete

Environment variables:

RESTIC_REPOSITORY=s3:https://<redacted>
RESTIC_PASSWORD=<redacted>
AWS_ACCESS_KEY_ID=<redacted>
AWS_SECRET_ACCESS_KEY=<redacted>

What backend/server/service did you use to store the repository?

S3

Expected behavior

prune finishes successfully

Actual behavior

prune aborts with an error

Steps to reproduce the behavior

As of now, I'm not aware of a way to consistently reproduce this issue.

Do you have any idea what may have caused this?

Previously, a backup run was terminated by the OOM killer because the machine ran out of RAM.

Do you have an idea how to solve the issue?

There are similar bug reports for this kind of issue, but I'm not sure whether the described solutions (e.g. removing the affected snapshot) apply here as well - hence this issue. :)

Did restic help you today? Did it make you happy in any way?

restic is just awesome πŸ₯³

rawtaz commented 1 year ago

Can you please run a `check --read-data` to verify the integrity of the entire repository?

lbausch commented 1 year ago

The repository is about 5 TiB in size, so this took some time - but this is the output:

restic check output:

```
$ restic --cacert /path/to/file.pem -v check --read-data
using temporary cache in /tmp/restic-check-cache-1594464398
repository f72c6108 opened (version 2, compression level auto)
created new cache in /tmp/restic-check-cache-1594464398
create exclusive lock for repository
load indexes
check all packs
pack 3662be081b3bcf9c9a0140edcd5b721c3eee52b83e28c9abe3cbda26b69460a7: not referenced in any index
pack 41114b4b6ceb5877b406bae7a34c8c8eb82b262f3bd76f3b3e0fcc9b9c31b43e: not referenced in any index
pack 8661277be19f09a0852fad1cce3f08a5144214660e7e3e9f06d145392a5ef877: not referenced in any index
pack 72cdbd220eaeb37aea39e47c3c47023b254e6ace7f3aa45b2d00df3df075693c: not referenced in any index
pack 9d84915bc1bb10d97d08f4974c1244896fc42acbc4830632e14348acdb4b7d63: not referenced in any index
pack 07ca3bf0b5747cfa3384e2c274262fc31ed67e30546aeed6c9c9f14d9df0d240: not referenced in any index
6 additional files were found in the repo, which likely contain duplicate data.
This is non-critical, you can run `restic prune` to correct this.
check snapshots, trees and blobs
error for tree adfa7137:pshots
tree adfa713766d3155461a2eaaa9908e5c29b1bfa8c1bcd3dc73e6511974eecd00d: file "" blob 4862 has null ID
error for tree f90cdfd2:pshots
tree f90cdfd2cca452a7d88218ac2c1319ed1baa111a85aa3466159b08d94c7e3034: file "" blob 0 has null ID
error for tree 3189911b:napshots
tree 3189911b41901f4b9079b6ed0c96aa0bd8b55ba50e1fe9a7b5a4ec4a45f4dfd8: file "" blob 00000000000000000000000000000000999e41f322e7ea9e7136f52e96870959 not found in index
error for tree 60f48a47:napshots
tree 60f48a473dd1055fffc291666651011c4f89416445a73e2f6598752b5c0b2751: file "" blob 2 has null ID
tree 60f48a473dd1055fffc291666651011c4f89416445a73e2f6598752b5c0b2751: file "" blob 2 has null ID
error for tree 2358e816:napshots
tree 2358e816c3485b3abb7714cf857112998b6cf37f14d579c12f3090b03d04776a: file "" blob 0 has null ID
error for tree 7ca9f766:napshots
tree 7ca9f76682d17c6179c71c3ffc3ebfc7aedb23c3e6127fc87f6bb5798ef8c131: file "" blob 252 has null ID
tree 7ca9f76682d17c6179c71c3ffc3ebfc7aedb23c3e6127fc87f6bb5798ef8c131: file "" blob 255 has null ID
[2:48] 100.00% 18 / 18 snapshots
read all data
[9:55:23] 100.00% 560896 / 560896 packs
Fatal: repository contains errors
```

I had to redact some filenames, though.

MichaelEischer commented 1 year ago

How does this issue relate to https://forum.restic.net/t/restic-check-dealing-with-error-for-tree-blob-has-null-id/5608 ? How often do you run check? When did the first error show up? Do you have the error output (if any) of restic when it was oom-killed?

Please use `restic find --tree adfa713766d3155461a2eaaa9908e5c29b1bfa8c1bcd3dc73e6511974eecd00d f90cdfd2cca452a7d88218ac2c1319ed1baa111a85aa3466159b08d94c7e3034 3189911b41901f4b9079b6ed0c96aa0bd8b55ba50e1fe9a7b5a4ec4a45f4dfd8 60f48a473dd1055fffc291666651011c4f89416445a73e2f6598752b5c0b2751 2358e816c3485b3abb7714cf857112998b6cf37f14d579c12f3090b03d04776a 7ca9f76682d17c6179c71c3ffc3ebfc7aedb23c3e6127fc87f6bb5798ef8c131` to find out which snapshots are affected. When were these snapshots created? Do you know which restic version you used at the time?

I wonder whether this could be related to https://github.com/restic/restic/pull/3955, which changes code in a place that could result in the error reported by check. However, the basic accounting there appears to be correct, so I don't see how it could produce this result.

MichaelEischer commented 1 year ago

Could you also upgrade your Go version? go1.18 has received several bug fixes by now. Or just switch to Go 1.19.
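
For reference, one way to produce such a build (a sketch only; it assumes Go 1.19 is installed and on the PATH, and uses the standard `cmd/restic` entry point of a source checkout):

```
$ git clone https://github.com/restic/restic && cd restic
$ go version              # should report go1.19.x
$ go build ./cmd/restic   # builds a ./restic binary with the local toolchain
$ ./restic version
```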

lbausch commented 1 year ago

How does this issue relate to https://forum.restic.net/t/restic-check-dealing-with-error-for-tree-blob-has-null-id/5608 ?

This is another repository where I encountered this kind of error message, but while running restic check. I just found out that restic prune aborts with the same error message, so it's probably the same underlying error. Sorry for bringing up the same thing twice in different places.

How often do you run check?

We run restic check once a week.

When did the first error show up?

After the OOM-killed backup, we upgraded restic to the latest then-available beta build to see if it would use less memory. I just found the following messages in the log of the first backup run with the beta build:

...
error: parts of <redacted> not found in the repository index; storing the file again
error: parts of <redacted> not found in the repository index; storing the file again
error: parts of <redacted> not found in the repository index; storing the file again
...

I had to redact the filenames again, but they exactly match the ones restic check complains about. Although I'm not 100% sure, it looks like the errors showed up once the beta build was installed.

Do you have the error output (if any) of restic when it was oom-killed?

Unfortunately there's not much to see; restic just received SIGKILL. The last message in our log is `start backup on [...]`

Output of restic find --tree:

```
restic --cacert /path/to/file.pem find --tree adfa713766d3155461a2eaaa9908e5c29b1bfa8c1bcd3dc73e6511974eecd00d f90cdfd2cca452a7d88218ac2c1319ed1baa111a85aa3466159b08d94c7e3034 3189911b41901f4b9079b6ed0c96aa0bd8b55ba50e1fe9a7b5a4ec4a45f4dfd8 60f48a473dd1055fffc291666651011c4f89416445a73e2f6598752b5c0b2751 2358e816c3485b3abb7714cf857112998b6cf37f14d579c12f3090b03d04776a 7ca9f76682d17c6179c71c3ffc3ebfc7aedb23c3e6127fc87f6bb5798ef8c131
repository f72c6108 opened (version 2, compression level auto)
created new cache in /root/.cache/restic
Found tree 60f48a473dd1055fffc291666651011c4f89416445a73e2f6598752b5c0b2751
 ... path ...
 in snapshot a4e98c7d (2022-11-02 08:32:46)
Found tree 3189911b41901f4b9079b6ed0c96aa0bd8b55ba50e1fe9a7b5a4ec4a45f4dfd8
 ... path ...
 in snapshot a4e98c7d (2022-11-02 08:32:46)
Found tree 7ca9f76682d17c6179c71c3ffc3ebfc7aedb23c3e6127fc87f6bb5798ef8c131
 ... path ...
 in snapshot a4e98c7d (2022-11-02 08:32:46)
Found tree 2358e816c3485b3abb7714cf857112998b6cf37f14d579c12f3090b03d04776a
 ... path ...
 in snapshot a4e98c7d (2022-11-02 08:32:46)
Found tree f90cdfd2cca452a7d88218ac2c1319ed1baa111a85aa3466159b08d94c7e3034
 ... path ...
 in snapshot c4ff1732 (2022-11-02 19:31:03)
Found tree adfa713766d3155461a2eaaa9908e5c29b1bfa8c1bcd3dc73e6511974eecd00d
 ... path ...
 in snapshot c4ff1732 (2022-11-02 19:31:03)
```

When were these snapshots created? Do you know which restic version you used at the time?

All snapshots before c4ff1732 were created with the stable release 0.14.0. After the OOM occurred, I installed restic_v0.14.0-235-g92df039e5_linux_amd64. Snapshot c4ff1732 was created on 2022-11-02 19:31:03.

In the meantime I have a few other machines where restic check also brings up the `blob has null ID` message. All of them run a restic beta release obtained from beta.restic.net, but I'm not jumping to conclusions yet. :)

Could you also upgrade your go version?

The binaries were obtained from beta.restic.net, which uses Go 1.18 πŸ€” Should I build a binary myself using Go 1.19?

rawtaz commented 1 year ago

Can't this be a hardware issue, bad memory etc?

lbausch commented 1 year ago

Can't this be a hardware issue, bad memory etc?

I'm not ruling this out, but the machines in question are VMs running on different hypervisors, so probably not (I guess). So far there's a fairly strong correlation between those messages and running the beta build πŸ€”

MichaelEischer commented 1 year ago

All snapshots before c4ff1732 were created with the stable release 0.14.0. After the OOM occurred, I installed restic_v0.14.0-235-g92df039e5_linux_amd64. Snapshot c4ff1732 was created on 2022-11-02 19:31:03.

That is, a4e98c7d was created using restic 0.14.0? Judging from the warnings during the first beta backup run (`error: parts of <redacted> not found in the repository index; storing the file again`), this means that 0.14.0 is also affected by the null blob ID problem, as that error is triggered by null blob IDs.

The symptoms look a bit like some kind of memory corruption, although that doesn't really help with finding the cause of the problem.

The binaries were obtained from beta.restic.net, which uses Go 1.18 πŸ€” Should I build a binary myself using Go 1.19?

I guess we should upgrade that ^^. Yes, please try a build using Go 1.19. I've also just merged #4010, which should cause backups to panic when they would create files using the null blob ID.
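
For illustration only (not the actual change in #4010), a guard of this shape turns a silently stored null blob ID into an immediate panic; the `ID` type and `completeFile` function below are hypothetical stand-ins:

```go
package main

// Hypothetical stand-ins for illustration; not restic's real types,
// and not the code added in #4010.
type ID [32]byte

func isNull(id ID) bool { return id == ID{} }

// completeFile stands for the step that finalizes a file node once all of
// its blobs have been saved. Failing loudly here surfaces the corruption
// immediately instead of silently writing a tree that references a null ID.
func completeFile(blobIDs []ID) {
	for _, id := range blobIDs {
		if isNull(id) {
			panic("completed file with null ID")
		}
	}
	// ... store the file node with its now-complete content list ...
}

func main() {
	completeFile([]ID{{1}, {}}) // the zero-valued second entry triggers the panic
}
```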

lbausch commented 1 year ago

After double-checking our logs I stand corrected: a4e98c7d was in fact already created by the then-available beta build of 0.14.0. I've now built restic using restic/builder (which has Go 1.19) and started another backup run. Regarding the failing prune: What's the best way of handling this, forgetting all the snapshots reported by `restic find --tree`?

MichaelEischer commented 1 year ago

After double-checking our logs I stand corrected: a4e98c7d was in fact already created by the then-available beta build of 0.14.0.

That's good to know, as it at least slightly limits the possible regression window.

Regarding the failing prune: What's the best way of handling this, forgetting all the snapshots reported by `restic find --tree`?

Yes, it's either forgetting the damaged snapshots or rewriting them using the repair PR.
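
For reference, a sketch of the "forget" route, assuming the affected snapshots are a4e98c7d and c4ff1732 as reported by `restic find --tree` above (double-check the IDs against your own output before running this):

```
$ restic --cacert /path/to/file.pem forget a4e98c7d c4ff1732
$ restic --cacert /path/to/file.pem prune --max-unused 0
```

Once the damaged snapshots are forgotten, prune no longer traverses the broken trees and should run through.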

lbausch commented 1 year ago

OK, so while the first backup run succeeded, this one failed:

open repository
lock repository
using parent snapshot e80302d1
load index files
start scan on [/ /foo /bar]
start backup on [/ /foo /bar]
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: lstat <redacted>: no such file or directory
error: NodeFromFileInfo: xattr.list <redacted>: no such file or directory
scan: lstat <redacted>: no such file or directory
scan: lstat <redacted>: no such file or directory
scan finished in 11090.200s: 18448691 files, 3.583 TiB
panic: completed file with null ID

goroutine 144 [running]:
github.com/restic/restic/internal/archiver.(*FileSaver).saveFile.func1()
    /restic/internal/archiver/file_saver.go:133 +0x2ad
github.com/restic/restic/internal/archiver.(*FileSaver).saveFile.func3({{0x85, 0xc7, 0x7, 0x13, 0xe3, 0x84, 0x39, 0x4b, 0x70, 0xc3, ...}, ...})
    /restic/internal/archiver/file_saver.go:215 +0xe9
github.com/restic/restic/internal/archiver.(*BlobSaver).worker(0x532f3c3037333533?, {0x11fcb50, 0xc000132cc0}, 0xc0451300c0)
    /restic/internal/archiver/blob_saver.go:100 +0x1ed
github.com/restic/restic/internal/archiver.NewBlobSaver.func1()
    /restic/internal/archiver/blob_saver.go:33 +0x29
golang.org/x/sync/errgroup.(*Group).Go.func1()
    /home/build/go/pkg/mod/golang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
    /home/build/go/pkg/mod/golang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:72 +0xa5

lbausch commented 1 year ago

another backup run, slightly different panic:

open repository
lock repository
using parent snapshot e80302d1
load index files
start scan on [/ /foo /bar]
start backup on [/ /foo /bar]
error: lstat /proc/1405818: no such file or directory
error: lstat /proc/1406250: no such file or directory
error: lstat /proc/1406261: no such file or directory
scan finished in 8489.563s: 18413327 files, 3.522 TiB
error: lstat <redacted>: no such file or directory
panic: completed file with null ID

goroutine 110 [running]:
github.com/restic/restic/internal/archiver.(*FileSaver).saveFile.func1()
        /restic/internal/archiver/file_saver.go:133 +0x2ad
github.com/restic/restic/internal/archiver.(*FileSaver).saveFile.func3({{0xfc, 0x4, 0x82, 0xe3, 0x68, 0xb1, 0xa, 0xa4, 0xc8, 0x19, ...}, ...})
        /restic/internal/archiver/file_saver.go:215 +0xe9
github.com/restic/restic/internal/archiver.(*BlobSaver).worker(0xc7bda32d5d2ec912?, {0x11fcb50, 0xc000ec42c0}, 0xc051aed3e0)
        /restic/internal/archiver/blob_saver.go:100 +0x1ed
github.com/restic/restic/internal/archiver.NewBlobSaver.func1()
        /restic/internal/archiver/blob_saver.go:33 +0x29
golang.org/x/sync/errgroup.(*Group).Go.func1()
        /home/build/go/pkg/mod/golang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
        /home/build/go/pkg/mod/golang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:72 +0xa5

MichaelEischer commented 1 year ago

Thanks for the tests. I think I've found the problem and fixed it in PR #4019.

Each file node contains a Content slice, which is grown incrementally inside the FileSaver. To grow the slice, the old content is copied to a new, larger array. Whenever a blob has been uploaded, its ID gets stored at the corresponding index in the Content slice. Both actions run concurrently and require synchronization to work correctly. However, only storing the blob ID was synchronized, not the growing of the slice.

Knowing the cause, both the (more frequent) null IDs and the partially overwritten ID 00000000000000000000000000000000999e41f322e7ea9e7136f52e96870959 (the ID was stored after its first half had already been copied) make perfect sense.
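
For illustration, a minimal Go sketch of the pattern described above (not restic's actual FileSaver code): once growing the slice and storing an ID take the same lock, a concurrently stored ID can no longer be lost while the entries are copied to a larger backing array.

```go
package main

import (
	"fmt"
	"sync"
)

// ID is a stand-in for a blob ID; the zero value plays the role of the "null ID".
type ID [4]byte

// content mimics a file node's Content slice: it is grown by one goroutine
// while worker goroutines store uploaded blob IDs at already-reserved indices.
type content struct {
	mu    sync.Mutex
	blobs []ID
}

// grow extends the slice to hold at least n entries. Copying the old entries
// into a new, larger backing array must happen under the same lock as store;
// otherwise a concurrent store can land in the old array and be lost,
// leaving a zero ("null") ID behind -- the race fixed in #4019.
func (c *content) grow(n int) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for len(c.blobs) < n {
		c.blobs = append(c.blobs, ID{})
	}
}

// store records an uploaded blob's ID at index i.
func (c *content) store(i int, id ID) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.blobs[i] = id
}

func main() {
	c := &content{}
	var wg sync.WaitGroup
	for i := 0; i < 1000; i++ {
		i := i
		c.grow(i + 1) // reserve slot i, then fill it in concurrently
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.store(i, ID{0xde, 0xad, 0xbe, 0xef})
		}()
	}
	wg.Wait()
	for i, id := range c.blobs {
		if id == (ID{}) {
			fmt.Println("null ID at index", i) // never printed with proper locking
		}
	}
}
```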

lbausch commented 1 year ago

Perfect, thank you very much πŸ‘πŸ» Building #4019 right now and will test a backup once it's finished.

lbausch commented 1 year ago

Backup runs succeeded without panic πŸ‘πŸ»