restic / rest-server

Rest Server is a high performance HTTP server that implements restic's REST backend API.
BSD 2-Clause "Simplified" License

Backup fails with 'file content does not match hash' #247

Closed adamklaff closed 10 months ago

adamklaff commented 1 year ago

Hi,

When I back up a repo to my rest-server running in Docker, it seems to always fail eventually. Originally I was just getting the error: "Fatal: unable to save snapshot: server response unexpected: 400 Bad Request (400)" but after running rest-server in debug mode I got some more info below.

Output of rest-server --version

0384c3e2599f

My client is 'restic 0.15.1 compiled with go1.19.5 on linux/amd64' running on Unraid.

How did you run rest-server exactly?

I am on a Synology, running with docker via:

docker run --rm -v /volume1/backup/:/data -p 8000:8000/tcp restic/rest-server rest-server --debug --path /data --htpasswd-file /data/.htpasswd

restic was invoked with 'restic backup --tag 'photos' -r rest:http://restic:password@host:port/backup --verbose=2 -p resticpw /mnt/user/Photos/'

What backend/server/service did you use to store the repository?

rest-server

Expected behavior

I expect the backup to complete.

Actual behavior

In the middle of the backup, the client returns 'Fatal: unable to save snapshot: server response unexpected: 400 Bad Request (400)'

The last few lines of debug output from the server are:

POST /backup/data/3b2299d24867b28976bed07e44e44f193063c2b8b3213f401b4338b4e9c7512f
saveBlob()
POST /backup/data/39b8c0b87a21ceb39dd4c2fdf15bd055ac082ba1fc8178e9d62e7694de7af1f2
saveBlob()
POST /backup/data/fe6262bfcfe58ad12d68c47e6066d4cc95211d08723c6fac0d67cef39afa64cf
saveBlob()
POST /backup/data/87bcc33ee91eee781b30520f48ca216cef58bf932419d28d75b1d9359309ff37
saveBlob()
file content does not match hash
DELETE /backup/data/fe6262bfcfe58ad12d68c47e6066d4cc95211d08723c6fac0d67cef39afa64cf
deleteBlob()
DELETE /backup/locks/443500ef7114495db1ef9c1c21e80a4286e2732d061c15e9a84bce946df9ed83
deleteBlob()

Steps to reproduce the behavior

Just run a backup. It happens on other folders as well.

Do you have any idea what may have caused this?

No, but some googling of the error revealed maybe NTP differences on different hosts could result in different hashes?

Do you have an idea how to solve the issue?

Not yet.

MichaelEischer commented 1 year ago

file content does not match hash

That error comes from an integrity check that ensures pack files uploaded by restic are not corrupted (it is totally unrelated to NTP). It can occur if the backup client encounters a bitflip, if the data is corrupted while being transferred to the rest-server, or if the rest-server encounters a bitflip while verifying the pack file. Bitflips usually happen in memory or in the CPU while data is being processed.
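To illustrate the check: restic names each pack file after the SHA-256 hash of its contents, so the receiving side can recompute the hash of the bytes it got and compare it with the name in the request path. The following is only a minimal Python sketch of that idea, not rest-server's actual code (which is written in Go); the function name content_matches_name and the sample bytes are made up for illustration.

```python
import hashlib


def content_matches_name(name_hex: str, content: bytes) -> bool:
    """Return True if the SHA-256 of content equals the hex name it was uploaded under."""
    return hashlib.sha256(content).hexdigest() == name_hex.lower()


# Hypothetical pack file contents; a real pack file is encrypted binary data.
pack = b"example pack file bytes"
name = hashlib.sha256(pack).hexdigest()

# Simulate a single bitflip in the first byte.
corrupted = bytearray(pack)
corrupted[0] ^= 0x01
corrupted = bytes(corrupted)

print(content_matches_name(name, pack))       # intact upload
print(content_matches_name(name, corrupted))  # upload with one flipped bit
```

A single flipped bit anywhere in the data is enough to make the comparison fail, which is why the error shows up no matter where along the path the corruption happened.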

the client returns 'Fatal: unable to save snapshot: server response unexpected: 400 Bad Request (400)'

Could you provide the log output from the client, in particular the 'retrying after' lines?

Judging from the debug log, I'd expect the client to have retried the failed upload multiple times, however, without the client logs this is just a guess.

Assuming my guess is correct, this means that the bitflip likely occurred on the client, which then tried to upload the same corrupted data multiple times. If this is just a one-time event, you can probably ignore it. However, you must run restic check --read-data to ensure that the repository has not been corrupted.

adamklaff commented 1 year ago

Thanks, that makes sense. I wonder if I need to do some other integrity checks on the client. This happens with almost every backup attempt.

Here's output from the client during a failure:

Save(<data/992e4e7839>) returned error, retrying after 552.330144ms: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 1.080381816s: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 1.31013006s: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 1.582392691s: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 2.340488664s: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 4.506218855s: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 3.221479586s: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 5.608623477s: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 7.649837917s: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 15.394871241s: server response unexpected: 400 Bad Request (400)
Fatal: unable to save snapshot: server response unexpected: 400 Bad Request (400)

Here is the server output:

POST /backup/data/eacb1db8d6fb6fc83ab30e2e4e1e369d29a53970312465a8103e7b72903f5364
saveBlob()
POST /backup/data/992e4e783954618e70d415aa8e7420ccda42f11c29898bd9200c1020d4c97d0c
saveBlob()
POST /backup/data/2c8e4b6e70a82a10b00bc7cf8edec82957a141c4a47a9ce79a3d9c96633e7ed3
saveBlob()
POST /backup/data/a8f6daa5def316dec67a9c0a2b9b534f0394308b5e1ef92991b35f891f7206f7
saveBlob()
POST /backup/data/a2e0f1e183c4815568ff29ff8d9128f87529bfadbf1d45b6c1a99453c41f41cd
saveBlob()
file content does not match hash
DELETE /backup/data/992e4e783954618e70d415aa8e7420ccda42f11c29898bd9200c1020d4c97d0c
deleteBlob()
DELETE /backup/locks/6ccd40e283d859deef7e2412075624db975bf40d589f8777da9775cf478d7f7d
deleteBlob()
unexpected EOF
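For anyone comparing the two logs by hand, the matching can be sketched in a few lines of Python. This is only an illustration that assumes the log formats shown in this thread: the client prints a shortened ID in its 'retrying after' lines and the server prints the full ID in its DELETE lines. The function names and regular expressions are my own, and the sample logs are shortened excerpts of the output above.

```python
import re
from collections import Counter

# Shortened excerpts of the client and server output posted in this thread.
CLIENT_LOG = """\
Save(<data/992e4e7839>) returned error, retrying after 552.330144ms: server response unexpected: 400 Bad Request (400)
Save(<data/992e4e7839>) returned error, retrying after 1.080381816s: server response unexpected: 400 Bad Request (400)
Fatal: unable to save snapshot: server response unexpected: 400 Bad Request (400)
"""

SERVER_LOG = """\
POST /backup/data/992e4e783954618e70d415aa8e7420ccda42f11c29898bd9200c1020d4c97d0c
file content does not match hash
DELETE /backup/data/992e4e783954618e70d415aa8e7420ccda42f11c29898bd9200c1020d4c97d0c
DELETE /backup/locks/6ccd40e283d859deef7e2412075624db975bf40d589f8777da9775cf478d7f7d
"""

# Assumed patterns, derived only from the log lines quoted above.
RETRY_RE = re.compile(r"Save\(<data/([0-9a-f]+)>\) returned error, retrying after")
DELETE_RE = re.compile(r"DELETE \S*/data/([0-9a-f]{64})")


def correlate(client_log: str, server_log: str) -> dict:
    """Map each retried short ID to (retry count, full IDs the server deleted)."""
    retries = Counter(RETRY_RE.findall(client_log))
    deleted = set(DELETE_RE.findall(server_log))
    return {
        short_id: (count, sorted(d for d in deleted if d.startswith(short_id)))
        for short_id, count in retries.items()
    }


for short_id, (count, full_ids) in correlate(CLIENT_LOG, SERVER_LOG).items():
    print(short_id, "retried", count, "times; deleted on server:", full_ids)
```

If the same ID keeps being retried and is rejected by the server each time, that fits the guess that the data was already corrupted on the client before the upload.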

MichaelEischer commented 1 year ago

Please run some stress tests like memtest86 / prime95 to make sure the system is working correctly. Data corruption that frequent usually indicates a hardware problem.
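As a very crude first look before running those tools, one could hash the same file several times and check that the result never changes. The sketch below is only an assumption about what such a quick check might look like; the function names are hypothetical, and it is in no way a substitute for memtest86 or prime95, since a passing run proves nothing about the hardware.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_is_stable(path: str, rounds: int = 5) -> bool:
    """Return True if hashing the unchanged file gives the same digest every round."""
    digests = {sha256_of_file(path) for _ in range(rounds)}
    return len(digests) == 1
```

Calling hash_is_stable on a large file that is not being modified should always return True on healthy hardware; a False result would be a strong hint that data is being corrupted somewhere between disk, memory and CPU.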

MichaelEischer commented 10 months ago

Closing as I don't think there's anything we can do here on the rest-server side.