restic / restic

Fast, secure, efficient backup program
https://restic.net
BSD 2-Clause "Simplified" License

Write: sftp: "Failure" (SSH_FX_FAILURE) when the remote storage is full #3336

Closed Krumbelfix closed 1 year ago

Krumbelfix commented 3 years ago

Output of restic version

restic 0.12.0 compiled with go1.15.8 on linux/amd64

How did you run restic exactly?

restic -r sftp:root@<ip-address-snipped>:/mnt/backup-docker/ --verbose prune --password-file=/root/.restic/backup-docker.pwd

What backend/server/service did you use to store the repository?

Scaleway root-server via SFTP

Expected behavior

Cleaning up / pruning snapshots so that backups work again.

Actual behavior

restic -r sftp:root@<ip-address-snipped>:/mnt/backup-docker/ --verbose check --password-file=/root/.restic/backup-docker.pwd

```
pack f399332c: not referenced in any index
[...] a lot of not referenced index files [...]
pack dfe443af: not referenced in any index
1585 additional files were found in the repo, which likely contain duplicate data.
You can run `restic prune` to correct this.
```

restic -r sftp:root@<ip-address-snipped>:/mnt/backup-docker/ --verbose prune --password-file=/root/.restic/backup-docker.pwd

```
repository 92246e2a opened successfully, password is correct
counting files in repo
building new index for repo
[8:50] 100.00%  63722 / 63722 packs
incomplete pack file (will be removed): 17a852a13c08e04fd541e43a593bbe94feba8b77768953a405ccedacfc1648bb
incomplete pack file (will be removed): cf8895aa47d5e19072efeb07f57971bdb9ad929dda3af381d50bd1c14900a44c
[...] a lot of incomplete pack files [...]
repository contains 63681 packs (1347596 blobs) with 294.325 GiB
processed 1347596 blobs: 105577 duplicate blobs, 87.518 GiB duplicate
load all snapshots
find data that is still in use for 202 snapshots
[3:11] 18.81%  38 / 202 snapshots
[3:13] 19.31%  39 / 202 snapshots
[3:51] 60.89%  123 / 202 snapshots
[4:02] 96.04%  194 / 202 snapshots
[4:04] 100.00%  202 / 202 snapshots
found 934858 of 1347596 data blobs still in use, removing 412738 blobs
will remove 41 invalid files
will delete 35672 packs and rewrite 346 packs, this frees 255.849 GiB
Save() returned error, retrying after 598.193345ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 850.463688ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 1.648746395s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 1.81251255s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 3.385001074s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 2.114282783s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 7.109876912s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 5.809616247s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 17.617471184s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 12.153765673s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Write: sftp: "Failure" (SSH_FX_FAILURE)
github.com/restic/restic/internal/backend/sftp.(*SFTP).Save
	github.com/restic/restic/internal/backend/sftp/sftp.go:326
github.com/restic/restic/internal/limiter.rateLimitedBackend.Save
	github.com/restic/restic/internal/limiter/limiter_backend.go:30
github.com/restic/restic/internal/backend.(*RetryBackend).Save.func1
	github.com/restic/restic/internal/backend/backend_retry.go:56
github.com/cenkalti/backoff.RetryNotify
	github.com/cenkalti/backoff/retry.go:37
github.com/restic/restic/internal/backend.(*RetryBackend).retry
	github.com/restic/restic/internal/backend/backend_retry.go:36
github.com/restic/restic/internal/backend.(*RetryBackend).Save
	github.com/restic/restic/internal/backend/backend_retry.go:50
github.com/restic/restic/internal/cache.(*Backend).Save
	github.com/restic/restic/internal/cache/backend.go:54
github.com/restic/restic/internal/repository.(*Repository).savePacker
	github.com/restic/restic/internal/repository/packer_manager.go:106
github.com/restic/restic/internal/repository.(*Repository).SaveAndEncrypt
	github.com/restic/restic/internal/repository/repository.go:270
github.com/restic/restic/internal/repository.(*Repository).SaveBlob
	github.com/restic/restic/internal/repository/repository.go:704
github.com/restic/restic/internal/repository.Repack
	github.com/restic/restic/internal/repository/repack.go:87
main.pruneRepository
	github.com/restic/restic/cmd/restic/cmd_prune.go:278
main.runPrune
	github.com/restic/restic/cmd/restic/cmd_prune.go:85
main.glob..func18
	github.com/restic/restic/cmd/restic/cmd_prune.go:25
github.com/spf13/cobra.(*Command).execute
	github.com/spf13/cobra/command.go:762
github.com/spf13/cobra.(*Command).ExecuteC
	github.com/spf13/cobra/command.go:850
github.com/spf13/cobra.(*Command).Execute
	github.com/spf13/cobra/command.go:800
main.main
	github.com/restic/restic/cmd/restic/main.go:86
runtime.main
	/usr/lib/go-1.12/src/runtime/proc.go:200
runtime.goexit
	/usr/lib/go-1.12/src/runtime/asm_amd64.s:1337
```

restic -r sftp:root@<ip-address-snipped>:/mnt/backup-docker/ --verbose backup /var/lib/docker/ --password-file=/root/.restic/backup-docker.pwd

```
open repository
repository 92246e2a opened successfully, password is correct
lock repository
load index files
using parent snapshot f1a93c2b
start scan on [/var/lib/docker/]
start backup on [/var/lib/docker/]
Save() returned error, retrying after 437.171317ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 423.030259ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 412.66944ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 729.638889ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 554.948931ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 1.091541757s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 1.512941144s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 725.284281ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 1.546249696s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 2.217300829s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 442.45257ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 2.492482153s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 752.170907ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 1.567900755s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 866.326182ms: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 2.74188537s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 1.140151404s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 3.288696345s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 2.406437136s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 1.664638098s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 2.51257952s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 4.584168338s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 4.291992227s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 4.867318033s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 5.731181313s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 3.61222373s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 6.138326764s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 4.545179123s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 7.009085549s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 10.558950745s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 7.730524827s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 10.06168457s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 17.136020774s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 18.862930378s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 11.0497682s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 11.976267608s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 20.965183681s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 19.151448088s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 17.99834434s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Save() returned error, retrying after 18.883270334s: Write: sftp: "Failure" (SSH_FX_FAILURE)
Fatal: unable to save snapshot: Write: sftp: "Failure" (SSH_FX_FAILURE)
```

Steps to reproduce the behavior

Unfortunately, I can't provide any useful information :(

Do you have any idea what may have caused this?

no :(

Do you have an idea how to solve the issue?

And an additional "no" :)

Did restic help you today? Did it make you happy in any way?

I really, really love restic and its design principles. It's great software that helps me a lot. <3

Krumbelfix commented 3 years ago

I just found the solution: the mount was full.

After increasing the storage capacity I can prune the backup again.

On the backup server: `lvextend --resizefs -L 310G /dev/backup/lv-docker-backup`

```
Size of logical volume backup/lv-docker-backup changed from 300.00 GiB (76800 extents) to 310.00 GiB (79360 extents).
Logical volume backup/lv-docker-backup successfully resized.
resize2fs 1.44.1 (24-Mar-2018)
Filesystem at /dev/mapper/backup-lv--docker--backup is mounted on /mnt/backup-docker; on-line resizing required
old_desc_blocks = 38, new_desc_blocks = 39
The filesystem on /dev/mapper/backup-lv--docker--backup is now 81264640 (4k) blocks long.
```

restic -r sftp:root@<snipped-ip-address>:/mnt/backup-docker/ --verbose prune --password-file=/root/.restic/backup-docker.pwd

```
counting files in repo
building new index for repo
[9:13] 100.00%  63722 / 63722 packs
incomplete pack file (will be removed): 17a852a13c08e04fd541e43a593bbe94feba8b77768953a405ccedacfc1648bb
incomplete pack file (will be removed): 47b15b27789be8c5f2c063a285d79398123d2172e1a93ef9d99ab2c7b03670c2
[ more incomplete packs..... ]
incomplete pack file (will be removed): 6ce66ef6e87d07f31795443a4b7e4d807ae61199ce23ea1b0e9b02d4af81c755
incomplete pack file (will be removed): 6c0912ff422b9c5afb918e989465b7bf9aeaa360885b53bd872ec60c3554f132
repository contains 63681 packs (1347596 blobs) with 294.325 GiB
processed 1347596 blobs: 105577 duplicate blobs, 87.518 GiB duplicate
load all snapshots
find data that is still in use for 202 snapshots
[4:02] 100.00%  202 / 202 snapshots
found 934858 of 1347596 data blobs still in use, removing 412738 blobs
will remove 41 invalid files
will delete 35672 packs and rewrite 346 packs, this frees 255.849 GiB
[28:55] 82.08%  284 / 346 packs rewritten
[33:59] 100.00%  346 / 346 packs rewritten
counting files in repo
[3:42] 84.09%  23455 / 27893 packs
[3:43] 84.27%  23506 / 27893 packs
[4:04] 92.13%  25697 / 27893 packs
[4:24] 100.00%  27893 / 27893 packs
finding old index files
saved new indexes as [3af455a8 4dc3b8a7 a70e17bc 5e6b7c82 8fe56c5b c77c745e c729aeb0 5aa4a61f fd6af3b6 c4b9fbe2]
remove 418 old index files
[0:06] 0.73%  263 / 36018 packs deleted
[4:52] 34.83%  12544 / 36018 packs deleted
[11:30] 83.16%  29953 / 36018 packs deleted
[13:11] 95.55%  34415 / 36018 packs deleted
[13:47] 100.00%  36018 / 36018 packs deleted
done
```

As an improvement: a warning or hint when the storage is (nearly) full would be very useful.

MichaelEischer commented 3 years ago

Hmm, the generic error message really doesn't help. I thought there had been an older issue with the same problem, but I couldn't find it. To provide a useful error message, we'd need to find a way to detect what the underlying problem is; "Failure" is way too generic.

greatroar commented 3 years ago

The problem is in OpenSSH's sftp-server program, which sends a generic failure code for ENOSPC with no way to retrieve the cause, even though the protocol has a code reserved for this case. restic could do a StatVFS before writing a file, as a heuristic.

ajsicek commented 1 year ago

Sorry if I misunderstood how this was closed. I ran into the same error message today; by googling I got here, and indeed the mount was full. A better error message would be helpful :)

greatroar commented 1 year ago

@ajsicek restic 0.12.1 got a best-effort fix for this issue. If you still get it from a newer restic version, please report that version and information about the SSH implementation on your server. Some just send very generic error codes that restic can't figure out.

ajsicek commented 1 year ago

Hi, I'm not sure exactly what info is relevant. Both are Debian stable servers and most of the software is the same:

```
Debian 5.10.149-2 (2022-10-21) x86_64 GNU/Linux
restic 0.14.0 compiled with go1.19 on linux/amd64
```

If I run `ssh -v` over the connection I used for the backup, I get:

```
debug1: Local version string SSH-2.0-OpenSSH_8.4p1 Debian-5+deb11u1
debug1: Remote protocol version 2.0, remote software version OpenSSH_8.4p1 Debian-5+deb11u1
```

greatroar commented 1 year ago

That's a recent enough OpenSSH. When you sftp to the server, what do `df` and `df -i` say? For reference, this is what restic checks before generating a "no free space left on device" message:

https://github.com/restic/restic/blob/4ba31df08f262b6cbd4ac78846175d5b599f3efb/internal/backend/sftp/sftp.go#L406

That's "no free inodes or block size * free blocks < size of file we're uploading". It might be that you're hitting ENOSPC earlier than that. In particular, some of the space on the server may be reserved for the root user.

Also, try running restic in debug mode and see if a message starting with "sftp: StatVFS returned" occurs in the log.

MichaelEischer commented 1 year ago

That's "no free inodes or block size * free blocks < size of file we're uploading". It might be that you're hitting ENOSPC earlier than that. In particular, some of the space on the server may be reserved for the root user.

Isn't there usually a small reserved amount of storage which is only writable by the root user? That reserved storage would probably also be reported as free space?

ajsicek commented 1 year ago

To be able to do the backups I resized the disk, so I'm not sure how much I can investigate any further :( `df` on the server now says:

```
/dev/sdb1  131364796  68578612  56080404  56%  /mnt/data_alfa
```

`df -i` says:

```
/dev/sdb1  8364032  5164  8358868  1%  /mnt/data_alfa
```

When the backup failed, `df` showed "use 100%"; `df -i` I did not check.

In auth.log on the server there is a message from the time of that backup:

```
sftp-server 1715951 - - error: process_write: write: No space left on device
```

Anything else I can dig out of the logs?

greatroar commented 1 year ago

Isn't there usually a small reserved amount of storage which is only writable by the root user? That reserved storage would probably also be reported as free space?

Yes, and I just reproduced the behavior by backing up to an 8MiB image with 50% reserved blocks. sftp.StatVFS.FreeSpace counts the total free space, but the struct has the required information about reserved blocks. I suppose we can assume the user is not root on the server?
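The distinction comes down to two counters in the statvfs reply: `f_bfree` counts all free blocks, including the ones reserved for root, while `f_bavail` counts only what an unprivileged user may actually write. A minimal sketch, with made-up numbers standing in for a small filesystem whose remaining free blocks are all reserved (the helpers mirror the `frsize * blocks` arithmetic; they are not pkg/sftp API):

```go
package main

import "fmt"

// freeSpace mirrors a "total free space" calculation (frsize * bfree),
// which includes blocks reserved for root.
func freeSpace(frsize, bfree uint64) uint64 { return frsize * bfree }

// availSpace is what a non-root user can actually use (frsize * bavail).
func availSpace(frsize, bavail uint64) uint64 { return frsize * bavail }

func main() {
	// Hypothetical numbers: 1 KiB blocks, 1024 blocks still free in
	// total, but none available to non-root because they're reserved.
	const frsize, bfree, bavail = 1024, 1024, 0

	fmt.Println(freeSpace(frsize, bfree))   // looks like 1 MiB free...
	fmt.Println(availSpace(frsize, bavail)) // ...but non-root gets ENOSPC
}
```

Checking `bavail` instead of `bfree` is therefore the right call whenever the SFTP user is not root on the server.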

ajsicek commented 1 year ago

I suppose we can assume the user is not root on the server?

yes that is correct

greatroar commented 1 year ago

@ajsicek I meant in general, I would not address you as "the user" :) #4075 should fix this.