vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Error thrown for empty files during backup #4183

Open felfa01 opened 3 years ago

felfa01 commented 3 years ago

What steps did you take and what happened: Backups (run with `--use-restic`) of Azure FileShares are marked as PartiallyFailed after the error `Warning: failed to read all source data during backup\n: exit status 3` is raised during backup.

After some troubleshooting I have identified that all the failed source-data reads point to empty files (0 B) within the Azure FileShare. It appears that Velero's restic backups can't handle empty files properly.

What did you expect to happen: Empty files should not cause backups to end up PartiallyFailed; they should e.g. be skipped with a warning.

Relevant part of the error log:

time="2021-09-22T08:47:55Z" level=info msg="Backing up item" backup=velero/daily-backup logSource="pkg/backup/item_backupper.go:121" name=redis-data-redis-replicas-0 namespace=backup-ns resource=persistentvolumeclaims
time="2021-09-22T08:47:55Z" level=info msg="Executing custom action" backup=velero/daily-backup logSource="pkg/backup/item_backupper.go:327" name=redis-data-redis-replicas-0 namespace=backup-ns resource=persistentvolumeclaims
time="2021-09-22T08:47:55Z" level=info msg="Executing PVCAction" backup=velero/daily-backup cmd=/velero logSource="pkg/backup/backup_pv_action.go:49" pluginName=velero
time="2021-09-22T08:47:55Z" level=info msg="Backing up item" backup=velero/daily-backup logSource="pkg/backup/item_backupper.go:121" name=pvc-123456789-abcdefgh namespace= resource=persistentvolumes
time="2021-09-22T08:47:55Z" level=info msg="Executing takePVSnapshot" backup=velero/daily-backup logSource="pkg/backup/item_backupper.go:405" name=pvc-123456789-abcdefgh namespace= resource=persistentvolumes
time="2021-09-22T08:47:55Z" level=info msg="Skipping snapshot of persistent volume because volume is being backed up with restic." backup=velero/daily-backup logSource="pkg/backup/item_backupper.go:423" name=pvc-123456789-abcdefgh namespace= persistentVolume=pvc-123456789-abcdefgh resource=persistentvolumes
time="2021-09-22T08:47:59Z" level=info msg="1 errors encountered backup up item" backup=velero/daily-backup logSource="pkg/backup/backup.go:427" name=redis-replicas-0
time="2021-09-22T08:47:59Z" level=error msg="Error backing up item" backup=velero/daily-backup error="pod volume backup failed: error running restic backup, stderr={\"message_type\":\"error\",\"error\":{\"Op\":\"lstat\",\"Path\":\"/host_pods/a1b2c3d4e5/volumes/kubernetes.io~azure-file/pvc-123456789-abcdefgh/appendonly.aof\",\"Err\":2},\"during\":\"scan\",\"item\":\"/host_pods/a1b2c3d4e5/volumes/kubernetes.io~azure-file/pvc-123456789-abcdefgh/appendonly.aof\"}\n{\"message_type\":\"error\",\"error\":{\"Op\":\"lstat\",\"Path\":\"appendonly.aof\",\"Err\":2},\"during\":\"archival\",\"item\":\"/host_pods/a1b2c3d4e5/volumes/kubernetes.io~azure-file/pvc-123456789-abcdefgh/appendonly.aof\"}\nWarning: failed to read all source data during backup\n: exit status 3" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/restic/backupper.go:179" error.function="github.com/vmware-tanzu/velero/pkg/restic.(*backupper).BackupPodVolumes" logSource="pkg/backup/backup.go:431" name=redis-replicas-0
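The `"Err":2` in the restic JSON above is errno 2 (`ENOENT`, "no such file or directory"): restic saw `appendonly.aof` during its scan phase, but the file was gone by the time it called `lstat` on it during archival. A minimal sketch of the same scan-vs-delete race (file name and directory are illustrative, plain POSIX filesystem assumed):

```python
import errno
import os
import tempfile

# Simulate restic's scan-then-archive pattern: enumerate the directory
# first, stat each entry later. If a file disappears in between, os.lstat
# fails with errno 2 (ENOENT) -- the "Err":2 seen in the log above.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "appendonly.aof")   # hypothetical file name
    open(path, "w").close()                    # empty file, like an idle redis AOF

    entries = os.listdir(d)                    # "scan" phase: file is visible
    os.remove(path)                            # app deletes/rewrites it mid-backup

    caught = None
    try:
        os.lstat(os.path.join(d, entries[0]))  # "archival" phase: file is gone
    except OSError as e:
        caught = e.errno

    print(caught == errno.ENOENT)              # the race surfaces as ENOENT
```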

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

reasonerjt commented 3 years ago

@felfa01 Those files were probably modified while restic was doing the backup. Could you confirm whether they are short-lived temporary files, and whether the data is important?

@dsu-igeek If that's the case, do you think those errors can be ignored?

felfa01 commented 3 years ago

@reasonerjt My guess is that the application (redis in this case) stores future append operations in this file. When no operations have been run against the redis cache, the file is empty. For other caches where operations have actually been run, the file is not empty and is backed up properly. So no, in its current state the file is not important.

I'd assume that the rest of the files on the FileShare are backed up properly, but the error makes it look like we have a faulty backup. I agree with you that the errors should be ignored (or at least not affect the backup status).

reasonerjt commented 3 years ago

@felfa01 For such a race condition, if you want to avoid the partial failure in the backup status, please use a hook to freeze the filesystem.
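For reference, a sketch of what such a hook can look like as pod annotations (the container name and mount path are illustrative; check the Velero backup-hooks documentation for the exact semantics of your version):

```yaml
# Hypothetical pod annotations: freeze the mounted filesystem before the
# backup and unfreeze it afterwards. The container running fsfreeze needs
# the target volume mounted and sufficient privileges.
metadata:
  annotations:
    pre.hook.backup.velero.io/container: fsfreeze
    pre.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--freeze", "/data"]'
    post.hook.backup.velero.io/container: fsfreeze
    post.hook.backup.velero.io/command: '["/sbin/fsfreeze", "--unfreeze", "/data"]'
```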

felfa01 commented 3 years ago

@reasonerjt Unfortunately, since an application team writes to this specific file system, I cannot lock it; that would cause write failures on the application side.

I have seen this same error (causing PartiallyFailed backups) when files on the fileshare are deleted during the backup process.

Edit: After some additional testing I have found that these error-inducing files are in fact breaking the whole fileshare backup, i.e. other files on the same fileshare are not backed up properly, causing data loss. My suggestion would be to skip the error-inducing files but continue backing up the rest properly.

pavanfhw commented 3 years ago

I am having the same issue. Is there at least a workaround for this?

felfa01 commented 3 years ago

> I am having the same issue. Is there at least a workaround for this?

It's by no means a workaround, but I have not seen the issue arise when using disks instead of file shares.

pavanfhw commented 3 years ago

Hi @reasonerjt. Maybe in my case I can avoid this error by excluding a cache directory from the backed-up files. Is it possible to add the `--exclude` flag to the restic daemonset command `restic server --features=`? I am not sure whether this works the same way as the restic CLI.

Or if there is a better way of doing this with Velero, please tell me.

navilg commented 1 year ago

I am getting the same error intermittently while backing up a filesystem. Is there any fix or workaround for this? I cannot afford to freeze the filesystem every time before a backup, as it would affect the live application.

@Lyndon-Li Would you be able to help with this query? Has it been fixed in any version of Velero?

I am using Velero 1.6.2 with restic integrated.

varac commented 11 months ago

It would be interesting to know whether Kopia handles such cases differently or more gracefully. Has anyone experienced this with Kopia?