vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Backup PartiallyFailed when resource deleted during backup #4330

Open kojanabzah opened 2 years ago

kojanabzah commented 2 years ago

What steps did you take and what happened: We have a backup that runs for 25m-50m. During the backup, items (VolumeSnapshotContents) are deleted by other components. The backup finishes with a PartiallyFailed status, and the log shows that an item it was trying to back up was not found.

What did you expect to happen: An item that is deleted while the backup is in progress should be fine. If the API server returns "not found", the item has most likely been deleted and this should not be treated as an error.

The output of the following commands will help us better understand what's going on: (Pasting long output into a GitHub gist or other pastebin is fine.)

{"backup":"velero/velero-devspace-backup-20211028170003","error.file":"/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:331","error.function":"github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions","error.message":"error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=my-ns, name=workspaces-ws-20210730165910): rpc error: code = Unknown desc = error getting volume snapshot content from API: volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-c6c8fecb-f1c0-4bcf-9642-ca90d6c87f25\" not found","level":"error","logSource":"pkg/backup/backup.go:431","msg":"Error backing up item","name":"workspaces-ws-20210730165910","time":"2021-10-28T17:14:24Z"}
{"backup":"velero/velero-devspace-backup-20211028170003","error.file":"/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:331","error.function":"github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions","error.message":"error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=my-ns name=workspaces-ws-20210906203303): rpc error: code = Unknown desc = error getting volume snapshot content from API: volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-f2b9ae35-b945-4306-bc8a-d819ff3db269\" not found","level":"error","logSource":"pkg/backup/backup.go:431","msg":"Error backing up item","name":"workspaces-ws-20210906203303","time":"2021-10-28T17:14:52Z"}

Anything else you would like to add: Our backups contain around 140k items, which is why they take so long. During that window, items are deleted by a cleanup process unrelated to Velero.

Environment:

Velero version: 1.6.3
Kubernetes version: 1.20.11

Vote on this issue!

This is an invitation to the Velero community to vote on issues. You can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

zubron commented 2 years ago

Hi @kojanabzah :wave:

I'm not very familiar with the CSI plugin, but I'll try to help here :) Looking at the plugin code, this particular error is only triggered for VolumeSnapshots created outside of Velero's control (i.e. not as part of Velero's backup process). The plugin assumes that any VolumeSnapshot object being backed up has a VolumeSnapshotContent available.

I'm trying to understand the workflow of the other components that you are using. The CSI plugin is trying to fetch the VolumeSnapshotContent that is bound to a particular VolumeSnapshot. In your use case, the VolumeSnapshot exists but is bound to a VolumeSnapshotContent that has been deleted. Are those VolumeSnapshots also deleted as part of this flow?

kojanabzah commented 2 years ago

Hi @zubron, thanks so much for looking into this!

The VolumeSnapshots were not created by Velero; we don't back up snapshots using Velero (snapshotVolumes: false). Velero is just backing up the YAML of the VolumeSnapshots and VolumeSnapshotContents.

Just to clarify our flow again: we have a process, unrelated to Velero, that creates VolumeSnapshots and deletes them after a certain amount of time. Before the deletion, the Velero schedule is triggered and sees that it needs to back up the VolumeSnapshot and VolumeSnapshotContent YAML. While Velero is running, our process is also running, decides that a VolumeSnapshot is old, and deletes it.

Based on the Velero code, my understanding is that Velero builds a list of the items it needs to back up before doing the actual backup. When Velero built that list, the VolumeSnapshotContent existed, but by the time Velero reached it during the backup it had already been deleted. If I understand correctly and Velero does save a list before performing the actual backup, it should account for items being deleted while it iterates over that list. In our case the backup process is very long (25m-50m, around 140k items), so it is very likely that items will be deleted in the meantime.

kojanabzah commented 2 years ago

Can anyone confirm whether this is an issue or not? The error is generated by the CSI plugin, but I don't think this is a problem on the plugin side: Velero is the one holding the VolumeSnapshotContent list in memory. From what I understand, and please correct me if I'm wrong, Velero first collects the items it is going to back up and then iterates over them to do the actual backup. During that iteration we delete items from the list Velero holds (in our case VolumeSnapshotContents), so the plugin returns an error, as it should, because the VolumeSnapshotContent was deleted.

If I'm correct, then Velero just needs to check whether the error returned from the plugin is a NotFound error; if so, it's fine to continue.
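
For illustration, a minimal sketch of what such a check might look like on the Velero side, assuming the plugin mapped Kubernetes NotFound errors to the gRPC NotFound code (the logs above show the plugin currently returns code Unknown, so this is hypothetical, not current behaviour):

```go
// Hypothetical helper, not part of Velero: classify a plugin error that came
// back over gRPC and decide whether it represents a deleted object.
package backup

import (
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isPluginNotFound returns true if the plugin reported the object as missing,
// in which case the backup could log a skip instead of recording a failure.
func isPluginNotFound(err error) bool {
	return status.Code(err) == codes.NotFound
}
```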

Could someone update this issue's labels? I think I've provided the relevant information and I don't believe this is related to CSI. If more info is needed, let me know.

bhargavkeshav commented 2 years ago

Hi team,

Any updates on this? I am facing the same issue in an EKS environment with EBS volumes (installed with the EBS CSI driver). Backups sometimes complete, but most of the time they end up PartiallyFailed with the error below.

time="2022-06-09T03:18:42Z" level=info msg="1 errors encountered backup up item" backup=velero/velero-utilities-backup-20220609030018 logSource="pkg/backup/backup.go:427" name=velero-buffer-sumologic-collector-sumologic-fluentd-metricx2v9p
time="2022-06-09T03:18:42Z" level=error msg="Error backing up item" backup=velero/velero-utilities-backup-20220609030018 error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=observability-system, name=velero-buffer-sumologic-collector-sumologic-fluentd-metricx2v9p): rpc error: code = Unknown desc = error getting volume snapshot content from API: volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-22316cfe-8f04-4e87-9fe4-12c432aa6ce0\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:331" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:431" name=velero-buffer-sumologic-collector-sumologic-fluentd-metricx2v9p
time="2022-06-09T03:18:43Z" level=info msg="1 errors encountered backup up item" backup=velero/velero-utilities-backup-20220609030018 logSource="pkg/backup/backup.go:427" name=velero-buffer-sumologic-collector-sumologic-fluentd-metricxr9gc
time="2022-06-09T03:18:43Z" level=error msg="Error backing up item" backup=velero/velero-utilities-backup-20220609030018 error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=observability-system, name=velero-buffer-sumologic-collector-sumologic-fluentd-metricxr9gc): rpc error: code = Unknown desc = error getting volume snapshot content from API: volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-c6e3690d-bd54-4872-80f5-39369e88f154\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:331" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:431" name=velero-buffer-sumologic-collector-sumologic-fluentd-metricxr9gc

Please help me in fixing this.

EKS version: 1.0
Velero version: chart velero-2.19.2, app version 1.6.0

thirdeyenick commented 2 years ago

Hi there, we are facing the same issue. In our case, Velero is the only application that creates VolumeSnapshots. We do have schedules that overlap in terms of timing and the namespaces they back up. We periodically get messages like:

time="2022-07-27T06:54:42Z" level=error msg="Error backing up item" backup=nine-velero/velero-dailybackup-20220727063705 error="error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=graphql-stage, name=velero-redis-data-graphql-stage-redis-master-0-8tcvh): rpc error: code = Unknown desc = error getting volume snapshot content from API: volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-344e4bd5-8c72-4473-bd5b-2d396166956e\" not found" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:316" error.function="github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions" logSource="pkg/backup/backup.go:417" name=velero-redis-data-graphql-stage-redis-master-0-8tcvh

It seems that, during a backup run of one schedule, old snapshots from a run of another schedule are being removed.

jsmcnair commented 1 year ago

I enabled debug logging to trace this through and see if I could create a fix. The messages below are redacted and in reverse order (newest first).

2023-06-27T14:45:24Z    {"phase":"FinalizingPartiallyFailed","backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/controller/backup_controller.go:244","msg":"Backup is not handled"}
2023-06-27T14:45:24Z    {"msg":"Backup completed","backup":"velero/policy-name-20230627143726","level":"info","logSource":"pkg/controller/backup_controller.go:785"}
2023-06-27T14:45:24Z    {"backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/controller/backup_controller.go:244","msg":"Backup is not handled","phase":"PartiallyFailed"}
2023-06-27T14:45:24Z    {"backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/controller/backup_controller.go:244","msg":"Backup is not handled","phase":"InProgress"}
2023-06-27T14:45:23Z    {"level":"info","logSource":"pkg/controller/backup_controller.go:771","msg":"Setting up backup store to persist the backup","backup":"velero/policy-name-20230627143726"}

2023-06-27T14:44:21Z    {"name":"velero-storage-persistent-volume-name","namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/plugin/framework/action_resolver.go:61","msg":"Skipping action because it does not apply to this resource"}
2023-06-27T14:44:21Z    {"name":"velero-storage-persistent-volume-name","namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/plugin/framework/action_resolver.go:61","msg":"Skipping action because it does not apply to this resource"}
2023-06-27T14:44:21Z    {"name":"velero-storage-persistent-volume-name","namespace":"namespace","progress":"","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726","level":"info","logSource":"pkg/backup/backup.go:397","msg":"Backed up 3581 items out of an estimated total of 5778 (estimate will change throughout the backup)"}
2023-06-27T14:44:21Z    {"namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/plugin/framework/action_resolver.go:61","msg":"Skipping action because it does not apply to this resource","name":"velero-storage-persistent-volume-name"}
2023-06-27T14:44:21Z    {"msg":"Executing post hooks","name":"velero-storage-persistent-volume-name","namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/backup/item_backupper.go:224"}
2023-06-27T14:44:21Z    {"logSource":"pkg/plugin/framework/action_resolver.go:61","msg":"Skipping action because it does not apply to this resource","name":"velero-storage-persistent-volume-name","namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726","level":"debug"}
2023-06-27T14:44:21Z    {"logSource":"pkg/backup/item_backupper.go:173","msg":"Backing up item","name":"velero-storage-persistent-volume-name","namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726","level":"info"}
2023-06-27T14:44:21Z    {"level":"info","logSource":"pkg/backup/item_backupper.go:325","msg":"Executing custom action","name":"velero-storage-persistent-volume-name","namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726"}
2023-06-27T14:44:21Z    {"level":"info","logSource":"pkg/backup/backup.go:357","msg":"Processing item","name":"velero-storage-persistent-volume-name","namespace":"namespace","progress":"","resource":"volumesnapshots.snapshot.storage.k8s.io","backup":"velero/policy-name-20230627143726"}
2023-06-27T14:44:21Z    {"level":"error","logSource":"pkg/backup/backup.go:435","msg":"Error backing up item","name":"velero-storage-persistent-volume-name","backup":"velero/policy-name-20230627143726","error.file":"/go/src/github.com/vmware-tanzu/velero/pkg/backup/item_backupper.go:337","error.function":"github.com/vmware-tanzu/velero/pkg/backup.(*itemBackupper).executeActions","error.message":"error executing custom action (groupResource=volumesnapshots.snapshot.storage.k8s.io, namespace=namespace, name=velero-storage-persistent-volume-name): rpc error: code = Unknown desc = error getting volume snapshot content from API: volumesnapshotcontents.snapshot.storage.k8s.io \"snapcontent-2c5ecea2-2b45-4519-9a0f-6c3d92eb7614\" not found"}
2023-06-27T14:44:21Z    {"backup":"velero/policy-name-20230627143726","level":"info","logSource":"pkg/backup/backup.go:431","msg":"1 errors encountered backup up item","name":"velero-storage-persistent-volume-name"}
2023-06-27T14:44:21Z    {"backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/plugin/framework/action_resolver.go:61","msg":"Skipping action because it does not apply to this resource","name":"velero-storage-persistent-volume-name","namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io"}
2023-06-27T14:44:21Z    {"backup":"velero/policy-name-20230627143726","level":"debug","logSource":"pkg/backup/item_backupper.go:181","msg":"Executing pre hooks","name":"velero-storage-persistent-volume-name","namespace":"namespace","resource":"volumesnapshots.snapshot.storage.k8s.io"}
2023-06-27T14:44:21Z    {"backup":"velero/policy-name-20230627143726","cmd":"/plugins/velero-plugin-for-csi","level":"info","logSource":"/go/src/velero-plugin-for-csi/internal/backup/volumesnapshot_action.go:91","msg":"Getting VolumesnapshotContent for Volumesnapshot namespace/velero-storage-persistent-volume-name","pluginName":"velero-plugin-for-csi"}

2023-06-27T14:41:30Z    {"level":"info","logSource":"internal/delete/delete_item_action_handler.go:111","msg":"invoking DeleteItemAction plugins","namespace":"namespace","item":"velero-storage-persistent-volume-name"}
2023-06-27T14:41:30Z    {"cmd":"/plugins/velero-plugin-for-csi","controller":"backup-deletion","deletebackuprequest":"velero/policy-name-20230626143637-dzd8g","level":"info","logSource":"/go/src/velero-plugin-for-csi/internal/delete/volumesnapshot_action.go:49","msg":"Deleting Volumesnapshot namespace/velero-storage-persistent-volume-name","pluginName":"velero-plugin-for-csi","backup":"policy-name-20230626143637"}

2023-06-27T14:37:26Z    {"logSource":"pkg/controller/backup_controller.go:623","msg":"Getting backup item actions","backup":"velero/policy-name-20230627143726","level":"info"}
2023-06-27T14:37:26Z    {"level":"info","logSource":"pkg/controller/backup_controller.go:612","msg":"Setting up backup temp file","backup":"velero/policy-name-20230627143726"}
2023-06-27T14:37:26Z    {"backup":"velero/policy-name-20230627143726","level":"info","logSource":"pkg/controller/backup_controller.go:628","msg":"Setting up backup store to check for backup existence"}
2023-06-27T14:37:26Z    {"backup":"velero/policy-name-20230627143726","level":"info","logSource":"pkg/controller/backup_controller.go:619","msg":"Setting up plugin manager"}
2023-06-27T14:37:26Z    {"backup":"velero/policy-name-20230627143726","level":"info","logSource":"pkg/controller/backup_controller.go:601","msg":"Setting up backup log"}

So we can clearly see that the backup starts, then the item is deleted prior to the backup getting to the item.

jsmcnair commented 1 year ago

In the item backupper there is already a check to see if the item is being deleted:

https://github.com/vmware-tanzu/velero/blob/de83980a05059f0e41ed00f1cf0ef061a9ebc049/pkg/backup/item_backupper.go#L155-L158

Seems reasonable to also check that the item still exists?
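
To illustrate the idea, here is a minimal sketch of what such an existence check could look like using the dynamic client; the helper name and wiring are hypothetical and not Velero's actual implementation:

```go
// Hypothetical helper, not part of Velero: report whether an item that was
// listed when the backup started can still be fetched from the API server.
package backup

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

func itemStillExists(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) (bool, error) {
	_, err := client.Resource(gvr).Namespace(namespace).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		// Deleted between listing and backup: the caller could skip the item
		// (and its backup item actions) instead of recording an error.
		return false, nil
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```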

jsmcnair commented 1 year ago

Actually I think the issue is better dealt with here:

https://github.com/vmware-tanzu/velero-plugin-for-csi/blob/main/internal/backup/volumesnapshot_action.go#L96-L102

The plugin should check whether the VS still exists before trying to get the VSC resource.
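
As a rough sketch of that idea (assuming the external-snapshotter v1 API and clientset; the function is hypothetical, not the plugin's actual code), the lookup could tolerate NotFound on both objects:

```go
// Hypothetical sketch, not the CSI plugin's actual code: resolve the
// VolumeSnapshotContent bound to a VolumeSnapshot, returning nil (and no
// error) if either object has already been deleted mid-backup.
package backup

import (
	"context"

	snapshotv1 "github.com/kubernetes-csi/external-snapshotter/client/v4/apis/volumesnapshot/v1"
	snapshotclient "github.com/kubernetes-csi/external-snapshotter/client/v4/clientset/versioned"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func resolveBoundContent(ctx context.Context, client snapshotclient.Interface, ns, name string) (*snapshotv1.VolumeSnapshotContent, error) {
	vs, err := client.SnapshotV1().VolumeSnapshots(ns).Get(ctx, name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil, nil // snapshot deleted mid-backup: nothing to resolve
	}
	if err != nil {
		return nil, err
	}
	if vs.Status == nil || vs.Status.BoundVolumeSnapshotContentName == nil {
		return nil, nil // not bound (yet): nothing to resolve
	}
	vsc, err := client.SnapshotV1().VolumeSnapshotContents().Get(ctx, *vs.Status.BoundVolumeSnapshotContentName, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return nil, nil // content deleted mid-backup: tolerate instead of failing
	}
	return vsc, err
}
```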

draghuram commented 1 year ago

I think it would be much clearer to state the issue outside the context of snapshot resources. Please let me know if this sounds correct:

jsmcnair commented 1 year ago

I created a fix in two places for this and my problem continued. So I did another round of log tracing and investigation, looking through the code. I couldn't make sense of why I was still seeing some snapshot errors.

I believe I have found the solution to my problem and, somewhat embarrassingly, it's not an issue that is caused by Velero or the CSI plugin...

I reviewed the audit logs for the snapshot creation and found that ArgoCD was kindly pruning the VolumeSnapshot resources before the backup process had completed. I've now configured ArgoCD to ignore the VolumeSnapshot and VolumeSnapshotContent resources and the errors have gone away.

So. If you use a GitOps controller that is configured to prune resources that should not exist in your namespace, consider excluding those resources from the controller.

draghuram commented 1 year ago

That's an interesting find. It is also dangerous to delete VolumeSnapshot resources, especially if the deletionPolicy of the VolumeSnapshotClass is "Delete". In that case, the underlying storage snapshots will be deleted when the VolumeSnapshots are deleted. I think a warning should be added to the Velero CSI docs.