vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Backup stuck in phase WaitingForPluginOperations after datamover eviction #8367

Open msfrucht opened 3 weeks ago

msfrucht commented 3 weeks ago

What steps did you take and what happened:

Tested what happens with datamover evictions by using kubectl evict-pod to trigger evictions. https://github.com/rajatjindal/kubectl-evict-pod

Back up a namespace with 5 PVCs. During the backup, evict one of the datamover pods with kubectl evict-pod delay-backup-eviction-test-m8cgf-tstnj -n oadp-1-4

This was done after installing Velero 1.15.0-rc.2 with the Dockerfile changes for OpenShift: https://github.com/msfrucht/openshift-velero/commits/velero_in_openshift_1.15.0-rc.2

I ran new backups on a namespace with 5 PVCs, with datamover load concurrency set to 5 so that all datamovers could begin immediately. No load affinity settings were configured. backupPVC was configured to allow the fast read-only snapshot restore behaviors of CephFS and IBM Storage Scale.
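For readers who want to reproduce the eviction without the plugin, here is a minimal sketch of the request involved. It assumes kubectl evict-pod goes through the standard Kubernetes Eviction subresource (policy/v1), which I have not verified against the plugin's source. The helper name eviction_request is hypothetical, and the sketch only builds the request, it does not send it.

```python
import json


def eviction_request(namespace: str, pod: str) -> tuple[str, dict]:
    """Return (path, body) for a POST to a pod's eviction subresource.

    Hypothetical helper for illustration; it does not contact a cluster.
    """
    path = f"/api/v1/namespaces/{namespace}/pods/{pod}/eviction"
    body = {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
    }
    return path, body


# Pod and namespace names are taken from the reproduction step above.
path, body = eviction_request("oadp-1-4", "delay-backup-eviction-test-m8cgf-tstnj")
print("POST", path)
print(json.dumps(body, indent=2))
```

Sending this body to the API server with an authenticated client should ask the kubelet to terminate the datamover pod gracefully, which is the condition under test here.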

What did you expect to happen:

Backup to finish with phase PartiallyFailed.

If you are using velero v1.7.0+:

_output/bin/linux/amd64/velero debug --backup=delay-backup-eviction-test-m8cgf --namespace oadp-1-4
2024/11/01 11:11:02 Collecting velero resources in namespace: oadp-1-4
2024/11/01 11:11:04 Collecting velero deployment logs in namespace: oadp-1-4
2024/11/01 11:11:05 Collecting log and information for backup: delay-backup-eviction-test-m8cgf
2024/11/01 11:11:08 Generated debug information bundle: /root/guardian/guardian-velero/bundle-2024-11-01-11-11-01.tar.gz

bundle-2024-11-01-11-11-01.tar.gz

Anything else you would like to add: Item operations values in status are as expected except for the Phase.

status:
  formatVersion: 1.1.0
  backupItemOperationsCompleted: 4
  backupItemOperationsAttempted: 5
  progress:
    itemsBackedUp: 39
    totalItems: 39
  expiration: '2024-12-01T17:39:12Z'
  startTimestamp: '2024-11-01T17:39:12Z'
  hookStatus: {}
  version: 1
  phase: WaitingForPluginOperations

The stuck backup does not block additional backups from taking place. The next backup created finished successfully without issue.

I suspect the same issue also exists in Restore.
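To make the symptom easier to spot, here is a rough sketch of a check over the status fields shown above. It is not Velero code: the helper names, the 4 hour threshold, and the chosen "now" value are assumptions for illustration, and backupItemOperationsFailed is read with a default of 0 because it is absent from the status above.

```python
from datetime import datetime, timedelta, timezone


def pending_item_operations(status: dict) -> int:
    """Operations attempted but not yet recorded as completed or failed."""
    attempted = status.get("backupItemOperationsAttempted", 0)
    completed = status.get("backupItemOperationsCompleted", 0)
    failed = status.get("backupItemOperationsFailed", 0)
    return attempted - completed - failed


def looks_stuck(status: dict, now: datetime, max_wait: timedelta) -> bool:
    """True if the backup is still waiting on operations after max_wait."""
    if status.get("phase") != "WaitingForPluginOperations":
        return False
    started = datetime.fromisoformat(status["startTimestamp"].replace("Z", "+00:00"))
    return pending_item_operations(status) > 0 and now - started > max_wait


# Values copied from the status in this report.
status = {
    "backupItemOperationsCompleted": 4,
    "backupItemOperationsAttempted": 5,
    "startTimestamp": "2024-11-01T17:39:12Z",
    "phase": "WaitingForPluginOperations",
}

# Hypothetical observation time and threshold, chosen for illustration only.
now = datetime(2024, 11, 2, tzinfo=timezone.utc)
print(pending_item_operations(status))
print(looks_stuck(status, now, timedelta(hours=4)))
```

With the reported values, one operation is neither completed nor failed, which matches the evicted datamover never being accounted for.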

Environment:

sh-5.1$ ./velero version
Client:
    Version: main
    Git commit: 1aa173cd44f9cd500f1da5dbe669874622070cc1-
Server:
    Version: main

 ./velero client config get features
features: <NOT SET>

--features=EnableCSI

Client Version: 4.15.12
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: 4.16.8
Kubernetes Version: v1.29.7+4510e9c

Lyndon-Li commented 3 weeks ago

I am afraid this is the expected behavior given the current design and implementation: we haven't considered some resilience cases, including this one. Let's work out a more extensive design that covers resilience and robustness for the data mover in future releases.