The error is as below:
time="2023-08-31T06:47:22Z" level=error msg="Failed to rebind PV to target PVC on completion" controller=DataDownload datadownload=dm-restore-100pods-2gb-rmgrg error="error to wait restore PV bound, restore PV pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150: error to wait for bound of PV: pv has been bound by unexpected pvc upstream-velero/dm-restore-100pods-2gb-gjwrd" error.file="/go/src/github.com/vmware-tanzu/velero/pkg/util/kube/pvc_pv.go:287" error.function=github.com/vmware-tanzu/velero/pkg/util/kube.WaitPVBound.func1 logSource="pkg/controller/data_download_controller.go:293"
The restore PV is bound by another restore PVC instead of the target PVC.
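For context, here is a minimal sketch of the kind of check behind that message. This is not Velero's actual pkg/util/kube.WaitPVBound code; the package and function names (pvwait, waitPVBoundTo) and the polling details are assumptions for illustration only. The idea is to poll the PV and fail early if its claimRef points at a PVC other than the intended target.

```go
// Minimal illustration only — not Velero's WaitPVBound implementation.
// waitPVBoundTo (hypothetical name) polls a PV until it is bound to the
// expected target PVC, and errors out if some other PVC grabbed it first,
// which is the "pv has been bound by unexpected pvc" case in the log above.
package pvwait

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitPVBoundTo(ctx context.Context, client kubernetes.Interface, pvName, pvcNamespace, pvcName string, timeout time.Duration) error {
	return wait.PollImmediate(time.Second, timeout, func() (bool, error) {
		pv, err := client.CoreV1().PersistentVolumes().Get(ctx, pvName, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		if pv.Spec.ClaimRef == nil {
			// Not claimed yet; keep polling.
			return false, nil
		}
		if pv.Spec.ClaimRef.Namespace != pvcNamespace || pv.Spec.ClaimRef.Name != pvcName {
			// Bound, but by the wrong PVC — stop waiting and report it.
			return false, fmt.Errorf("pv %s has been bound by unexpected pvc %s/%s",
				pvName, pv.Spec.ClaimRef.Namespace, pv.Spec.ClaimRef.Name)
		}
		return pv.Status.Phase == corev1.VolumeBound, nil
	})
}
```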
Thanks @Lyndon-Li
According to the above error, the two problematic datadownloads resources are: dm-restore-100pods-2gb-rmgrg and dm-restore-100pods-2gb-gjwrd
output from DEBUG/datadownloads_issue.log

Thu Aug 31 06:47:04 UTC 2023
dm-restore-100pods-2gb-rmgrg   InProgress   21s   1353830747   2119680347   default   21s   worker000-r640

Thu Aug 31 06:48:05 UTC 2023
dm-restore-100pods-2gb-gjwrd   Prepared     44s                             default   44s   worker002-r640
dm-restore-100pods-2gb-rmgrg   InProgress   81s   2119680347   2119680347   default   81s   worker000-r640

Thu Aug 31 07:03:06 UTC 2023
dm-restore-100pods-2gb-gjwrd   InProgress   15m   2119680347   2119680347   default   15m   worker002-r640
dm-restore-100pods-2gb-rmgrg   InProgress   16m   2119680347   2119680347   default   16m   worker000-r640
@duduvaa
The second error (for dm-restore-100pods-2gb-gjwrd) is a consequence of the first error (for dm-restore-100pods-2gb-rmgrg):
When dm-restore-100pods-2gb-rmgrg failed, the controller deleted all the intermediate resources, including the wrongly bound restore PV pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150. But the binding that was wrong for dm-restore-100pods-2gb-rmgrg was right for DD dm-restore-100pods-2gb-gjwrd: PVC upstream-velero/dm-restore-100pods-2gb-gjwrd was binding to PV pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150, so deleting that PV broke the second DataDownload as well.
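To make the cross-binding easier to see, here is a small diagnostic sketch (a hypothetical helper, not Velero code) that prints which PV each PVC in a namespace is actually bound to; with it, upstream-velero/dm-restore-100pods-2gb-gjwrd would show up pointing at pvc-ac7cf4cb-b58d-48b4-a3df-6f8931b17150, the restore PV created for the other DataDownload.

```go
// Diagnostic sketch (hypothetical helper, not part of Velero): list PVCs in a
// namespace together with the PV each one is bound to, to spot a PVC that
// ended up on a restore PV created for a different DataDownload.
package pvcdiag

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func printPVCBindings(ctx context.Context, client kubernetes.Interface, namespace string) error {
	pvcs, err := client.CoreV1().PersistentVolumeClaims(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, pvc := range pvcs.Items {
		// Spec.VolumeName is empty while the PVC is still Pending.
		fmt.Printf("%s/%s\tphase=%s\tboundPV=%s\n",
			namespace, pvc.Name, pvc.Status.Phase, pvc.Spec.VolumeName)
	}
	return nil
}
```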
Fixed by #6738 and #6743
Bug verified OK. All 100 PVs were created and restore ended with 'Completed' status.
Thank you @Lyndon-Li for the quick fix
What steps did you take and what happened:
Running a datamover restore of 100 PVs. A few 'datadownloads' resources are stuck in the 'InProgress' status even though the total bytes have already been transferred. The restore CR status stays 'WaitingForPluginOperations' until the 4-hour timeout is reached. 100 PVCs were created - 98 are 'Bound' and 2 are 'Pending'.
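For reference, a rough sketch of how those stuck resources could be enumerated programmatically. The velero.io/v2alpha1 datadownloads GVR and the helper name below are assumptions; adjust to whatever CRD version your cluster reports.

```go
// Rough sketch using the dynamic client to flag DataDownloads stuck in
// InProgress. The GVR below (velero.io/v2alpha1, datadownloads) is an
// assumption — check the CRD version installed in your cluster.
package dddiag

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

func listStuckDataDownloads(ctx context.Context, client dynamic.Interface, namespace string) error {
	gvr := schema.GroupVersionResource{Group: "velero.io", Version: "v2alpha1", Resource: "datadownloads"}
	list, err := client.Resource(gvr).Namespace(namespace).List(ctx, metav1.ListOptions{})
	if err != nil {
		return err
	}
	for _, dd := range list.Items {
		phase, _, _ := unstructured.NestedString(dd.Object, "status", "phase")
		if phase == "InProgress" {
			fmt.Printf("%s is still InProgress\n", dd.GetName())
		}
	}
	return nil
}
```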
Steps & Commands: datadownloadsissue.tar.gz
What did you expect to happen: Restore all pods & PVs successfully
The following information will help us better understand what's going on:
./velero debug --restore dm-restore-100pods-2gb -nupstream-velero
2023/08/31 11:05:54 Collecting velero resources in namespace: upstream-velero
2023/08/31 11:05:55 Collecting velero deployment logs in namespace: upstream-velero
2023/08/31 11:05:58 Collecting log and information for restore: dm-restore-100pods-2gb
2023/08/31 11:05:58 Generated debug information bundle: bundle-2023-08-31-11-05-54.tar.gz
DEBUG folder (attached) includes:
DM-Data folder includes all datadownloads YAML files (collected with the oc get datadownloads {NAME} -nupstream-velero -oyaml command)
Anything else you would like to add:
I ran the same test with 50, 80, 90, 100, and 120 PVs per namespace. 50 & 80 PVs: completed successfully. 90, 100 & 120 PVs: restore failed.
Environment:
Velero version: main (Velero-1.12), last commit:
commit 30e54b026f34c17e309791e0180c68f13e58c4d9 (HEAD -> main, origin/main, origin/HEAD)
Author: Daniel Jiang jiangd@vmware.com
Date: Wed Aug 16 15:45:00 2023 +0800
Velero features (use velero client config get features):
./velero client config get features
features:
Kubernetes version (use kubectl version):
oc version
Client Version: 4.12.9
Kustomize Version: v4.5.7
Server Version: 4.12.9
Kubernetes Version: v1.25.7+eab9cc9
Cloud provider or hardware configuration: OCP running on bare-metal servers, 3 master & 6 worker nodes
oc get nodes
NAME             STATUS   ROLES                  AGE    VERSION
master-0         Ready    control-plane,master   148d   v1.25.7+eab9cc9
master-1         Ready    control-plane,master   148d   v1.25.7+eab9cc9
master-2         Ready    control-plane,master   148d   v1.25.7+eab9cc9
worker000-r640   Ready    worker                 148d   v1.25.7+eab9cc9
worker001-r640   Ready    worker                 148d   v1.25.7+eab9cc9
worker002-r640   Ready    worker                 148d   v1.25.7+eab9cc9
worker003-r640   Ready    worker                 148d   v1.25.7+eab9cc9
worker004-r640   Ready    worker                 148d   v1.25.7+eab9cc9
worker005-r640   Ready    worker                 148d   v1.25.7+eab9cc9
OS (e.g. from /etc/os-release): Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (part of OpenShift 4.12; RHCOS is a Kubernetes-native operating system)
cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202303211731-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202303211731-0"