vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Restore from S3 repo - failed with error: found a datadownload with status "InProgress" during the node-agent starting, mark it as cancel #7761

Closed: thomasklosinsky closed this issue 1 month ago

thomasklosinsky commented 2 months ago

What steps did you take and what happened: 1 pod with a pvc mounted via ceph-csi in namespace backuptest

Ran:

velero backup create NAME --snapshot-move-data --include-namespaces backuptest   << successful
kubectl delete namespace backuptest   << of course, successful
velero backup describe NAME --details   << all data looking good
velero repo get   << all data there
velero restore create --from-backup NAME

What did you expect to happen: I expected the PVC to be restored from the S3 repo. But the datadownload failed with this error: found a datadownload with status "InProgress" during the node-agent starting, mark it as cancel

The following information will help us better understand what's going on: Log from the node-agent:

time="2024-04-30T10:00:59Z" level=info msg="Restore PVC is created" logSource="pkg/exposer/generic_restore.go:103" owner=backuptest-01-20240430120059-2x42c pvc name=backuptest-01-20240430120059-2x42c source namespace=backuptest target PVC=test-pvc-2 time="2024-04-30T10:00:59Z" level=info msg="Restore is exposed" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:195" time="2024-04-30T10:00:59Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101" time="2024-04-30T10:01:11Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101" time="2024-04-30T10:01:11Z" level=info msg="Data download is prepared" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:226" time="2024-04-30T10:01:11Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101" time="2024-04-30T10:01:11Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:285" time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101" time="2024-04-30T10:10:55Z" level=info msg="Data download is in progress" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:285" time="2024-04-30T10:10:55Z" level=info msg="Data download is being canceled" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:287" time="2024-04-30T10:10:55Z" level=warning msg="Async fs backup data path canceled" controller=DataDownload datadownload=backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:398" time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101" time="2024-04-30T10:10:55Z" level=info msg="Reconcile backuptest-01-20240430120059-2x42c" controller=datadownload datadownload=velero/backuptest-01-20240430120059-2x42c logSource="pkg/controller/data_download_controller.go:101"

Environment:


thomasklosinsky commented 2 months ago

bundle-2024-04-30-12-25-13.tar.gz

sseago commented 2 months ago

That error means that the node agent restarted in the middle of processing a datadownload.
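A quick way to check for that, assuming the default name=node-agent pod label from the standard install. Note that an evicted and recreated pod shows up as a new pod name with RESTARTS 0, so a zero restart counter alone does not rule this out:

kubectl -n velero get pods -l name=node-agent -o wide
# previous logs of a pod that restarted in place
kubectl -n velero logs <node-agent-pod> --previous
# DaemonSet and pod events around the restore time
kubectl -n velero get events --sort-by=.lastTimestamp | grep -i node-agent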

thomasklosinsky commented 2 months ago

Hmmm... no restarts visible:

NAME                      READY   STATUS    RESTARTS   AGE
node-agent-4r7rq          1/1     Running   0          45h
node-agent-8vzlr          1/1     Running   0          2d18h
velero-7d6578dfcc-kdgfq   1/1     Running   0          2d20h

thomasklosinsky commented 2 months ago

Changed the DaemonSet of the node agents to debug log level.
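For anyone following along, a minimal sketch of one way to do that, assuming the default node-agent DaemonSet name and that the node-agent server accepts the same --log-level flag as the Velero server (the exact container args may differ per install):

kubectl -n velero edit daemonset node-agent
# in spec.template.spec.containers[0].args, add the flag, e.g.:
#   args:
#     - node-agent
#     - server
#     - --log-level=debug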

time="2024-05-02T08:02:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:02:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:03:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataUploadList" time="2024-05-02T08:03:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataUploadList" time="2024-05-02T08:03:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:03:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:04:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:04:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:04:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataUploadList" time="2024-05-02T08:04:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataUploadList" time="2024-05-02T08:05:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:05:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataUploadList" time="2024-05-02T08:05:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataUploadList" time="2024-05-02T08:05:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:06:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataUploadList" time="2024-05-02T08:06:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataUploadList" time="2024-05-02T08:06:48Z" level=debug msg="enqueueing resources ..." 
controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:06:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:07:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:07:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:07:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataUploadList" time="2024-05-02T08:07:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataUploadList" time="2024-05-02T08:08:48Z" level=debug msg="enqueueing resources ..." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:08:48Z" level=debug msg="skip enqueue object velero/backuptest-01-20240430120059-2x42c due to the predicate." controller=DataDownload logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataDownloadList" time="2024-05-02T08:08:48Z" level=debug msg="enqueueing resources ..." logSource="pkg/util/kube/periodical_enqueue_source.go:71" resource="v2alpha1.DataUploadList" time="2024-05-02T08:08:48Z" level=debug msg="skip enqueue object velero/backuptest-01-qmk62 due to the predicate." logSource="pkg/util/kube/periodical_enqueue_source.go:97" resource="v2alpha1.DataUploadList"

The restore is already canceled, but the datadownload is still trying to "do something"...

Details of the datadownload:

kubectl -n velero get datadownloads -l velero.io/restore-name=backuptest-01-20240430120059 -o yaml

apiVersion: v1
items:

danfengliu commented 2 months ago

Hi @thomasklosinsky, from the debug bundle you provided you can see that node agent node-agent-4r7rq started at time="2024-04-30T10:10:54Z" and node agent node-agent-8vzlr found an InProgress DataDownload at time="2024-04-30T10:10:55Z". 10:10:54 is the restart time of the node agent, and node-agent-4r7rq is the new pod. The restart was triggered while handling the restore request; in most cases a lack of cluster resources is the reason. You can verify that by checking the Kubernetes cluster event logs, or increase the memory of the cluster nodes to ensure the node agent keeps running.
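A quick way to check for that kind of resource pressure (node name is a placeholder; kubectl top needs metrics-server installed):

kubectl get events -A --sort-by=.lastTimestamp | grep -Ei 'evict|pressure|oom|node-agent'
kubectl top nodes
kubectl describe node <node-name>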


thomasklosinsky commented 2 months ago

backuptest01-20240506133312-wjdq4   1/1   Running   0   2m35s
node-agent-mdxp2                    1/1   Running   0   116m
node-agent-p6nfj                    0/1   Evicted   0   7s
velero-7d6578dfcc-hnvgz             1/1   Running   0   116m

Normal Pulled pod/backuptest01-lq75c Container image "velero/velero:v1.13.2" already present on machine
15m Normal Created pod/backuptest01-lq75c Created container eb16cf27-0fa9-46f2-b410-0ae78e45595a
15m Normal Started pod/backuptest01-lq75c Started container eb16cf27-0fa9-46f2-b410-0ae78e45595a
12m Normal Killing pod/backuptest01-lq75c Stopping container eb16cf27-0fa9-46f2-b410-0ae78e45595a
105s Warning Evicted pod/node-agent-9fpgc The node was low on resource: ephemeral-storage. Threshold quantity: 1255215563, available: 1111428Ki. Container node-agent was using 1289240Ki, request is 0, has larger consumption of ephemeral-storage.
105s Normal Killing pod/node-agent-9fpgc Stopping container node-agent
104s Warning Evicted pod/node-agent-fzv9k The node had condition: [DiskPressure].
103s Normal Scheduled pod/node-agent-fzv9k Successfully assigned velero/node-agent-fzv9k to kubetestw2
102s Normal Scheduled pod/node-agent-hh546 Successfully assigned velero/node-agent-hh546 to kubetestw2
103s Warning Evicted pod/node-agent-hh546 The node had condition: [DiskPressure].
96s Normal Scheduled pod/node-agent-hpwcr Successfully assigned velero/node-agent-hpwcr to kubetestw2
97s Warning Evicted pod/node-agent-hpwcr The node had condition: [DiskPressure].
71s Normal Scheduled pod/node-agent-lvqbx Successfully assigned velero/node-agent-lvqbx to kubetestw2
72s Warning Evicted pod/node-agent-lvqbx The node had condition: [DiskPressure].
39s Normal Scheduled pod/node-agent-p6nfj Successfully assigned velero/node-agent-p6nfj to kubetestw2
40s Warning Evicted pod/node-agent-p6nfj The node had condition: [DiskPressure].
88s Normal Scheduled pod/node-agent-rhn86 Successfully assigned velero/node-agent-rhn86 to kubetestw2
89s Warning Evicted pod/node-agent-rhn86 The node had condition: [DiskPressure].
100s Normal Scheduled pod/node-agent-xnb5b Successfully assigned velero/node-agent-xnb5b to kubetestw2
101s Warning Evicted pod/node-agent-xnb5b The node had condition: [DiskPressure].
104s Warning FailedDaemonPod daemonset/node-agent Found failed daemon pod velero/node-agent-9fpgc on node kubetestw2, will try to kill it
104s Normal SuccessfulDelete daemonset/node-agent Deleted pod: node-agent-9fpgc
104s Normal SuccessfulCreate daemonset/node-agent Created pod: node-agent-fzv9k
103s Warning FailedDaemonPod daemonset/node-agent Found failed daemon pod velero/node-agent-fzv9k on node kubetestw2, will try to kill it
103s Normal SuccessfulDelete daemonset/node-agent Deleted pod: node-agent-fzv9k
103s Normal SuccessfulCreate daemonset/node-agent Created pod: node-agent-hh546
101s Warning FailedDaemonPod daemonset/node-agent Found failed daemon pod velero/node-agent-hh546 on node kubetestw2, will try to kill it
101s Normal SuccessfulDelete daemonset/node-agent Deleted pod: node-agent-hh546
101s Normal SuccessfulCreate daemonset/node-agent Created pod: node-agent-xnb5b
97s Warning FailedDaemonPod daemonset/node-agent Found failed daemon pod velero/node-agent-xnb5b on node kubetestw2, will try to kill it
97s Normal SuccessfulDelete daemonset/node-agent Deleted pod: node-agent-xnb5b
97s Normal SuccessfulCreate daemonset/node-agent Created pod: node-agent-hpwcr
89s Warning FailedDaemonPod daemonset/node-agent Found failed daemon pod velero/node-agent-hpwcr on node kubetestw2, will try to kill it
89s Normal SuccessfulDelete daemonset/node-agent Deleted pod: node-agent-hpwcr
89s Normal SuccessfulCreate daemonset/node-agent Created pod: node-agent-rhn86
72s Warning FailedDaemonPod daemonset/node-agent Found failed daemon pod velero/node-agent-rhn86 on node kubetestw2, will try to kill it
72s Normal SuccessfulDelete daemonset/node-agent Deleted pod: node-agent-rhn86
72s Normal SuccessfulCreate daemonset/node-agent Created pod: node-agent-lvqbx
40s Warning FailedDaemonPod daemonset/node-agent Found failed daemon pod velero/node-agent-lvqbx on node kubetestw2, will try to kill it
40s Normal SuccessfulDelete daemonset/node-agent Deleted pod: node-agent-lvqbx
40s Normal SuccessfulCreate daemonset/node-agent Created pod: node-agent-p6nfj

Means that the IO is too slow... right?

danfengliu commented 2 months ago

"The node was low on resource: ephemeral-storage" tells us that your local storage has no space left for the node agent to cache data. You should check your node's ephemeral storage and increase it.
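For example, on the node from the events above (kubetestw2); paths vary by distro, but ephemeral storage is usually backed by the filesystem holding /var/lib/kubelet and the container runtime:

# capacity vs. allocatable ephemeral-storage reported by the kubelet
kubectl describe node kubetestw2 | grep -A 7 -E '^(Capacity|Allocatable)'
# actual filesystem usage, run on the node itself
df -h /var/lib/kubelet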

danfengliu commented 2 months ago

We have issues #7725 and #7620 as proposed solutions to give users ways to avoid using so much ephemeral storage, but these have not been planned yet. Currently you have to increase your node's ephemeral storage so the node agent has enough space to cache the data that is to be restored.
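Until then, and besides growing the disk, the eviction message above ("request is 0") also suggests declaring an ephemeral-storage request and limit on the node-agent container so its cache usage is accounted for. A sketch only, with purely illustrative sizes, added under spec.template.spec.containers in the node-agent DaemonSet:

resources:
  requests:
    ephemeral-storage: 2Gi   # illustrative; size to your data volumes
  limits:
    ephemeral-storage: 10Gi  # illustrative; does not create more disk space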

thomasklosinsky commented 1 month ago

Added some space to the k8s worker VM disk and it's working now. Thanks a lot!