Closed: erichevers closed this issue 1 month ago.
metadata.managedFields removed for brevity.
{
  "apiVersion": "velero.io/v2alpha1",
  "kind": "DataUpload",
  "metadata": {
    "creationTimestamp": "2024-10-15T16:12:48Z",
    "generateName": "test-backup-",
    "generation": 7,
    "labels": {
      "velero.io/accepted-by": "nl1k8s051",
      "velero.io/async-operation-id": "du-26a5d530-3ad3-42e7-b628-fdbfe5b46f81.c86e6b7b-baf7-4bb6b9b76",
      "velero.io/backup-name": "test-backup",
      "velero.io/backup-uid": "26a5d530-3ad3-42e7-b628-fdbfe5b46f81",
      "velero.io/pvc-uid": "c86e6b7b-baf7-4bb2-b645-c5b56b277abf"
    },
    "name": "test-backup-fxjtf",
    "namespace": "velero",
    "ownerReferences": [
      {
        "apiVersion": "velero.io/v1",
        "controller": true,
        "kind": "Backup",
        "name": "test-backup",
        "uid": "26a5d530-3ad3-42e7-b628-fdbfe5b46f81"
      }
    ],
    "resourceVersion": "60636043",
    "uid": "f685ea82-dd12-43bc-b085-ea8a012c60d6"
  },
  "spec": {
    "backupStorageLocation": "default",
    "cancel": true,
    "csiSnapshot": {
      "snapshotClass": "csi-rbdplugin-snapclass",
      "storageClass": "rook-ceph-block",
      "volumeSnapshot": "velero-vaidio-12346-vaidio-data-volume-vrn8x"
    },
    "operationTimeout": "10m0s",
    "snapshotType": "CSI",
    "sourceNamespace": "vaidio-12346",
    "sourcePVC": "vaidio-12346-vaidio-data-volume"
  },
  "status": {
    "completionTimestamp": "2024-10-15T16:13:17Z",
    "message": "found a dataupload with status \"InProgress\" during the node-agent starting, mark it as cancel",
    "node": "nl1k8s032",
    "phase": "Canceled",
    "progress": {
      "bytesDone": 380953606,
      "totalBytes": 1544556558
    },
    "startTimestamp": "2024-10-15T16:12:56Z"
  }
}
Seen this before. This happens due to a node-agent pod restart, usually caused by a resource eviction issue.
That also explains why the log of pod node-agent-wgqlk on node nl1k8s032 is so short.
The resources are set very low for data movement, especially for a volume of that size; more memory may be required (a patch sketch follows the resources block below).
"resources": {
"limits": {
"cpu": "2",
"memory": "1Gi"
},
"requests": {
"cpu": "1",
"memory": "512Mi"
}
},
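As a rough sketch only (the 2Gi/10Gi values are illustrative, and the DaemonSet and container names are assumed to both be node-agent in the velero namespace), the limits could be raised with a strategic merge patch:

# Raise the node-agent memory; the DaemonSet then performs a rolling restart of its pods
kubectl -n velero patch daemonset node-agent --type strategic -p \
  '{"spec":{"template":{"spec":{"containers":[{"name":"node-agent","resources":{"requests":{"memory":"2Gi"},"limits":{"memory":"10Gi"}}}]}}}}'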
We may want to consider capturing deployment/daemonset details in the bundle in the future. Eviction events typically show up in an event listing (which rolls off over time) and in the deployment/daemonset status.
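A hedged sketch of checking those signals (events roll off quickly, so this needs to be run soon after a failed attempt):

# Recent events in the velero namespace, filtered for evictions, OOM kills and node-agent pods
kubectl -n velero get events --sort-by=.lastTimestamp | grep -iE 'evict|oom|node-agent'

# Restart counts and DaemonSet rollout state
kubectl -n velero get pods -l name=node-agent
kubectl -n velero describe daemonset node-agent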
One thing to add. Here is the config of the default BSL spec:

config:
  checksumAlgorithm: ""
  region: ams3
  s3ForcePathStyle: "true"
  s3Url: https://ams3.digitaloceanspaces.com
With or without checksumAlgorithm set, the result was the same. Smaller backups to the S3-compatible storage work, so the credentials should be fine.
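For completeness, a quick sanity check of the location itself might look like this (assuming the BSL is named default in the velero namespace; output details vary by version):

# The BSL should report itself as Available if the credentials and s3Url are reachable
velero backup-location get default
kubectl -n velero get backupstoragelocation default -o yaml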
Thanks for the response. I've bumped the resources of the velero deployment/pod to:

resources:
  limits:
    cpu: "6"
    memory: 10Gi
  requests:
    cpu: "2"
    memory: 5Gi
But I still got the same issue. It did go a bit further, it seems, but the backup is still cancelled:

NAME                STATUS     STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE   NODE
test-backup-2dwzj   Canceled   23s       386087955    1544556558    default            37s   nl1k8s033
Do I need to delete all the node-agents to get the new CPU/memory values?
When the node-agent DaemonSet resources are changed, the pods should automatically go through a rolling restart. Make sure the rolling restart finishes before starting a new backup. If a node-agent restarts during data movement, the DataUpload will look the same as before: cancelled due to the restart. You can typically verify this by checking that the DataUpload's metadata.creationTimestamp is after the restart.
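A minimal sketch of that check, assuming the default node-agent DaemonSet in the velero namespace:

# Wait for the rolling restart to finish before starting a new backup
kubectl -n velero rollout status daemonset/node-agent

# Compare pod start times with the DataUpload creation timestamps
kubectl -n velero get pods -l name=node-agent -o custom-columns=NAME:.metadata.name,STARTED:.status.startTime
kubectl -n velero get datauploads -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp,PHASE:.status.phase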
10Gi of memory is typically enough for most environments. With kopia data movement, an increase of the CPU limit will also increase the memory used, due to additional data streams. So I would set the CPU limit back to what it was and keep it constant until you have found the initial memory requirements.
If you have Prometheus or Grafana, use those; otherwise plain kubectl top with a simple script running in the background is enough. Since the last backup attempt ran for only 23s, you may want a smaller interval to start with.
watch --interval 15 "kubectl top pod --sort-by=memory --namespace <velero namespace> -l name=node-agent | tee -a pod-resources.txt"
is a rough but simple way to check node-agent pod memory usage during a backup.
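If timestamps are useful for correlating samples with the DataUpload start/completion times, a plain loop (interval illustrative) is an alternative:

# Append a timestamped sample every 10 seconds while the backup runs
while true; do
  date
  kubectl top pod --namespace <velero namespace> -l name=node-agent --sort-by=memory
  sleep 10
done >> pod-resources.txt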
Thanks for pointing me in the right direction. I did check the CPU and memory usage in Grafana. However, I had only increased the memory of the Velero deployment, not of the node-agent DaemonSet. I've now increased the memory of the node-agent DaemonSet to 10Gi, and that did the trick: all node-agents restarted and the backup finished completely. I will see if the 10Gi can be lowered, but for now I'm happy everything works. Thanks very much for the fast response.
What steps did you take and what happened: running "velero backup create test-backup --include-namespaces vaidio-12346 --snapshot-move-data" makes a good backup for the PVCs of a few GB, but fails on a larger volume (200GB). After some data transfer for that volume, it goes into the Cancelled state. This happens when backing up to DigitalOcean Spaces and to NetApp Gridstore, both AWS S3-compatible object storage.
What did you expect to happen: data movement from a snapshot of a large volume to succeed
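For reference, a hedged sketch of reproducing and watching the failure, using the names from this report:

# Reproduce, then watch the data movement for the large PVC
velero backup create test-backup --include-namespaces vaidio-12346 --snapshot-move-data
kubectl -n velero get datauploads -l velero.io/backup-name=test-backup -w

# If a DataUpload flips to Canceled, check which node handled it and whether that node-agent pod restarted
kubectl -n velero get pods -l name=node-agent -o wide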
The following information will help us better understand what's going on:
If you are using velero v1.7.0+:
Please use
velero debug --backup <backupname> --restore <restorename>
to generate the support bundle, and attach it to this issue; for more options please refer to velero debug --help
bundle-2024-10-15-18-14-43.tar.gz
If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)
kubectl logs deployment/velero -n velero
velero backup describe <backupname>
or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename>
or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>
Anything else you would like to add:
I'm using the AWS plugin version 1.10.1, but I also tried others and got the same issue. The source of the data is on rook-ceph.
Environment:
- velero version: 1.14.1
- velero client config get features: features: EnableCSI
- kubectl version: Client Version v1.30.0, Kustomize Version v5.0.4-0.20230601165947-6ce0bf390ce3, Server Version v1.30.1+rke2r1
- /etc/os-release: