vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

snapshot-move-data of larger volumes gets cancelled after first few GB to AWS S3 compatible storage #8303

Closed: erichevers closed this issue 1 month ago

erichevers commented 1 month ago

What steps did you take and what happened: running "velero backup create test-backup --include-namespaces vaidio-12346 --snapshot-move-data" makes a good backup of the PVCs that are only a few GB, but fails on a larger volume (200GB). After transferring some data for that volume, the data movement goes into a cancelled state. This happens when backing up to both DigitalOcean Spaces and NetApp Gridstore, both AWS S3 compatible object storage.

What did you expect to happen: data movement from a snapshot of a large volume to succeed
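
For anyone reproducing this, a minimal sketch of how the data movement can be watched while the backup runs, assuming the default velero namespace (the velero.io/backup-name label appears on the DataUpload excerpt later in this thread):

    # List the DataUpload objects created for this backup by snapshot data movement
    kubectl -n velero get datauploads -l velero.io/backup-name=test-backup

    # Per-volume data movement progress for the backup
    velero backup describe test-backup --details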

The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle and attach it to this issue; for more options, refer to velero debug --help.

Attached: bundle-2024-10-15-18-14-43.tar.gz

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

Anything else you would like to add:

I'm using the aws plugin version 1.10.1, but I also tried other versions and got the same issue. The source of the data is on rook-ceph.

Environment:

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

msfrucht commented 1 month ago

metadata.managedFields removed for brevity.

        {
            "apiVersion": "velero.io/v2alpha1",
            "kind": "DataUpload",
            "metadata": {
                "creationTimestamp": "2024-10-15T16:12:48Z",
                "generateName": "test-backup-",
                "generation": 7,
                "labels": {
                    "velero.io/accepted-by": "nl1k8s051",
                    "velero.io/async-operation-id": "du-26a5d530-3ad3-42e7-b628-fdbfe5b46f81.c86e6b7b-baf7-4bb6b9b76",
                    "velero.io/backup-name": "test-backup",
                    "velero.io/backup-uid": "26a5d530-3ad3-42e7-b628-fdbfe5b46f81",
                    "velero.io/pvc-uid": "c86e6b7b-baf7-4bb2-b645-c5b56b277abf"
                },
                "name": "test-backup-fxjtf",
                "namespace": "velero",
                "ownerReferences": [
                    {
                        "apiVersion": "velero.io/v1",
                        "controller": true,
                        "kind": "Backup",
                        "name": "test-backup",
                        "uid": "26a5d530-3ad3-42e7-b628-fdbfe5b46f81"
                    }
                ],
                "resourceVersion": "60636043",
                "uid": "f685ea82-dd12-43bc-b085-ea8a012c60d6"
            },
            "spec": {
                "backupStorageLocation": "default",
                "cancel": true,
                "csiSnapshot": {
                    "snapshotClass": "csi-rbdplugin-snapclass",
                    "storageClass": "rook-ceph-block",
                    "volumeSnapshot": "velero-vaidio-12346-vaidio-data-volume-vrn8x"
                },
                "operationTimeout": "10m0s",
                "snapshotType": "CSI",
                "sourceNamespace": "vaidio-12346",
                "sourcePVC": "vaidio-12346-vaidio-data-volume"
            },
            "status": {
                "completionTimestamp": "2024-10-15T16:13:17Z",
                "message": "found a dataupload with status \"InProgress\" during the node-agent starting, mark it as cancel",
                "node": "nl1k8s032",
                "phase": "Canceled",
                "progress": {
                    "bytesDone": 380953606,
                    "totalBytes": 1544556558
                },
                "startTimestamp": "2024-10-15T16:12:56Z"
            }

I've seen this before. It happens when a node-agent pod restarts, usually due to a resource eviction.

That also explains why the logs of pod node-agent-wgqlk on node nl1k8s032 are so short.

Resources are set very low for data movement, especially for a volume of that size. More memory may be required.

                        "resources": {
                            "limits": {
                                "cpu": "2",
                                "memory": "1Gi"
                            },
                            "requests": {
                                "cpu": "1",
                                "memory": "512Mi"
                            }
                        },

You may want to consider capturing deployment/daemonset details in the support bundle in the future. Eviction events typically show up in an event listing (which rolls off over time) and in the deployment/daemonset status.
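
As a rough sketch of the two suggestions above, assuming the default velero namespace and a daemonset and container both named node-agent (as in a standard install); the resource values are illustrative only, not a recommendation from this thread:

    # Raise the node-agent resources (default strategic merge patch; values are illustrative)
    kubectl -n velero patch daemonset node-agent --patch \
      '{"spec":{"template":{"spec":{"containers":[{"name":"node-agent","resources":{"requests":{"cpu":"1","memory":"2Gi"},"limits":{"cpu":"2","memory":"8Gi"}}}]}}}}'

    # Check for recent evictions/OOM kills before the events roll off
    kubectl -n velero get events --sort-by=.lastTimestamp | grep -iE 'evict|oom'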

erichevers commented 1 month ago

One thing to add. Here is the config of the default BSL spec:

    config:
      checksumAlgorithm: ""
      region: ams3
      s3ForcePathStyle: "true"
      s3Url: https://ams3.digitaloceanspaces.com

With or without the checksumAlgorithm setting, the result was the same. And smaller backups to the S3-compatible storage work, so the credentials should be fine.
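
For reference, the full BackupStorageLocation with that config would look roughly like the sketch below; the bucket name is a placeholder, not taken from this issue:

    apiVersion: velero.io/v1
    kind: BackupStorageLocation
    metadata:
      name: default
      namespace: velero
    spec:
      provider: aws
      objectStorage:
        bucket: <spaces-bucket>   # placeholder
      config:
        checksumAlgorithm: ""
        region: ams3
        s3ForcePathStyle: "true"
        s3Url: https://ams3.digitaloceanspaces.com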

erichevers commented 1 month ago

Thanks for the response. I've bumped the resources of the velero deployment/pod to:

    resources:
      limits:
        cpu: "6"
        memory: 10Gi
      requests:
        cpu: "2"
        memory: 5Gi

But I still got the same issue. It did seem to go a bit further, but it is still cancelled:

    NAME                STATUS     STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE   NODE
    test-backup-2dwzj   Canceled   23s       386087955    1544556558    default            37s   nl1k8s033

Do I need to delete all the node-agents to get the new CPU/memory values?
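
One way to check what the node-agent pods are actually running with, assuming the default velero namespace and the name=node-agent pod label used later in this thread:

    # Limits currently set in the daemonset template
    kubectl -n velero get daemonset node-agent \
      -o jsonpath='{.spec.template.spec.containers[0].resources}'

    # Limits the running pods actually carry (only updated after a rolling restart)
    kubectl -n velero get pods -l name=node-agent \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[0].resources}{"\n"}{end}'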

msfrucht commented 1 month ago

When the node-agent daemonset resources are changed, the pods should automatically go through a rolling restart. Make sure the rolling restart finishes before starting another backup. If a node-agent restarts during data movement, the DataUpload will look the same as before: cancelled due to restart. You can typically confirm this by checking whether the DataUpload's metadata.creationTimestamp is after the restart.
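
A quick way to confirm the rollout has finished and that the pods were recreated after the change, assuming the default velero namespace:

    # Wait for the node-agent rolling restart to complete
    kubectl -n velero rollout status daemonset/node-agent

    # Compare pod creation timestamps with the time of the resource change
    kubectl -n velero get pods -l name=node-agent \
      -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp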

10Gi of memory is typically enough for most environments. With kopia data movement, an increase of the CPU limit will also increase the memory used, due to additional data streams. So I would roll the CPU limit back to what it was and keep it constant until you find the initial memory requirements.

If you have Prometheus or Grafana, use those; if nothing else is available, kubectl top with a simple script running in the background will do. Since the last backup only ran for 23s, you may want a smaller interval to start out with.

watch --interval 15 "kubectl top pod --sort-by=memory --namespace <velero namespace> -l name=node-agent | tee -a pod-resources.txt"

is a rough but simple way to check node-agent pod memory usage during a backup.

erichevers commented 1 month ago

Thanks for pointing me in the right direction. I did check the CPU and memory usage in Grafana. However, I had only increased the memory of the Velero deployment, not the node-agent daemonset. I've now increased the memory of the node-agent daemonset to 10Gi and that did the trick. All node-agents restarted and the backup finished completely. I will see if the 10Gi can be lowered, but for now I'm happy everything works. Thanks very much for the fast response.