Thanks @kskurtveit for reporting this issue. Checking the logs in the bundles, the performance issue appears to be caused by slowness in handling PVBs (PodVolumeBackups). I found log messages like these from node-agent in the v1.12.2 bundle:
time="2024-01-11T12:15:23Z" level=info msg="Got volume dir" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/exposer/host_path.go:48" pod UID=5f7361f9-9062-4f4e-9b17-32249bc714d0 pod name=automation-hub-postgresql-0 podvolumebackup=velero/workload-daily-20240111121517-54nvr volDir=pvc-f6b5e5de-d6db-4550-b425-110941332b7e volume=data
time="2024-01-11T12:15:23Z" level=info msg="Found path matching glob" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/exposer/host_path.go:68" path="/host_pods/5f7361f9-9062-4f4e-9b17-32249bc714d0/volumes/kubernetes.io~nfs/pvc-f6b5e5de-d6db-4550-b425-110941332b7e" pod UID=5f7361f9-9062-4f4e-9b17-32249bc714d0 pod name=automation-hub-postgresql-0 podvolumebackup=velero/workload-daily-20240111121517-54nvr volume=data
time="2024-01-11T12:15:23Z" level=info msg="Founding existing repo" backupLocation=backup-kind-skuk logSource="pkg/repository/ensurer.go:86" repositoryType=restic volumeNamespace=automation-hub
time="2024-01-11T12:15:24Z" level=info msg="FileSystemBR is initialized" backup=velero/workload-daily-20240111121517 bsl=backup-kind-skuk controller=podvolumebackup jobName=workload-daily-20240111121517-54nvr logSource="pkg/datapath/file_system.go:110" podvolumebackup=velero/workload-daily-20240111121517-54nvr repository=restic source namespace=automation-hub uploader=restic
time="2024-01-11T12:15:24Z" level=info msg="Looking for most recent completed PodVolumeBackup for this PVC" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:287" podvolumebackup=velero/workload-daily-20240111121517-54nvr pvcUID=f6b5e5de-d6db-4550-b425-110941332b7e
time="2024-01-11T12:15:24Z" level=info msg="Found most recent completed PodVolumeBackup for PVC" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:335" parentPodVolumeBackup=workload-daily-20240111101118-nlj8r parentSnapshotID= podvolumebackup=velero/workload-daily-20240111121517-54nvr pvcUID=f6b5e5de-d6db-4550-b425-110941332b7e
time="2024-01-11T12:15:24Z" level=info msg="No parent snapshot found for PVC, not based on parent snapshot for this backup" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:175" podvolumebackup=velero/workload-daily-20240111121517-54nvr
time="2024-01-11T12:15:24Z" level=info msg="Async fs backup data path started" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:185" path="/host_pods/5f7361f9-9062-4f4e-9b17-32249bc714d0/volumes/kubernetes.io~nfs/pvc-f6b5e5de-d6db-4550-b425-110941332b7e" podvolumebackup=velero/workload-daily-20240111121517-54nvr
time="2024-01-11T12:15:24Z" level=info msg="PodVolumeBackup starting" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:117" podvolumebackup=velero/workload-daily-20240111121517-bqs5j
time="2024-01-11T12:15:25Z" level=error msg="Restic command fail with ExitCode: 1. Process ID is 22, Exit error is: exit status 1" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/util/exec/exec.go:66" parentSnapshot= path="/host_pods/5f7361f9-9062-4f4e-9b17-32249bc714d0/volumes/kubernetes.io~nfs/pvc-f6b5e5de-d6db-4550-b425-110941332b7e" podvolumebackup=velero/workload-daily-20240111121517-54nvr
time="2024-01-11T12:15:25Z" level=info msg="Async fs backup data path completed" PVB=workload-daily-20240111121517-54nvr controller=PodVolumeBackup logSource="pkg/controller/pod_volume_backup_controller.go:195" pvb=workload-daily-20240111121517-54nvr
time="2024-01-11T12:15:25Z" level=info msg="PodVolumeBackup completed" controller=PodVolumeBackup logSource="pkg/controller/pod_volume_backup_controller.go:225" pvb=workload-daily-20240111121517-54nvr
time="2024-01-11T12:15:25Z" level=info msg="FileSystemBR is closed" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/datapath/file_system.go:129" podvolumebackup=velero/workload-daily-20240111121517-54nvr user=workload-daily-20240111121517-54nvr
time="2024-01-11T12:16:24Z" level=info msg="PodVolumeBackup starting" backup=velero/workload-daily-20240111121517 controller=podvolumebackup logSource="pkg/controller/pod_volume_backup_controller.go:117" podvolumebackup=velero/workload-daily-20240111121517-bqs5j
After handling workload-daily-20240111121517-54nvr, instead of handling the next PVB immediately, the node-agent waited for 1 minute. This is caused by the change we made in v1.12 to let node-agent handle data transfers concurrently, where the PVB is requeued and only re-fetched after a minute:
https://github.com/vmware-tanzu/velero/blob/b57dde157207da80e76322520ebd3795e3a194a8/pkg/controller/pod_volume_backup_controller.go#L126-L125
This hurts performance when there are many PVBs in a row and each one takes very little time in node-agent.
@Lyndon-Li we may consider implementing an in-memory queue instead of relying on controller-runtime here, so that the next PVB can be handled immediately instead of waiting for 1 minute (a rough sketch of the idea is below). Another option is to reduce the value of RequeueAfter, but that may cause unnecessarily frequent entries into Reconcile when some volumes take a very long time to handle.
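For illustration only, here is a minimal sketch of the in-memory queue idea in plain Go (this is not Velero code; the pvbItem type, the worker count, and the simulated work are assumptions made up for the example). A fixed pool of workers pulls pending PVBs from a buffered channel, so the next item starts the moment a slot frees, with no requeue delay in between:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// pvbItem stands in for a pending PodVolumeBackup; a real controller would
// carry the namespaced name or the PVB object itself.
type pvbItem struct {
	name string
}

func main() {
	const concurrentNum = 2 // analogous to the per-controller concurrency limit

	queue := make(chan pvbItem, 100) // in-memory FIFO of pending PVBs
	var wg sync.WaitGroup

	// Fixed worker pool: each worker picks up the next PVB as soon as it
	// finishes the previous one -- no RequeueAfter delay in between.
	for i := 0; i < concurrentNum; i++ {
		wg.Add(1)
		go func(worker int) {
			defer wg.Done()
			for item := range queue {
				fmt.Printf("worker %d handling %s\n", worker, item.name)
				time.Sleep(200 * time.Millisecond) // simulate a small volume backup
			}
		}(i)
	}

	// Enqueue a burst of small PVBs, as produced by a backup with many volumes.
	for i := 0; i < 10; i++ {
		queue <- pvbItem{name: fmt.Sprintf("workload-daily-pvb-%d", i)}
	}
	close(queue)
	wg.Wait()
}
```

The trade-off versus simply reducing RequeueAfter is that a free slot is reused immediately, but the controller then bypasses the retry/backoff semantics the controller-runtime workqueue provides.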
Since 1.12, the node-agent controllers support running data path activities concurrently; there is a concurrentNum that limits how many concurrent tasks each controller may run. If a controller reaches the limit, it gives up processing the request and requeues it after 1 minute.
For small volumes, this 1-minute wait is clearly too long.
Ideally, when the controller completes one request, it would ask the controller manager to requeue a pending request immediately. However, the current controller-runtime doesn't provide this functionality, so we will shorten the requeue period, e.g., to 5 seconds (sketched below).
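As a minimal sketch of the pattern being described (not the actual Velero reconciler; the slots field and the requeueDelay constant are illustrative assumptions), when all concurrent data-path slots are busy the Reconcile returns a controller-runtime Result with RequeueAfter set, so lowering that duration shortens the idle gap between small PVBs:

```go
package controller

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// requeueDelay is how long a request waits before being retried when all
// concurrent data-path slots are busy. Shortening it from 1 minute to a few
// seconds reduces the dead time between many small PVBs.
const requeueDelay = 5 * time.Second

type pvbReconciler struct {
	// slots acts as a counting semaphore for concurrent data-path runs,
	// sized to the controller's concurrency limit.
	slots chan struct{}
}

func (r *pvbReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	select {
	case r.slots <- struct{}{}:
		// A slot is free: start the async data path for this PVB here
		// (omitted), releasing the slot when the backup completes.
	default:
		// All slots are busy: hand the request back to the workqueue and
		// ask to see it again after requeueDelay.
		return ctrl.Result{RequeueAfter: requeueDelay}, nil
	}
	return ctrl.Result{}, nil
}
```

Too small a delay has the downside noted above: when long-running volumes keep every slot busy, Reconcile is re-entered frequently without doing useful work.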
What steps did you take and what happened: I upgraded one of our development clusters to the new Velero v1.12.2 and suddenly saw a huge increase in backup time compared to Velero version 1.11.1. Previously, a backup of our development cluster usually took 5-6 minutes with restic. With the upgraded installation, also using restic, it took almost 40 minutes.
What did you expect to happen:
I expected the same speed as before, since only the Velero version has changed. I understand that Kopia will replace restic in future versions, but I did not expect such an increase in backup time from just upgrading the Velero version. For the examples described below I have used only restic.
Anything else you would like to add: I then installed Velero in a kind cluster using the same backup location as our dev cluster, took a backup with Velero version 1.11.1, and made a log bundle. All seems good, and since it's a kind cluster the number of files is low, so the backup succeeds very quickly.
I use a schedule which I have written this way:
velero schedule create workload-daily --default-volumes-to-fs-backup --ttl 336h --exclude-namespaces kube-system,kube-public,kube-node-lease,nsx-system,pks-system,velero,vmware-system-csi,konmon --schedule="0 17 * * *" --include-cluster-resources
Then I installed Velero version 1.12.2, and the same backup of the same files now takes much longer. Nothing else has changed: I created the same schedule and took a backup.
The result is 1 minute vs. 9 minutes in the kind cluster, with only the Velero version changed. In our development K8s cluster the difference was 35 minutes.
Environment:
- Velero version (use velero version): v1.11.1 and v1.12.2
- Velero features (use velero client config get features): No features set.
- Kubernetes version (use kubectl version): Both v1.25.11 (Kind cluster with above tests performed) and v1.26.10+vmware.1 (TKGI by VMware in our dev cluster)
- OS (e.g. from /etc/os-release): Kind runs on RHEL9. TKGI runs on our own VMware environment with TKGI 1.17.2 and vSphere 8.

Velero debug logs from both test cases: bundle-2024-01-11-13-24-42-v1.12.2.tar.gz bundle-2024-01-11-11-12-46-v1.11.1.tar.gz
Any ideas on what has changed between Velero version 1.11.1 and version 1.12.2?
Thanks in advance!