vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

Reduce excessive number of running datamovers relative to loadConcurrency values #8344

Open msfrucht opened 4 weeks ago

msfrucht commented 4 weeks ago

Describe the problem/challenge you have

During backup with EnableCSI and snapshot-move-data enabled, a large number of datamover pods are started after the initial resource backup finishes.

Some of our applications have upwards of 100 PVCs resulting in nearly a hundred datamover pods.

The existing design creates all of the datamovers up front but only lets them perform the backup in parallel up to the loadConcurrency configuration. While the waiting datamover pods use few actual resources, the resource limits have to be set to what the most resource-intensive volume requires, which skews scheduling of other workloads.

The loadConcurrency cannot be increased beyond a certain point or kopia begins to run out of memory on the nodes.

The attached debug logs illustrate a small example: with loadConcurrency constrained to 1 and loadAffinity restricted to a single node, backing up 10 PVCs results in 10 datamover pods, only one of which executes.

bundle-2024-10-23-15-47-30.tar.gz
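For context, the loadConcurrency and loadAffinity settings referenced above live in the node-agent configuration ConfigMap. A minimal sketch of the configuration used in this reproduction might look like the following (the node name is a placeholder, and the exact ConfigMap name/flag depends on the Velero version):

```json
{
  "loadConcurrency": {
    "globalConfig": 1
  },
  "loadAffinity": [
    {
      "nodeSelector": {
        "matchLabels": {
          "kubernetes.io/hostname": "node1"
        }
      }
    }
  ]
}
```

With this in place, all 10 datamover pods are still created and scheduled onto node1, but only one at a time is allowed to run the data path.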

Describe the solution you'd like

Do not deploy datamover pods until the load concurrency policy allows them to run.

Anything else you would like to add:

Proposed Solution:

  1. New state: DatamoverDeploying

Instead of deploying the datamover during Prepared, move the concurrency check and the datamover pod creation into this state.

  2. To move from Accepted -> Prepared: Largely remains unchanged, but without the datamover pod creation.

Copy the VolumeSnapshot and VolumeSnapshotContent, and create the PVC. On restore, the only change is to create the PVC from the application PVC spec. Do not create the datamover container.

  3. State change: Prepared -> DatamoverDeploying:

The concurrency check occurs in this state. On success, create the datamover Pod.

  4. State change: DatamoverDeploying -> InProgress: The DataUpload moves forward when the datamover pod reaches phase Running, as in the current design.
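To make the proposed gating concrete, here is a minimal, hypothetical Go sketch (not Velero's actual controller code): the concurrency check is modeled as a simple slot counter, and the datamover pod is only created once a slot is free, so no idle pods pile up.

```go
package main

import "fmt"

// Phases of the proposed DataUpload state machine; DatamoverDeploying is new.
type Phase string

const (
	PhasePrepared           Phase = "Prepared"
	PhaseDatamoverDeploying Phase = "DatamoverDeploying"
	PhaseInProgress         Phase = "InProgress"
)

// concurrencyGate models loadConcurrency: at most `limit` datamover
// pods may run at once on a node.
type concurrencyGate struct {
	limit   int
	running int
}

func (g *concurrencyGate) tryAcquire() bool {
	if g.running < g.limit {
		g.running++
		return true
	}
	return false
}

// reconcile advances a DataUpload one step. Unlike the current design,
// no datamover pod exists until the gate admits it.
func reconcile(phase Phase, gate *concurrencyGate) Phase {
	switch phase {
	case PhasePrepared:
		// Snapshot exposed and PVC created, but no pod yet.
		return PhaseDatamoverDeploying
	case PhaseDatamoverDeploying:
		if gate.tryAcquire() {
			// Create the datamover pod here; once it reaches
			// phase Running, the DataUpload moves to InProgress.
			return PhaseInProgress
		}
		// Gate full: stay put, consuming no node resources.
		return PhaseDatamoverDeploying
	}
	return phase
}

func main() {
	gate := &concurrencyGate{limit: 1}
	first := reconcile(reconcile(PhasePrepared, gate), gate)
	second := reconcile(reconcile(PhasePrepared, gate), gate)
	fmt.Println(first, second) // prints "InProgress DatamoverDeploying"
}
```

The key difference from today's workflow is where the wait happens: in the Pending pod (current design) versus in the DataUpload CR (proposed), which costs the scheduler nothing.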

Environment:

Client:
Version: main
Git commit: c53ab20d56450d7b8a626e78fec15bfc3fd896f8-dirty

Server:
Version: main

The only changes are to the Makefile, plus a Dockerfile.ubi adapted from github.com/openshift/velero to make a Velero container compatible with OpenShift. The Dockerfile from Velero will not work on OpenShift due to permission-denied errors when creating folders. I am aware of my less than stellar reputation. I do not have access to generic Kubernetes at all, nor am I allowed to deploy it, so I have no choice but to use an alternate Dockerfile to get Velero into a container without the Red Hat OADP additions and internal modifications.

A copy of the changes to Makefile and added Dockerfile.ubi are available at my branch at https://github.com/msfrucht/openshift-velero/tree/velero_in_openshift

Client Version: 4.15.12 Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3 Server Version: 4.16.8 Kubernetes Version: v1.29.7+4510e9c

Vote on this issue!

This is an invitation to the Velero community to vote on issues; you can see the project's top-voted issues listed here.
Use the "reaction smiley face" to the right of this comment to vote.

Lyndon-Li commented 4 weeks ago

This is indeed a problem. Let me think about it further. The goal is to release the resources that would otherwise be reserved by CPU/memory limits, so we may be able to use a loose control and avoid heavy changes to the existing workflow.

Lyndon-Li commented 4 weeks ago

Some more input: by default, a backupPod/restorePod uses the BestEffort resource policy, that is, no request or limit is set. Therefore, the only thing that impacts the Kubernetes scheduler is the number of backupPods/restorePods. Though that impact is also noticeable when the number is large, it won't shut other pods out the way resource requests/limits do.

Therefore, the current issue could be alleviated somewhat if users follow the recommendation --- unless necessary, the resource configuration should not be set.

On the other hand, it is not a bad idea to allow some overflow in the number of backupPods beyond loadConcurrency --- the snapshot expose also takes considerable time, so the pipeline improves if several backupPods are doing the snapshot expose while others, within loadConcurrency, are running VGDP.

Therefore, a well-designed strategy that considers node selection and node concurrency may be the ultimate solution.

msfrucht commented 4 weeks ago

Yes, I noticed, which is why I said it skews scheduling rather than blocking it.

The existing state model allows datamovers to remain in state Pending without issue when pod and node limits are hit. I've rarely seen anyone go through and set priority options, and developers tend not to want to set a default because user priorities vary.

Anything with a priority set went ahead of the datamover pods in the large applications. Few people bother to set pod priorities or use PriorityClass (requires k8s 1.24). It tends to be at the bottom of the pile of options to implement, and no one wants to go through and set priorities for every workload, especially in shared environments where the admin may not even know what the applications are.

Setting node-agent-config podResources is absolutely required right now for anything resembling sane scheduling, in order to get Burstable or Guaranteed QoS. Once kopia starts, the datamover's memory usage goes from a few MB to sometimes GB, depending on the workload and the changes since the last backup. Using BestEffort with that behavior is just a recipe for evictions.

I am currently testing with user-created evictions via kubectl evict-pod. Having to clear out a hundred or so datamover pods after a failed state progression is not desirable.

msfrucht commented 2 weeks ago

@Lyndon-Li I've thought of another thing that might help with scheduling besides PriorityClass and priority options.

Currently, the node-agent-config podResources has to be set high enough to account for the largest workload in the backup or restore in order to get Burstable or Guaranteed QoS.

Being able to set resource profiles for sets of volumes would make scheduling under Burstable and Guaranteed QoS less disruptive, since the resource values could be set more accurately.

This was already an issue with node-agent backups in Velero 1.14, but you can't exactly swap out the resource values in the middle of a running pod. Datamover pods don't have that inherent restriction.
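For reference, today's node-agent configuration supports a single, global podResources setting, roughly like the sketch below (values are placeholders); the per-volume-set profiles suggested above would be an extension of this, not something that exists today:

```json
{
  "podResources": {
    "cpuRequest": "500m",
    "cpuLimit": "2",
    "memoryRequest": "512Mi",
    "memoryLimit": "4Gi"
  }
}
```

Because this one setting applies to every datamover pod, it has to be sized for the largest volume, which is exactly the over-reservation problem described above.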

Lyndon-Li commented 2 weeks ago

> Currently the node-agent-config podResources have to be set sufficiently high

Actually, the default behavior is BestEffort, that is, no request/limit is set. And this is also the recommended behavior. The thing is, once users allow a node to run data mover pods, they accept the impact of the data mover pods, including CPU, memory, network bandwidth, etc.; after all, data mover activities are inevitably resource consuming.

> Being able to set resource profiles for sets of volume would make scheduling less disruptive

Could you share an example of how the profiles would be set? Are you talking about the case where the resource policy is not BestEffort?