vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

DataMover - datauploads and datadownloads resources aren't distributed equally among the workers. #6734

Open duduvaa opened 10 months ago

duduvaa commented 10 months ago

What steps did you take and what happened:

While running data mover backup and restore tests, I monitored the DataUpload and DataDownload resources and noticed they weren't distributed equally among the worker nodes. This makes the tests take a long time, and running several cycles of the same test produces inconsistent results.

Another issue: the test does not run at the maximum concurrency (one resource per node).

What did you expect to happen: The resources should be distributed among all workers as evenly as possible.

The following information will help us better understand what's going on:

Anything else you would like to add:

5 backup cycles, duration and DataUpload distribution (namespace with 100 PVs):

| Duration | worker000-r640 | worker001-r640 | worker002-r640 | worker003-r640 | worker004-r640 | worker005-r640 |
|----------|----------------|----------------|----------------|----------------|----------------|----------------|
| 0:23:48  | 3  | 61 | 0  | 4  | 32 | 0  |
| 0:16:39  | 11 | 23 | 14 | 20 | 19 | 13 |
| 0:17:53  | 20 | 17 | 15 | 16 | 9  | 23 |
| 0:18:45  | 24 | 15 | 22 | 17 | 6  | 16 |
| 0:28:39  | 26 | 15 | 20 | 20 | 2  | 17 |

5 restore cycles, DataDownload distribution (namespace with 100 PVs):

| Cycle | worker000-r640 | worker001-r640 | worker002-r640 | worker003-r640 | worker004-r640 | worker005-r640 |
|-------|----------------|----------------|----------------|----------------|----------------|----------------|
| 1 | 5  | 51 | 0  | 17 | 27 | 0  |
| 2 | 24 | 13 | 0  | 22 | 24 | 17 |
| 3 | 28 | 12 | 10 | 23 | 14 | 13 |
| 4 | 21 | 18 | 10 | 18 | 17 | 16 |
| 5 | 15 | 17 | 11 | 21 | 21 | 15 |

5 restore cycles, duration and DataDownload distribution (namespace with 50 PVs):

| Duration | worker000-r640 | worker001-r640 | worker002-r640 | worker003-r640 | worker004-r640 | worker005-r640 |
|----------|----------------|----------------|----------------|----------------|----------------|----------------|
| 0:29:39  | 0 | 29 | 0  | 20 | 1  | 0 |
| 0:17:20  | 0 | 20 | 0  | 14 | 16 | 0 |
| 0:19:32  | 2 | 20 | 11 | 6  | 11 | 0 |
| 0:18:29  | 0 | 14 | 3  | 13 | 18 | 2 |
| 0:14:26  | 1 | 11 | 14 | 8  | 12 | 4 |

Environment:

```
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202303211731-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202303211731-0"
```


sseago commented 10 months ago

Which node agent pod handles a DataUpload or DataDownload is determined by which node the backupPod or restorePod is running on. Currently Velero creates these pods without any particular configuration to restrict or control where they run, so the node distribution is determined by the Kubernetes scheduler, not by Velero. We could consider modifying this via node selectors, affinity, or topology spread constraints -- the latter may be the way to go here.
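
For concreteness, here is a minimal Go sketch of the topology-spread-constraint option: a soft per-node spread attached to the backupPod/restorePod spec so the scheduler tries to even out data-mover pods across nodes. The pod name and the `example.com/data-mover` label are hypothetical illustrations, not Velero's actual pod-building code.

```go
// Illustrative sketch only: how a topology spread constraint could be attached
// to a data-mover backupPod/restorePod spec. Names and labels are hypothetical.
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func withSpreadConstraint(pod *corev1.Pod) *corev1.Pod {
	pod.Spec.TopologySpreadConstraints = append(pod.Spec.TopologySpreadConstraints,
		corev1.TopologySpreadConstraint{
			MaxSkew:           1,                        // tolerate at most a 1-pod difference between nodes
			TopologyKey:       "kubernetes.io/hostname", // spread across individual nodes
			WhenUnsatisfiable: corev1.ScheduleAnyway,    // soft constraint: never block scheduling
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"example.com/data-mover": "true"}, // hypothetical label
			},
		})
	return pod
}

func main() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "backup-pod-example", // hypothetical name
			Labels: map[string]string{"example.com/data-mover": "true"},
		},
	}
	_ = withSpreadConstraint(pod)
}
```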

shawn-hurley commented 10 months ago

Something to consider: the scheduler knows the current constraints on each node better than we do. I would worry that artificially spreading out the resources could cause other issues (like overcommitting a very important node) or something along those lines.

I would be very cautious about getting into the scheduling game, IMO. I think a better option is to work on making each node able to handle more than 1.

As for inconsistent performance results, isn't that pretty indicative of anything running on K8s, i.e. that there is a probable range for performance? Or am I incorrect in this thought process? (This is me learning :) )

sseago commented 10 months ago

@shawn-hurley Hmm, yeah, it may be better to leave this as-is. Looking back at the distribution posted above, it strikes me that many of the runs are actually reasonably well-distributed, although certain nodes end up well below the average. But maybe those nodes were already overcommitted at the time?

As for each node handling more than one at a time, there's already an issue opened for that and it's targeted for 1.13.
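
As an aside, the "more than one per node" idea can be pictured with a simple semaphore; the sketch below is a toy illustration of that concept, not Velero's node-agent code, and the per-node limit of 4 is an arbitrary assumption.

```go
// Toy sketch, not Velero code: bound the number of concurrent
// DataUpload/DataDownload operations a single node agent will run.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const perNodeLimit = 4 // hypothetical per-node concurrency limit
	sem := make(chan struct{}, perNodeLimit)

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks while the node is at its limit
			defer func() { <-sem }() // release the slot when the transfer finishes
			fmt.Printf("processing data operation %d\n", id)
			time.Sleep(100 * time.Millisecond) // stand-in for the actual data transfer
		}(i)
	}
	wg.Wait()
}
```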

Lyndon-Li commented 10 months ago

We have discussed this topic during the initial data mover discussions --- Velero's own load balancer:

  1. On one hand, the Kubernetes scheduler knows the CPU and memory resources well, and it also knows the affinities and topologies, all of which are required by Velero data mover workload distribution.
  2. On the other hand, the Kubernetes scheduler doesn't handle some other requirements that Velero data mover workload distribution cares about, for example network bandwidth usage: if CPU and memory are sufficient on all nodes, the Kubernetes scheduler may assign multiple backup/restore pods to one node, where network bandwidth is probably the bottleneck. Moreover, even when network bandwidth is sufficient, Velero has a per-node concurrency configuration, which the Kubernetes scheduler doesn't consider either.

Therefore, the ultimate solution may be a combination of the Kubernetes scheduler and Velero's own load balancer.
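
To make the "scheduler plus Velero-side supplement" idea concrete, here is a rough sketch of what the supplement might look like: Velero keeps its own count of in-flight data operations per node (something the scheduler doesn't see) and picks the least-loaded node that is still under its configured limit, leaving CPU, memory, affinity, and topology to the scheduler. The node names, limits, and `pickNode` helper are all assumptions for illustration, not existing Velero code.

```go
// Rough sketch of a Velero-side balancer that supplements the Kubernetes
// scheduler. All names and numbers below are hypothetical.
package main

import (
	"fmt"
	"sync"
)

type balancer struct {
	mu       sync.Mutex
	inFlight map[string]int // node name -> running data operations
	limit    map[string]int // node name -> per-node concurrency limit
}

// pickNode returns the least-loaded node still under its limit, or "" if every
// node is saturated (the caller would then queue the DataUpload/DataDownload).
func (b *balancer) pickNode() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	best, bestLoad := "", -1
	for node, load := range b.inFlight {
		if load >= b.limit[node] {
			continue
		}
		if bestLoad == -1 || load < bestLoad {
			best, bestLoad = node, load
		}
	}
	if best != "" {
		b.inFlight[best]++
	}
	return best
}

func main() {
	b := &balancer{
		inFlight: map[string]int{"worker000": 0, "worker001": 2, "worker002": 1},
		limit:    map[string]int{"worker000": 2, "worker001": 2, "worker002": 2},
	}
	// The chosen node could then be pinned via the pod's nodeSelector or
	// affinity, while the scheduler still enforces its own constraints.
	fmt.Println("next data operation goes to:", b.pickNode())
}
```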

sseago commented 10 months ago

@Lyndon-Li If we do our own, we may want to make it configurable -- able to be turned on or off (not sure which the default should be) -- so that if one option performs badly, users can try the other.

Lyndon-Li commented 10 months ago

As mentioned above, we need the capabilities of the Kubernetes scheduler as well as some supplements. Ideally, we make them a combination --- Velero only implements the supplements, and the Kubernetes scheduler works as-is alongside Velero's part. Then we will not need a fallback. Otherwise, if we cannot make them work together and Velero has to implement the Kubernetes scheduler's part as well, then we will need to make it configurable, in case Velero's implementation has bugs or falls out of sync with the latest Kubernetes behavior.

Lyndon-Li commented 9 months ago

Reopening this issue, as #6926 has not completely fixed the problem --- the restore part is not fixed, and even for the backup part there is not as much intelligence in assigning data upload overhead as a LD provides.

Let's keep the issue open for new ideas of fixes.

kaovilai commented 5 months ago

Is Design for data mover node selection #7383 related?

Lyndon-Li commented 5 months ago

> Is Design for data mover node selection #7383 related?

No, #7383 is for node selection (including/excluding nodes), not for spreading VGDP across the nodes.
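
To illustrate the distinction: node selection in the spirit of #7383 restricts which nodes may run VGDP pods at all, for example with a required node affinity as in the hypothetical sketch below (the label key is made up), whereas this issue is about spreading the pods across whichever nodes remain eligible.

```go
// Hypothetical sketch of node *selection* (include/exclude), as opposed to the
// spreading discussed in this issue. The label key is made up for illustration.
package main

import (
	corev1 "k8s.io/api/core/v1"
)

func restrictToDataMoverNodes(pod *corev1.Pod) {
	pod.Spec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "example.com/allow-data-mover", // hypothetical node label
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"true"},
					}},
				}},
			},
		},
	}
}

func main() {
	restrictToDataMoverNodes(&corev1.Pod{})
}
```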