vmware-tanzu / velero

Backup and migrate Kubernetes applications and their persistent volumes
https://velero.io
Apache License 2.0

DataMover - datauploads and datadownloads resources aren't distributed equally among the workers. #6734

Open duduvaa opened 10 months ago

duduvaa commented 10 months ago

What steps did you take and what happened:

While running data mover backup and restore tests, I monitored the DataUpload and DataDownload resources and noticed they weren't distributed equally among the worker nodes. This makes the tests take a long time, and running several cycles of the same test produces inconsistent results.

Another issue: the test does not run at the maximum concurrency (one resource per node).

What did you expect to happen: The resources should be distributed among all workers as evenly as possible.

The following information will help us better understand what's going on:

Anything else you would like to add:

5 backup cycles, duration and DataUpload distribution (namespace with 100 PVs):

| Duration | worker000-r640 | worker001-r640 | worker002-r640 | worker003-r640 | worker004-r640 | worker005-r640 |
|----------|----------------|----------------|----------------|----------------|----------------|----------------|
| 0:23:48  | 3  | 61 | 0  | 4  | 32 | 0  |
| 0:16:39  | 11 | 23 | 14 | 20 | 19 | 13 |
| 0:17:53  | 20 | 17 | 15 | 16 | 9  | 23 |
| 0:18:45  | 24 | 15 | 22 | 17 | 6  | 16 |
| 0:28:39  | 26 | 15 | 20 | 20 | 2  | 17 |

5 restore cycles, DataDownload distribution (namespace with 100 PVs):

| Cycle | worker000-r640 | worker001-r640 | worker002-r640 | worker003-r640 | worker004-r640 | worker005-r640 |
|-------|----------------|----------------|----------------|----------------|----------------|----------------|
| 1 | 5  | 51 | 0  | 17 | 27 | 0  |
| 2 | 24 | 13 | 0  | 22 | 24 | 17 |
| 3 | 28 | 12 | 10 | 23 | 14 | 13 |
| 4 | 21 | 18 | 10 | 18 | 17 | 16 |
| 5 | 15 | 17 | 11 | 21 | 21 | 15 |

5 restore cycles, duration and DataDownload distribution (namespace with 50 PVs):

| Duration | worker000-r640 | worker001-r640 | worker002-r640 | worker003-r640 | worker004-r640 | worker005-r640 |
|----------|----------------|----------------|----------------|----------------|----------------|----------------|
| 0:29:39  | 0 | 29 | 0  | 20 | 1  | 0 |
| 0:17:20  | 0 | 20 | 0  | 14 | 16 | 0 |
| 0:19:32  | 2 | 20 | 11 | 6  | 11 | 0 |
| 0:18:29  | 0 | 14 | 3  | 13 | 18 | 2 |
| 0:14:26  | 1 | 11 | 14 | 8  | 12 | 4 |

Environment:

```
$ cat /etc/os-release
NAME="Red Hat Enterprise Linux CoreOS"
ID="rhcos"
ID_LIKE="rhel fedora"
VERSION="412.86.202303211731-0"
VERSION_ID="4.12"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux CoreOS 412.86.202303211731-0 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::coreos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://docs.openshift.com/container-platform/4.12/"
BUG_REPORT_URL="https://access.redhat.com/labs/rhir/"
REDHAT_BUGZILLA_PRODUCT="OpenShift Container Platform"
REDHAT_BUGZILLA_PRODUCT_VERSION="4.12"
REDHAT_SUPPORT_PRODUCT="OpenShift Container Platform"
REDHAT_SUPPORT_PRODUCT_VERSION="4.12"
OPENSHIFT_VERSION="4.12"
RHEL_VERSION="8.6"
OSTREE_VERSION="412.86.202303211731-0"
```


sseago commented 10 months ago

Which node agent pod handles a DataUpload or DataDownload is determined by which node the backupPod or restorePod is running on. Currently Velero creates these pods without any particular configuration to restrict or control where they run, so the node distribution is determined by the Kubernetes scheduler, not by Velero. We could consider modifying this via node selectors, affinity, or topology spread constraints -- the latter may be the way to go here.
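
For concreteness, here is a minimal Go sketch of the topology-spread-constraint option: a soft per-node spread attached to the backupPod/restorePod spec so the scheduler tries to even out data-mover pods across nodes. The pod name and the `example.com/data-mover` label are hypothetical illustrations, not Velero's actual pod-building code.

```go
// Illustrative sketch only: how a topology spread constraint could be attached
// to a data-mover backupPod/restorePod spec. Names and labels are hypothetical.
package main

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func withSpreadConstraint(pod *corev1.Pod) *corev1.Pod {
	pod.Spec.TopologySpreadConstraints = append(pod.Spec.TopologySpreadConstraints,
		corev1.TopologySpreadConstraint{
			MaxSkew:           1,                        // tolerate at most a 1-pod difference between nodes
			TopologyKey:       "kubernetes.io/hostname", // spread across individual nodes
			WhenUnsatisfiable: corev1.ScheduleAnyway,    // soft constraint: never block scheduling
			LabelSelector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"example.com/data-mover": "true"}, // hypothetical label
			},
		})
	return pod
}

func main() {
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:   "backup-pod-example", // hypothetical name
			Labels: map[string]string{"example.com/data-mover": "true"},
		},
	}
	_ = withSpreadConstraint(pod)
}
```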

shawn-hurley commented 10 months ago

Something to consider: the scheduler knows the current constraints on each node better than we do. I would worry that artificially spreading out the resources could cause other issues (like overcommitting a very important node) or something along those lines.

I would be very cautious about getting into the scheduling game, IMO. I think a better option is to work on making each node able to handle more than 1.

As for inconsistent performance results, isn't that pretty indicative of anything running on K8s, i.e. that there is a probable range for performance? Or am I incorrect in this thought process? (This is me learning :) )

sseago commented 10 months ago

@shawn-hurley Hmm, yeah, it may be better to leave this as-is. Looking back at the distribution posted above, it strikes me that many of the runs are actually reasonably well-distributed, although certain nodes end up well below the average. But maybe those nodes were already overcommitted at the time?

As for each node handling more than one at a time, there's already an issue opened for that and it's targeted for 1.13.
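
As an aside, the "more than one per node" idea can be pictured with a simple semaphore; the sketch below is a toy illustration of that concept, not Velero's node-agent code, and the per-node limit of 4 is an arbitrary assumption.

```go
// Toy sketch, not Velero code: bound the number of concurrent
// DataUpload/DataDownload operations a single node agent will run.
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	const perNodeLimit = 4 // hypothetical per-node concurrency limit
	sem := make(chan struct{}, perNodeLimit)

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot; blocks while the node is at its limit
			defer func() { <-sem }() // release the slot when the transfer finishes
			fmt.Printf("processing data operation %d\n", id)
			time.Sleep(100 * time.Millisecond) // stand-in for the actual data transfer
		}(i)
	}
	wg.Wait()
}
```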

Lyndon-Li commented 10 months ago

We have discussed this topic during the initial data mover discussions --- Velero's own load balancer:

  1. On one hand, the Kubernetes scheduler knows the CPU and memory resources well, and it also knows the affinities and topologies, all of which are required by Velero data mover workload distribution.
  2. On the other hand, the Kubernetes scheduler doesn't handle some other requirements that Velero data mover workload distribution cares about, for example network bandwidth usage: if CPU and memory are sufficient on all nodes, the Kubernetes scheduler may assign multiple backup/restore pods to one node, where network bandwidth is probably the bottleneck. Moreover, even when network bandwidth is sufficient, Velero has a per-node concurrency configuration, which the Kubernetes scheduler doesn't consider either.

Therefore, the ultimate solution may be a combination of the Kubernetes scheduler and Velero's own load balancer.
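
To make the "scheduler plus Velero-side supplement" idea concrete, here is a rough sketch of what the supplement might look like: Velero keeps its own count of in-flight data operations per node (something the scheduler doesn't see) and picks the least-loaded node that is still under its configured limit, leaving CPU, memory, affinity, and topology to the scheduler. The node names, limits, and `pickNode` helper are all assumptions for illustration, not existing Velero code.

```go
// Rough sketch of a Velero-side balancer that supplements the Kubernetes
// scheduler. All names and numbers below are hypothetical.
package main

import (
	"fmt"
	"sync"
)

type balancer struct {
	mu       sync.Mutex
	inFlight map[string]int // node name -> running data operations
	limit    map[string]int // node name -> per-node concurrency limit
}

// pickNode returns the least-loaded node still under its limit, or "" if every
// node is saturated (the caller would then queue the DataUpload/DataDownload).
func (b *balancer) pickNode() string {
	b.mu.Lock()
	defer b.mu.Unlock()
	best, bestLoad := "", -1
	for node, load := range b.inFlight {
		if load >= b.limit[node] {
			continue
		}
		if bestLoad == -1 || load < bestLoad {
			best, bestLoad = node, load
		}
	}
	if best != "" {
		b.inFlight[best]++
	}
	return best
}

func main() {
	b := &balancer{
		inFlight: map[string]int{"worker000": 0, "worker001": 2, "worker002": 1},
		limit:    map[string]int{"worker000": 2, "worker001": 2, "worker002": 2},
	}
	// The chosen node could then be pinned via the pod's nodeSelector or
	// affinity, while the scheduler still enforces its own constraints.
	fmt.Println("next data operation goes to:", b.pickNode())
}
```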

sseago commented 10 months ago

@Lyndon-Li If we do our own, we may want to make it configurable -- able to be turned on or off (not sure which the default should be) -- so that if one option performs badly, users can try the other.

Lyndon-Li commented 10 months ago

As mentioned above, we need the capabilities of the Kubernetes scheduler as well as some supplements. Ideally, we make them a combination --- Velero only implements the supplements, and the Kubernetes scheduler works as-is alongside Velero's part. Then we will not need a fallback. Otherwise, if we cannot make them work together and Velero has to implement the Kubernetes scheduler's part as well, then we will need to make it configurable, in case Velero's implementation has bugs or falls out of sync with the latest Kubernetes behavior.

Lyndon-Li commented 9 months ago

Reopening this issue, as #6926 has not completely fixed the problem --- the restore part is not fixed, and even for the backup part there is not as much intelligence in assigning data upload overhead as a LD provides.

Let's keep the issue open for new ideas of fixes.

kaovilai commented 5 months ago

Is Design for data mover node selection #7383 related?

Lyndon-Li commented 5 months ago

> Is Design for data mover node selection #7383 related?

No, #7383 is for node selection (including/excluding nodes), not for spreading VGDP across the nodes.
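
To illustrate the distinction: node selection in the spirit of #7383 restricts which nodes may run VGDP pods at all, for example with a required node affinity as in the hypothetical sketch below (the label key is made up), whereas this issue is about spreading the pods across whichever nodes remain eligible.

```go
// Hypothetical sketch of node *selection* (include/exclude), as opposed to the
// spreading discussed in this issue. The label key is made up for illustration.
package main

import (
	corev1 "k8s.io/api/core/v1"
)

func restrictToDataMoverNodes(pod *corev1.Pod) {
	pod.Spec.Affinity = &corev1.Affinity{
		NodeAffinity: &corev1.NodeAffinity{
			RequiredDuringSchedulingIgnoredDuringExecution: &corev1.NodeSelector{
				NodeSelectorTerms: []corev1.NodeSelectorTerm{{
					MatchExpressions: []corev1.NodeSelectorRequirement{{
						Key:      "example.com/allow-data-mover", // hypothetical node label
						Operator: corev1.NodeSelectorOpIn,
						Values:   []string{"true"},
					}},
				}},
			},
		},
	}
}

func main() {
	restrictToDataMoverNodes(&corev1.Pod{})
}
```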