wojtek-t opened this issue 4 years ago
Nothing interesting happened in either the test-infra or the perf-tests repo at that time...
Interesting. It looks like we had a similar drop in load-test pod startup, but it got back to normal after a few runs and it doesn't coincide in time at all: http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=E2E&metricname=LoadPodStartup&Metric=pod_startup
Strange...
I checked https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1335992827380764672/ where we see 57s latency for stateless pods alone.
Sample slow pod:
{test-xrxpvb-40/small-deployment-63-677d88d774-xgj2d 1m5.155890502s}
Timeline:
Created at 18:05:11.015687
I1207 18:05:11.015687 11 event.go:291] "Event occurred" object="test-xrxpvb-40/small-deployment-63-677d88d774" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: small-deployment-63-677d88d774-xgj2d"
Scheduled at 18:06:10.760230
I1207 18:06:10.760230 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-40/small-deployment-63-677d88d774-xgj2d" node="gce-scale-cluster-minion-group-2-v779" evaluatedNodes=500 feasibleNodes=500
Running at 18:06:15.139943
kube-apiserver.log-20201207-1607369112.gz:I1207 18:06:15.139943 11 httplog.go:89] "HTTP" verb="PATCH" URI="/api/v1/namespaces/test-xrxpvb-40/pods/small-deployment-63-677d88d774-xgj2d/status" latency="7.568642ms" userAgent="kubelet/v1.21.0 (linux/amd64) kubernetes/e1c617a" srcIP="10.40.7.99:35600" resp=200
So it looks like most of the time was spent waiting on kube-scheduler, but the latency on the kubelet side was also higher than expected (~5s).
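For reference, a minimal sketch (not part of the test harness) of how this breakdown can be computed from the klog timestamps quoted above; splitting the total into a "scheduling" and a "kubelet" segment is my interpretation of the three events:

```go
package main

import (
	"fmt"
	"time"
)

// parseKlog parses a klog-style timestamp (MMDD HH:MM:SS.ffffff, no year).
func parseKlog(ts string) time.Time {
	t, err := time.Parse("0102 15:04:05.000000", ts)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	created := parseKlog("1207 18:05:11.015687")   // ReplicaSet "SuccessfulCreate" event
	scheduled := parseKlog("1207 18:06:10.760230") // "Successfully bound pod to node"
	running := parseKlog("1207 18:06:15.139943")   // kubelet PATCHes the pod status

	fmt.Println("scheduling:", scheduled.Sub(created)) // ~59.7s waiting for kube-scheduler
	fmt.Println("kubelet:   ", running.Sub(scheduled)) // ~4.4s on the node
	fmt.Println("total:     ", running.Sub(created))   // ~1m4s, close to the reported 1m5.16s
}
```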
In the kube-scheduler logs I see a block of daemonset scheduling events right before small-deployment-63-677d88d774-xgj2d was created.
I1207 18:05:03.359461 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-79d6p" node="gce-scale-cluster-minion-group-4-pvnb" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.394203 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-vtjxw" node="gce-scale-cluster-minion-group-3-srr7" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.411793 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-hpcmd" node="gce-scale-cluster-minion-group-4-w7r7" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.425993 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-62xjz" node="gce-scale-cluster-minion-group-3-w142" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.432111 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-zvnxl" node="gce-scale-cluster-minion-group-3-61mt" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.446798 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-czxw7" node="gce-scale-cluster-minion-group-1-z28j" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.457188 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-xw4g5" node="gce-scale-cluster-minion-group-1-bfqp" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.484511 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-5cff4" node="gce-scale-cluster-minion-group-bj2t" evaluatedNodes=5001 feasibleNodes=1
(...)
I1207 18:06:04.470902 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-flv5s" node="gce-scale-cluster-minion-group-1-2r82" evaluatedNodes=5001 feasibleNodes=1
I1207 18:06:04.481599 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-zmvzg" node="gce-scale-cluster-minion-group-f2zv" evaluatedNodes=5001 feasibleNodes=1
I1207 18:06:04.493779 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-sczfs" node="gce-scale-cluster-minion-group-3-gd6s" evaluatedNodes=5001 feasibleNodes=1
I1207 18:06:04.505584 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-k5zkx" node="gce-scale-cluster-minion-group-3-3437" evaluatedNodes=5001 feasibleNodes=1
I1207 18:06:04.513842 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-bfq96" node="gce-scale-cluster-minion-group-1-0vjk" evaluatedNodes=5001 feasibleNodes=1
So between 18:05:03 and 18:06:04 kube-scheduler was scheduling only daemonsets.
So it looks like, because the replicaset-controller and the daemonset-controller have separate rate limiters for API calls, they can create pods faster (200 qps combined) than kube-scheduler is able to schedule them (100 qps), and they managed to build up an O(minute) backlog of work that slowed down binding of the "normal" pods.
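A rough back-of-envelope check (assuming the daemonset needs roughly one pod per node, i.e. ~5000 pods, and the scheduler binds at ~100 pods/s):

$$ t_{\text{drain}} \approx \frac{5000\ \text{pods}}{100\ \text{pods/s}} = 50\ \text{s} $$

which roughly matches the ~60s window (18:05:03 - 18:06:04) in which only daemonset pods were being bound.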
This is a great finding. Given we already have P&F enabled, I'm wondering if we can consider bumping the scheduler QPS limit in our 5k-node job to 200 to accommodate those scenarios (without changing anything on the controller-manager side). This would also allow us to validate this further.
@mborsz - once https://github.com/kubernetes/kubernetes/issues/97798 is debugged and fixed, WDYT about this?
Another important bit of information here is that the daemonset in the load test has its own priorityClass (https://github.com/kubernetes/perf-tests/blob/6aa08f8817fd347b3ccf4d18d29260ce2f57a0a1/clusterloader2/testing/load/daemonset-priorityclass.yaml). This is why the daemonset pods starve other pods during the scheduling phase.
I believe the main issue here is that one of the load test assumptions is that it should create pods with a given throughput (set via LOAD_TEST_THROUGHPUT). This assumption stopped being true when we introduced daemonsets. What's more, given that daemonsets have higher priority than any other pods in the tests, it's really hard to create/update them in parallel with other pods without running into this kind of issue.
I discussed this with Wojtek, and I believe we both agree that it might make sense to move the daemonset operations to a separate CL2 step. Because steps are executed serially, this will stop creating/updating daemonsets in parallel with other pods. It might make the load test minimally slower, but it will definitely help with the issue that Maciek found.
[ ] Move the daemonset creation phase to a separate step. We'll also need to wait for the daemonset pods to be created before we start creating other objects. To summarize, the current flow is:
It should be changed to:
[ ] Do the same for the daemonset rolling upgrade phase - https://github.com/kubernetes/perf-tests/blob/ece6960810e96260f6b99443c0280de4d384b13b/clusterloader2/testing/load/config.yaml#L598-L608
/good-first-issue /help-wanted
@mm4tt: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
I was wondering why the actual graphs in perf-dash are counter-intuitive and I filed #2006 for that. This explains why the phases that involve schedule time seem much better than they are in reality.
The findings from the experiment with increased scheduling throughput:
With https://github.com/kubernetes/kubernetes/pull/108648 merged, we have mitigated the problem described above.
The outcome is that our pod-startup latency drastically dropped as visible on the graph:
We're now at the point where, for stateless pods (stateful ones are expected to take longer to start in many cases), our 99th percentile is generally around 6.5s.
I took a brief look into the last run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1505590326146895872
And I skimmed through 20 slow-to-start pods. All of them seem to be exhibiting exactly the same pattern, e.g. one group of slow pods:
I0320 19:05:34.406141 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-28/medium-deployment-15-778d5d76f6-r9t6s" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=501 feasibleNodes=500
I0320 19:05:34.950690 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-24/small-deployment-48-5d8c899cfb-tvjjd" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=500 feasibleNodes=500
I0320 19:05:35.190715 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-32/medium-deployment-8-6787c8f556-cbfjg" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=501 feasibleNodes=500
I0320 19:05:35.224297 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-46/big-deployment-1-54559d54b4-5pzjp" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=501 feasibleNodes=500
I0320 19:05:35.338402 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-1/daemonset-0-24ntx" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=1 feasibleNodes=1
So we're effectively scheduling 5 pods on the same node within <1s. This in turn results in slow operations at the containerd level, e.g.:
time="2022-03-20T19:05:35.533055508Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:medium-deployment-15-778d5d76f6-r9t6s,Uid:63ab1c9f-0cd8-4073-9676-26c32ac6a57c,Namespace:test-htamfw-28,Attempt:0,}"
time="2022-03-20T19:05:44.879770896Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:medium-deployment-15-778d5d76f6-r9t6s,Uid:63ab1c9f-0cd8-4073-9676-26c32ac6a57c,Namespace:test-htamfw-28,Attempt:0,} returns sandbox id \"ad6a4e66b7a7f4f203531a162b56153eb4fdadd7756b47cc1faaf2fb3c8495ef\""
[so we waited almost 10s for creation of pod-sandbox]
OTOH - we know from the containerd node-throughput test that containerd is able to keep up with a throughput of 5 pods/s: https://perf-dash.k8s.io/#/?jobname=containerd-node-throughput&metriccategoryname=E2E&metricname=PodStartup&Metric=pod_startup
In our case, the problem is purely resource-starvation at the node level (with containerd not having enough resources to perform its job fast enough, especially given that kube-proxy can consume a lot of resources too).
If we want to bring pod startup in our tests below 5s, we more or less have two options: 1) make the nodes bigger [which would be really expensive, especially given that we would have to bump all tests, including presubmits, etc.]; 2) somehow ensure that the churn on a given node will not be that high...
There is still the question of what caused the scheduler to put 5 pods on the same node so quickly - we know that one other pod was deleted right before that, but that should only result in 1 more pod jumping into its place. It's unclear to me why we got 5 pods scheduled to the same node within that short period (there were ~150 pods scheduled during that time).
A potentially useful piece of information is that this was also the time when the daemonset was being updated.
I took a brief look into the last run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1505590326146895872
where can I find the scheduler logs?
And I skimmed through 20 slow-to-start pods. All of them seem to be exhibiting exactly the same pattern, e.g. one group of slow pods:
Perhaps they were placed on a newly scaled-up node? By default the preference is to select nodes with lower usage.
where can I find the scheduler logs?
@ahg-g - you click on "Master and node logs" towards the bottom of the page, then on the GCS link, and then on the master - the logs are here: https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1505590326146895872/gce-scale-cluster-master/
Perhaps they were placed on a newly scaled-up node? By default the preference is to select nodes with lower usage.
Nope - this cluster has a static size.
I looked a bit more into it, and all the long-starting pods I've seen were slow because a bunch of pods were starting on the same node at the same time. In other words - improving spreading should visibly help the situation here.
So the question is why we're spreading the pods so unevenly here (e.g. in the last 5k-node run, in the scheduling-throughput phase the 5k pods were scheduled on ~2850 nodes). And technically ~all nodes should be feasible.
Assuming I didn't miss anything above, it has to be related to the scoring functions - we have the following:
The above doesn't hold for the "create/update" phase, where not all pods are from the same ownerRef. In this situation, the first two would generally be the dominating ones. While we won't get perfect spreading here, having all pods consume an equal percentage of the node's RAM and CPU could help a lot (see the worked example below).
One could argue that this is tweaking the test to make it pass, but in reality we would just use bigger nodes to avoid resource starvation, and we don't want to do that here in order to reduce costs.
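For reference, the "equal percentage of RAM and CPU" condition simply means that the per-pod requests are in the same ratio as the node's allocatable resources (here $r_{cpu}, r_{mem}$ are the pod requests and $C, M$ the node allocatable; my notation, not from the test):

$$ \frac{r_{cpu}}{C} = \frac{r_{mem}}{M} $$

For instance, on a hypothetical 1000 mcpu / 4000 MB node the test's 5m/20MB requests satisfy this exactly ($5/1000 = 20/4000 = 0.5\%$), so both fractions would grow at the same rate as pods are added.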
I played a bit more with this in https://github.com/kubernetes/kubernetes/pull/109067
In particular this run of 5k presubmit: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109067/pull-kubernetes-e2e-gce-scale-performance-manual/1517436913894559744
Let's focus purely on the e2e-a6cf091dfb-bf35d-minion-group-1-75b5 node from that run.
For completeness, here are the pods that were scheduled onto that node throughout this run:
wojtekt@wojtekt-l:~/Downloads$ cat out.txt | grep -n "75b5"
1736:I0422 10:03:19.785536 11 schedule_one.go:264] "Successfully bound pod to node" pod="kube-system/metadata-proxy-v0.1-zvzhq" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
2062:I0422 10:03:22.952746 11 schedule_one.go:264] "Successfully bound pod to node" pod="kube-system/konnectivity-agent-jwhlp" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
3888:I0422 10:04:11.467082 11 schedule_one.go:264] "Successfully bound pod to node" pod="kube-system/coredns-6456b64994-l6mdc" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=674 feasibleNodes=674
14455:I0422 10:35:10.779761 11 schedule_one.go:264] "Successfully bound pod to node" pod="gce-pd-csi-driver/csi-gce-pd-node-8jtzn" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
77278:I0422 11:08:22.205770 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-45/medium-deployment-16-76bb9857d7-4ll2r" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
96733:I0422 11:15:47.509886 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-50/small-deployment-14-7d85df95c5-rtb46" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
102709:I0422 11:17:20.890787 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-40/small-deployment-95-6457576f46-c9djr" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
110619:I0422 11:19:46.709317 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-29/big-deployment-0-5789f8949c-v8t4r" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
114933:I0422 11:21:10.480875 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-46/big-deployment-0-5789f8949c-d56l4" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
117109:I0422 11:21:58.200584 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-16/small-deployment-122-86cb4bb847-w6m7j" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
127985:I0422 11:25:18.870483 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-24/medium-deployment-7-7d48959b7-bsqf6" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
129546:I0422 11:25:29.590314 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-1/daemonset-0-9zp4t" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
132280:I0422 11:25:50.848616 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-42/medium-deployment-17-6bf8597b66-shmvd" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
134716:I0422 11:26:45.950966 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-25/big-deployment-1-7c65455464-n9sdg" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
136920:I0422 11:27:33.499735 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-19/small-deployment-277-86667b5978-8c2rs" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
154684:I0422 11:32:55.710970 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-2/small-deployment-136-555b89c9dd-n4x5j" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
154775:I0422 11:32:58.614263 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-27/medium-deployment-18-797999c589-hw92h" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
157747:I0422 11:34:02.462108 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-46/big-deployment-1-7c65455464-xt4sx" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
157983:I0422 11:34:05.108176 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-42/small-deployment-270-579bbd889-b46fs" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158046:I0422 11:34:05.735784 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-46/small-deployment-212-677f957b89-zzrbz" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158073:I0422 11:34:06.011768 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-43/medium-deployment-17-6bf8597b66-b5gjn" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158164:I0422 11:34:07.269574 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-23/big-deployment-0-5789f8949c-77nlk" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158892:I0422 11:34:17.518477 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-6/medium-deployment-10-8d689bcd9-gkcn9" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
172077:I0422 11:52:31.584073 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-51/scheduler-throughput-deployment-2-54794b6b6-kwlkx" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
172104:I0422 11:52:31.816736 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-51/scheduler-throughput-deployment-0-7dc4b8f7cd-4rxvm" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
174159:I0422 11:59:48.878411 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-5/latency-deployment-63-655d645594-2c7p7" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
187413:I0422 12:36:12.706379 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-1/daemonset-0-8zvc9" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
In particular, what was causing the problem was this part of the above:
157983:I0422 11:34:05.108176 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-42/small-deployment-270-579bbd889-b46fs" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158046:I0422 11:34:05.735784 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-46/small-deployment-212-677f957b89-zzrbz" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158073:I0422 11:34:06.011768 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-43/medium-deployment-17-6bf8597b66-b5gjn" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158164:I0422 11:34:07.269574 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-23/big-deployment-0-5789f8949c-77nlk" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
where we scheduled 4 pods within 2s on the same node (in that period of time we scheduled ~180 pods in total, so having 4 of them land on the same node is a bit unexpected).
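A quick sanity check on how unexpected this is (assuming placement were roughly uniform over the 549 feasible nodes): the number of those ~180 pods landing on one given node would be approximately Poisson with

$$ \lambda = \frac{180}{549} \approx 0.33, \qquad P(X \ge 4) \approx e^{-\lambda}\frac{\lambda^{4}}{4!} \approx 3\cdot 10^{-4}, $$

so even across all 549 nodes we'd expect essentially no node to receive 4 of them, suggesting the clustering comes from scoring rather than chance.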
So in that PR I added extended logging (which kills the test, but it was only meant to help with debugging). In particular, what caught my attention was this:
LLL: pod=test-92du0z-42/small-deployment-270-579bbd889-b46fs node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=72 TS=200 BA=96
LLL: pod=test-92du0z-46/small-deployment-212-677f957b89-zzrbz node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=71 TS=200 BA=97
LLL: pod=test-92du0z-43/medium-deployment-17-6bf8597b66-b5gjn node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=71 TS=200 BA=97
LLL: pod=test-92du0z-23/big-deployment-0-5789f8949c-77nlk node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=70 TS=200 BA=97
The NR and TS scores look reasonable to me, but BA (BalancedAllocation) is not what I expect.
In fact, all pods from all the deployments we have request cpu=5m and mem=20MB: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/deployment.yaml#L4-L5 https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/daemonset.yaml#L34-L35
So assuming that the initial pods (system addons and daemons, e.g. coreDNS, csi, ...) occupy A mcpu and B MB of RAM, then basically with n additional test pods we get:
cpu_requested = A + n*5
ram_requested = B + n*20
According to the code computing the balanced-allocation score: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/noderesources/balanced_allocation.go#L115-L117
with n test pods we should get
std = | ((A + 5*n)/node_cpu - (B + 20*n)/node_ram) / 2 |
For e2-medium machines I thought we have 1000 mcpu and 4000 MB RAM, which would give us:
std = | ((A + 5*n)/1000 - (B + 20*n)/4000) / 2 | = | ((4*A + 20*n)/4000 - (B + 20*n)/4000) / 2 | = | (4*A + 20*n - (B + 20*n)) / 8000 | = | (4*A - B) / 8000 |
so a constant, independent of the number of additional test pods.
Whereas when grepping through logs for that, I'm getting very different values of BA, e.g.:
wojtekt@wojtekt-l:~/Downloads$ cat kube-scheduler.log | grep LLL | grep "75b5" | grep BA
...
LLL: pod=kube-system/coredns-6456b64994-gbqm9 node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=87 TS=200 BA=94
...
LLL: pod=probes/ping-server-7f5864b8c6-ppvn9 node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=73 TS=200 BA=93
...
LLL: pod=test-92du0z-25/small-deployment-209-678f5b4577-5fql8 node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=77 TS=200 BA=95
...
LLL: pod=test-92du0z-23/big-deployment-0-5789f8949c-dkb7z node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=70 TS=180 BA=97
...
LLL: pod=test-92du0z-3/latency-deployment-53-77574d7dfb-h6k9m node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=66 TS=200 BA=98
...
so very different scores.
So the next step is to validate whether the assumption about node allocatable is actually correct. If so, there seems to be some bug in the scheduler...
OK - so the assumption above is incorrect - allocatable from the e2-medium node:
I0425 13:29:01.356114 14825 cluster.go:86] Name: e2e-2053-62db2-minion-group-5rzq, clusterIP: 10.40.0.24, externalIP: , isSchedulable: true, allocatable: v1.ResourceList{"attachable-volumes-gce-pd":resource.Quantity{i:resource.int64Amount{value:15, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"15", Format:"DecimalSI"}, "cpu":resource.Quantity{i:resource.int64Amount{value:2, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"2", Format:"DecimalSI"}, "ephemeral-storage":resource.Quantity{i:resource.int64Amount{value:43554137631, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"43554137631", Format:"DecimalSI"}, "hugepages-2Mi":resource.Quantity{i:resource.int64Amount{value:0, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"0", Format:"DecimalSI"}, "memory":resource.Quantity{i:resource.int64Amount{value:3866206208, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}, "pods":resource.Quantity{i:resource.int64Amount{value:110, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"110", Format:"DecimalSI"}}
So basically:
So continuing the above:
HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:250Mi Percentage:0}
which effectively reduces the machine size by 250MB
So what is happening here is that currently the "BalancedAllocation" score goes in the opposite direction to "NodeResourcesFit" on many nodes (potentially leading to scheduling multiple pods on the same node within a short period of time).
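To illustrate the direction conflict numerically, here's a rough sketch using the BalancedAllocation formula above and a least-allocated-style resource score. The allocatable values (2000 mcpu, ~3866 MB) come from the node dump above; the baseline "system pods" requests (900 mcpu / 600 MB) are purely hypothetical, chosen only to show the trend:

```go
package main

import (
	"fmt"
	"math"
)

const (
	allocCPU = 2000.0 // mcpu, from the e2-medium node dump above
	allocMem = 3866.0 // MB, from the e2-medium node dump above

	baseCPU = 900.0 // hypothetical system-pod CPU requests (mcpu)
	baseMem = 600.0 // hypothetical system-pod memory requests (MB)

	podCPU = 5.0  // mcpu requested per test pod
	podMem = 20.0 // MB requested per test pod
)

// scores returns a BalancedAllocation-like and a least-allocated-like score
// (both in 0..100) for a node running n test pods on top of the baseline.
func scores(n float64) (ba, la float64) {
	cpuFrac := (baseCPU + n*podCPU) / allocCPU
	memFrac := (baseMem + n*podMem) / allocMem

	std := math.Abs(cpuFrac-memFrac) / 2 // std of two values
	ba = (1 - std) * 100

	la = (1 - (cpuFrac+memFrac)/2) * 100 // prefer less-utilized nodes
	return ba, la
}

func main() {
	for _, n := range []float64{0, 20, 60, 100} {
		ba, la := scores(n)
		fmt.Printf("n=%3.0f  BalancedAllocation~%.1f  LeastAllocated~%.1f\n", n, ba, la)
	}
	// As the node fills up, the two fractions converge, so BalancedAllocation
	// goes up while the least-allocated score goes down - opposite directions.
}
```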
The simplest way to mitigate this is to adjust the allocatable CPU by setting kubeReserved/systemReserved (the full 2 CPUs aren't really available anyway).
OK - so the above moved us a long way (in fact we got down to 5s for the 100th percentile based on the last two runs of the 5k-node test).
That said, we're still high on the main "pod startup latency".
I took a quick look at the last run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1519723567455932416
I looked at O(10) pods with the largest startup_time and almost all of them didn't have any other pods starting on the same node at the same time. So the problem no longer seems to be scheduling-related.
Picking one of the pods, I'm seeing the following in kubelet logs:
I0428 19:10:25.026495 1680 kubelet.go:2060] "SyncLoop ADD" source="api" pods=[test-4i97xb-45/medium-deployment-20-7857898896-bddht]
...
E0428 19:10:26.534520 1680 projected.go:192] Error preparing data for projected volume kube-api-access-b4wph for pod test-4i97xb-45/medium-deployment-20-7857898896-bddht: [failed to fetch token: serviceaccounts "default" is forbidden: User "system:node:gce-scale-cluster-minion-group-ph7w" cannot create resource "serviceaccounts/token" in API group "" in the namespace "test-4i97xb-45": no relationship found between node 'gce-scale-cluster-minion-group-ph7w' and this object, failed to sync configmap cache: timed out waiting for the condition]
...
E0428 19:10:28.025197 1680 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/configmap/87f89c3c-3837-4bca-a2d6-c5430cfc411a-configmap podName:87f89c3c-3837-4bca-a2d6-c5430cfc411a nodeName:}" failed. No retries permitted until 2022-04-28 19:10:29.02516469 +0000 UTC m=+7114.679135177 (durationBeforeRetry 1s). Error: MountVolume.SetUp failed for volume "configmap" (UniqueName: "kubernetes.io/configmap/87f89c3c-3837-4bca-a2d6-c5430cfc411a-configmap") pod "medium-deployment-20-7857898896-bddht" (UID: "87f89c3c-3837-4bca-a2d6-c5430cfc411a") : failed to sync configmap cache: timed out waiting for the condition
...
I0428 19:10:29.026588 1680 operation_generator.go:703] "MountVolume.SetUp succeeded for volume \"configmap\" (UniqueName: \"kubernetes.io/configmap/87f89c3c-3837-4bca-a2d6-c5430cfc411a-configmap\") pod \"medium-deployment-20-7857898896-bddht\" (UID: \"87f89c3c-3837-4bca-a2d6-c5430cfc411a\") " pod="test-4i97xb-45/medium-deployment-20-7857898896-bddht"
It took longer than expected later too, but that may partially be a consequence of some backoffs or similar.
Looking into other pods, e.g. medium-deployment-2-c99c86f6b-ch699, gives pretty much the same picture.
Initially I was finding pods scheduled around the same time, but the problem isn't limited to a single time window. As an example, for the pod below we're also getting the same:
I0428 18:20:22.921344 1686 kubelet.go:2060] "SyncLoop ADD" source="api" pods=[test-4i97xb-2/small-deployment-62-6db75944b4-jz2gh]
...
E0428 18:20:24.194978 1686 projected.go:192] Error preparing data for projected volume kube-api-access-gbjrx for pod test-4i97xb-2/small-deployment-62-6db75944b4-jz2gh: [failed to fetch token: serviceaccounts "default" is forbidden: User "system:node:gce-scale-cluster-minion-group-ds35" cannot create resource "serviceaccounts/token" in API group "" in the namespace "test-4i97xb-2": no relationship found between node 'gce-scale-cluster-minion-group-ds35' and this object, failed to sync configmap cache: timed out waiting for the condition]
...
I0428 18:20:27.421608 1686 operation_generator.go:703] "MountVolume.SetUp succeeded for volume \"secret\" (UniqueName: \"kubernetes.io/secret/7739fba6-34f5-46be-9f6c-79c9f8477a5d-secret\") pod \"small-deployment-62-6db75944b4-jz2gh\" (UID: \"7739fba6-34f5-46be-9f6c-79c9f8477a5d\") " pod="test-4i97xb-2/small-deployment-62-6db75944b4-jz2gh"
So it seems that latency in the node-authorizer is the biggest bottleneck for pod startup. I will try to look a bit into it.
With the PR decreasing the indexing threshold in the node-authorizer, this run: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109067/pull-kubernetes-e2e-gce-scale-performance-manual/1521023813385457664
no longer seems to have those errors, but it didn't really move the needle... It now seems to mostly boil down to slow nodes on the container-runtime side...
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/remove-lifecycle rotten /reopen
@wojtek-t: Reopened this issue.
/remove-lifecycle stale /lifecycle frozen
PodStartup from SaturationPhase significantly dropped recently:
This seems to be the diff between those two runs: https://github.com/kubernetes/kubernetes/compare/bded41a81...1700acb03
We should understand why that happened; if this is expected (and not a bug somewhere), it would potentially allow us to achieve our long-standing goal of getting rid of the latency phase from the density test completely.
@mm4tt @mborsz - FYI