wojtek-t opened this issue 4 years ago
Nothing interesting happened in either the test-infra or the perf-tests repo at that time...
Interesting. It looks like we had a similar drop in load-test pod startup, but it got back to normal after a few runs and it doesn't coincide in time at all: http://perf-dash.k8s.io/#/?jobname=gce-5000Nodes&metriccategoryname=E2E&metricname=LoadPodStartup&Metric=pod_startup
Strange...
I checked https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1335992827380764672/ where we see 57s latency for stateless pods alone.
Sample slow pod:
{test-xrxpvb-40/small-deployment-63-677d88d774-xgj2d 1m5.155890502s}
Timeline:
Created at 18:05:11.015687
I1207 18:05:11.015687 11 event.go:291] "Event occurred" object="test-xrxpvb-40/small-deployment-63-677d88d774" kind="ReplicaSet" apiVersion="apps/v1" type="Normal" reason="SuccessfulCreate" message="Created pod: small-deployment-63-677d88d774-xgj2d"
Scheduled at 18:06:10.760230
I1207 18:06:10.760230 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-40/small-deployment-63-677d88d774-xgj2d" node="gce-scale-cluster-minion-group-2-v779" evaluatedNodes=500 feasibleNodes=500
Running at 18:06:15.139943
kube-apiserver.log-20201207-1607369112.gz:I1207 18:06:15.139943 11 httplog.go:89] "HTTP" verb="PATCH" URI="/api/v1/namespaces/test-xrxpvb-40/pods/small-deployment-63-677d88d774-xgj2d/status" latency="7.568642ms" userAgent="kubelet/v1.21.0 (linux/amd64) kubernetes/e1c617a" srcIP="10.40.7.99:35600" resp=200
So it looks like most of the time was spent waiting on kube-scheduler, but the latency on the kubelet side was also higher than expected (~5s).
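For reference, a minimal sketch (not part of the test harness) of how this breakdown can be computed from the klog timestamps quoted above; splitting the total into a "scheduling" and a "kubelet" segment is my interpretation of the three events:

```go
package main

import (
	"fmt"
	"time"
)

// parseKlog parses a klog-style timestamp (MMDD HH:MM:SS.ffffff, no year).
func parseKlog(ts string) time.Time {
	t, err := time.Parse("0102 15:04:05.000000", ts)
	if err != nil {
		panic(err)
	}
	return t
}

func main() {
	created := parseKlog("1207 18:05:11.015687")   // ReplicaSet "SuccessfulCreate" event
	scheduled := parseKlog("1207 18:06:10.760230") // "Successfully bound pod to node"
	running := parseKlog("1207 18:06:15.139943")   // kubelet PATCHes the pod status

	fmt.Println("scheduling:", scheduled.Sub(created)) // ~59.7s waiting for kube-scheduler
	fmt.Println("kubelet:   ", running.Sub(scheduled)) // ~4.4s on the node
	fmt.Println("total:     ", running.Sub(created))   // ~1m4s, close to the reported 1m5.16s
}
```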
In the kube-scheduler logs I see a block of daemonset scheduling events right before small-deployment-63-677d88d774-xgj2d was created.
I1207 18:05:03.359461 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-79d6p" node="gce-scale-cluster-minion-group-4-pvnb" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.394203 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-vtjxw" node="gce-scale-cluster-minion-group-3-srr7" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.411793 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-hpcmd" node="gce-scale-cluster-minion-group-4-w7r7" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.425993 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-62xjz" node="gce-scale-cluster-minion-group-3-w142" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.432111 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-zvnxl" node="gce-scale-cluster-minion-group-3-61mt" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.446798 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-czxw7" node="gce-scale-cluster-minion-group-1-z28j" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.457188 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-xw4g5" node="gce-scale-cluster-minion-group-1-bfqp" evaluatedNodes=5001 feasibleNodes=1
I1207 18:05:03.484511 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-5cff4" node="gce-scale-cluster-minion-group-bj2t" evaluatedNodes=5001 feasibleNodes=1
(...)
I1207 18:06:04.470902 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-flv5s" node="gce-scale-cluster-minion-group-1-2r82" evaluatedNodes=5001 feasibleNodes=1
I1207 18:06:04.481599 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-zmvzg" node="gce-scale-cluster-minion-group-f2zv" evaluatedNodes=5001 feasibleNodes=1
I1207 18:06:04.493779 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-sczfs" node="gce-scale-cluster-minion-group-3-gd6s" evaluatedNodes=5001 feasibleNodes=1
I1207 18:06:04.505584 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-k5zkx" node="gce-scale-cluster-minion-group-3-3437" evaluatedNodes=5001 feasibleNodes=1
I1207 18:06:04.513842 11 scheduler.go:592] "Successfully bound pod to node" pod="test-xrxpvb-1/daemonset-0-bfq96" node="gce-scale-cluster-minion-group-1-0vjk" evaluatedNodes=5001 feasibleNodes=1
So between 18:05:03 and 18:06:04 kube-scheduler was scheduling only daemonsets.
So it looks like, because the replicaset-controller and the daemonset-controller have separate rate limiters for API calls, they can create pods faster (200 qps combined) than kube-scheduler is able to schedule them (100 qps), and they managed to build up an O(minute) backlog of work that slowed down binding of the "normal" pods.
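A rough back-of-envelope check (assuming the daemonset needs roughly one pod per node, i.e. ~5000 pods, and the scheduler binds at ~100 pods/s):

$$ t_{\text{drain}} \approx \frac{5000\ \text{pods}}{100\ \text{pods/s}} = 50\ \text{s} $$

which roughly matches the ~60s window (18:05:03 - 18:06:04) in which only daemonset pods were being bound.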
This is a great finding. Given we already have P&F enabled, I'm wondering if we can consider bumping the scheduler QPS limit in our 5k-node job to 200 to accommodate those scenarios (without changing anything on the controller-manager side). This would also allow us to validate this further.
@mborsz - once https://github.com/kubernetes/kubernetes/issues/97798 is debugged and fixed, WDYT about this?
Another important bit of information here is that the daemonset in the load test has its own priorityClass (https://github.com/kubernetes/perf-tests/blob/6aa08f8817fd347b3ccf4d18d29260ce2f57a0a1/clusterloader2/testing/load/daemonset-priorityclass.yaml). This is why the daemonset pods starve other pods during the scheduling phase.
I believe the main issue here is that one of the load test assumptions is that it should create pods with a given throughput (set via LOAD_TEST_THROUGHPUT). This assumption stopped being true when we introduced daemonsets. What's more, given that daemonsets have higher priority than any other pods in the tests, it's really hard to create/update them in parallel with other pods without running into this kind of issue.
I discussed this with Wojtek, and I believe we both agree that it might make sense to move the daemonset operations to a separate CL2 step. Because steps are executed serially, this will stop creating/updating daemonsets in parallel with other pods. It might make the load test minimally slower, but it will definitely help with the issue that Maciek found.
[ ] Move the daemonset creation phase to a separate step. We'll also need to wait for the daemonset pods to be created before we start creating other objects. To summarize, the current flow is:
It should be changed to:
[ ] Do the same for the daemonset rolling upgrade phase - https://github.com/kubernetes/perf-tests/blob/ece6960810e96260f6b99443c0280de4d384b13b/clusterloader2/testing/load/config.yaml#L598-L608
/good-first-issue /help-wanted
@mm4tt: This request has been marked as suitable for new contributors.
Please ensure the request meets the requirements listed here.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-good-first-issue command.
I was wondering why the actual graphs in perf-dash are counter-intuitive and I filed #2006 for that. This explains why the phases that involve schedule time seem much better than they are in reality.
The findings from the experiment with increased scheduling throughput:
With https://github.com/kubernetes/kubernetes/pull/108648 merged, we have mitigated the problem described above.
The outcome is that our pod-startup latency drastically dropped as visible on the graph:
We're now at the point where, for stateless pods (stateful ones are expected to take longer to start in many cases), our 99th percentile is generally around 6.5s.
I took a brief look into the last run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1505590326146895872
And I skimmed through 20 slow-to-start pods. All of them seem to be exhibiting exactly the same pattern, e.g. one group of slow pods:
I0320 19:05:34.406141 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-28/medium-deployment-15-778d5d76f6-r9t6s" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=501 feasibleNodes=500
I0320 19:05:34.950690 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-24/small-deployment-48-5d8c899cfb-tvjjd" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=500 feasibleNodes=500
I0320 19:05:35.190715 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-32/medium-deployment-8-6787c8f556-cbfjg" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=501 feasibleNodes=500
I0320 19:05:35.224297 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-46/big-deployment-1-54559d54b4-5pzjp" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=501 feasibleNodes=500
I0320 19:05:35.338402 11 scheduler.go:692] "Successfully bound pod to node" pod="test-htamfw-1/daemonset-0-24ntx" node="gce-scale-cluster-minion-group-2-hrmt" evaluatedNodes=1 feasibleNodes=1
So we're effectively scheduling 5 pods on the same node within <1s. This in turn results in slow operations at the containerd level, e.g.:
time="2022-03-20T19:05:35.533055508Z" level=info msg="RunPodsandbox for &PodSandboxMetadata{Name:medium-deployment-15-778d5d76f6-r9t6s,Uid:63ab1c9f-0cd8-4073-9676-26c32ac6a57c,Namespace:test-htamfw-28,Attempt:0,}"
time="2022-03-20T19:05:44.879770896Z" level=info msg="RunPodSandbox for &PodSandboxMetadata{Name:medium-deployment-15-778d5d76f6-r9t6s,Uid:63ab1c9f-0cd8-4073-9676-26c32ac6a57c,Namespace:test-htamfw-28,Attempt:0,} returns sandbox id \"ad6a4e66b7a7f4f203531a162b56153eb4fdadd7756b47cc1faaf2fb3c8495ef\""
[so we waited almost 10s for creation of pod-sandbox]
OTOH - we know from the containerd node-throughput test that containerd is able to keep up with a throughput of 5 pods/s: https://perf-dash.k8s.io/#/?jobname=containerd-node-throughput&metriccategoryname=E2E&metricname=PodStartup&Metric=pod_startup
In our case, the problem is purely resource-starvation at the node level (with containerd not having enough resources to perform its job fast enough, especially given that kube-proxy can consume a lot of resources too).
If we want to bring pod startup in our tests below 5s, we more or less have two options: 1) make the nodes bigger [which would be really expensive, especially given that we would have to bump all tests, including presubmits, etc.]; 2) somehow ensure that the churn on a given node will not be that high...
There is still the question of what caused the scheduler to put 5 pods on the same node so quickly - we know that one other pod was deleted right before that, but that should only result in 1 more pod jumping into its place. It's unclear to me why we got 5 pods scheduled to the same node within that short period (there were ~150 pods scheduled during that time).
A potentially useful piece of information is that this was also the time when the daemonset was being updated.
I took a brief look into the last run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1505590326146895872
where can I find the scheduler logs?
And I skimmed through 20 slow-to-start pods. All of them seem to be exhibiting exactly the same pattern, e.g. one group of slow pods:
Perhaps they were placed on a newly scaled-up node? By default the preference is to select nodes with lower usage.
where can I find the scheduler logs?
@ahg-g - you click on "Master and node logs" towards the bottom of the page, then on the GCS link, and then on the master - the logs are here: https://gcsweb.k8s.io/gcs/k8s-infra-scalability-tests-logs/ci-kubernetes-e2e-gce-scale-performance/1505590326146895872/gce-scale-cluster-master/
Perhaps they were placed on a newly scaled-up node? By default the preference is to select nodes with lower usage.
Nope - this cluster has a static size.
I looked a bit more into it, and all the long-starting pods I've seen were slow because a bunch of pods were starting on the same node at the same time. In other words - improving spreading should visibly help the situation here.
So the question is why we're spreading the pods so unevenly here (e.g. in the last 5k-node run, in the scheduling-throughput phase the 5k pods were scheduled on ~2850 nodes). And technically ~all nodes should be feasible.
Assuming I didn't miss anything above, it has to be related to the scoring functions - we have the following:
The above doesn't hold for the "create/update" phase, where not all pods are from the same ownerRef. In this situation, the first two would generally be the dominating ones. While we won't get perfect spreading here, having all pods consume an equal percentage of the node's RAM and CPU could help a lot (see the worked example below).
One could argue that this is tweaking the test to make it pass, but in reality we would just use bigger nodes to avoid resource starvation, and we don't want to do that here in order to reduce costs.
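For reference, the "equal percentage of RAM and CPU" condition simply means that the per-pod requests are in the same ratio as the node's allocatable resources (here $r_{cpu}, r_{mem}$ are the pod requests and $C, M$ the node allocatable; my notation, not from the test):

$$ \frac{r_{cpu}}{C} = \frac{r_{mem}}{M} $$

For instance, on a hypothetical 1000 mcpu / 4000 MB node the test's 5m/20MB requests satisfy this exactly ($5/1000 = 20/4000 = 0.5\%$), so both fractions would grow at the same rate as pods are added.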
I played a bit more with this in https://github.com/kubernetes/kubernetes/pull/109067
In particular this run of 5k presubmit: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109067/pull-kubernetes-e2e-gce-scale-performance-manual/1517436913894559744
Let's focus purely on the e2e-a6cf091dfb-bf35d-minion-group-1-75b5 node from that run.
For completeness, here are the pods that were scheduled onto that node throughout this run:
wojtekt@wojtekt-l:~/Downloads$ cat out.txt | grep -n "75b5"
1736:I0422 10:03:19.785536 11 schedule_one.go:264] "Successfully bound pod to node" pod="kube-system/metadata-proxy-v0.1-zvzhq" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
2062:I0422 10:03:22.952746 11 schedule_one.go:264] "Successfully bound pod to node" pod="kube-system/konnectivity-agent-jwhlp" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
3888:I0422 10:04:11.467082 11 schedule_one.go:264] "Successfully bound pod to node" pod="kube-system/coredns-6456b64994-l6mdc" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=674 feasibleNodes=674
14455:I0422 10:35:10.779761 11 schedule_one.go:264] "Successfully bound pod to node" pod="gce-pd-csi-driver/csi-gce-pd-node-8jtzn" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
77278:I0422 11:08:22.205770 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-45/medium-deployment-16-76bb9857d7-4ll2r" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
96733:I0422 11:15:47.509886 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-50/small-deployment-14-7d85df95c5-rtb46" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
102709:I0422 11:17:20.890787 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-40/small-deployment-95-6457576f46-c9djr" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
110619:I0422 11:19:46.709317 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-29/big-deployment-0-5789f8949c-v8t4r" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
114933:I0422 11:21:10.480875 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-46/big-deployment-0-5789f8949c-d56l4" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
117109:I0422 11:21:58.200584 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-16/small-deployment-122-86cb4bb847-w6m7j" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
127985:I0422 11:25:18.870483 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-24/medium-deployment-7-7d48959b7-bsqf6" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
129546:I0422 11:25:29.590314 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-1/daemonset-0-9zp4t" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
132280:I0422 11:25:50.848616 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-42/medium-deployment-17-6bf8597b66-shmvd" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
134716:I0422 11:26:45.950966 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-25/big-deployment-1-7c65455464-n9sdg" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
136920:I0422 11:27:33.499735 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-19/small-deployment-277-86667b5978-8c2rs" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
154684:I0422 11:32:55.710970 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-2/small-deployment-136-555b89c9dd-n4x5j" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
154775:I0422 11:32:58.614263 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-27/medium-deployment-18-797999c589-hw92h" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
157747:I0422 11:34:02.462108 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-46/big-deployment-1-7c65455464-xt4sx" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
157983:I0422 11:34:05.108176 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-42/small-deployment-270-579bbd889-b46fs" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158046:I0422 11:34:05.735784 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-46/small-deployment-212-677f957b89-zzrbz" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158073:I0422 11:34:06.011768 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-43/medium-deployment-17-6bf8597b66-b5gjn" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158164:I0422 11:34:07.269574 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-23/big-deployment-0-5789f8949c-77nlk" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158892:I0422 11:34:17.518477 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-6/medium-deployment-10-8d689bcd9-gkcn9" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
172077:I0422 11:52:31.584073 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-51/scheduler-throughput-deployment-2-54794b6b6-kwlkx" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
172104:I0422 11:52:31.816736 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-51/scheduler-throughput-deployment-0-7dc4b8f7cd-4rxvm" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
174159:I0422 11:59:48.878411 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-5/latency-deployment-63-655d645594-2c7p7" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
187413:I0422 12:36:12.706379 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-1/daemonset-0-8zvc9" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=1 feasibleNodes=1
In particular, what was causing the problem was this part of the above:
157983:I0422 11:34:05.108176 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-42/small-deployment-270-579bbd889-b46fs" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158046:I0422 11:34:05.735784 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-46/small-deployment-212-677f957b89-zzrbz" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158073:I0422 11:34:06.011768 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-43/medium-deployment-17-6bf8597b66-b5gjn" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
158164:I0422 11:34:07.269574 11 schedule_one.go:264] "Successfully bound pod to node" pod="test-92du0z-23/big-deployment-0-5789f8949c-77nlk" node="e2e-a6cf091dfb-bf35d-minion-group-1-75b5" evaluatedNodes=549 feasibleNodes=549
where we scheduled 4 pods within 2s on the same node (in that period of time we scheduled ~180 pods in total, so having 4 of them land on the same node is a bit unexpected).
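A quick sanity check on how unexpected this is (assuming placement were roughly uniform over the 549 feasible nodes): the number of those ~180 pods landing on one given node would be approximately Poisson with

$$ \lambda = \frac{180}{549} \approx 0.33, \qquad P(X \ge 4) \approx e^{-\lambda}\frac{\lambda^{4}}{4!} \approx 3\cdot 10^{-4}, $$

so even across all 549 nodes we'd expect essentially no node to receive 4 of them, suggesting the clustering comes from scoring rather than chance.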
So in that PR I added extended logging (which kills the test, but it was only meant to help with debugging). In particular, what caught my attention was this:
LLL: pod=test-92du0z-42/small-deployment-270-579bbd889-b46fs node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=72 TS=200 BA=96
LLL: pod=test-92du0z-46/small-deployment-212-677f957b89-zzrbz node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=71 TS=200 BA=97
LLL: pod=test-92du0z-43/medium-deployment-17-6bf8597b66-b5gjn node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=71 TS=200 BA=97
LLL: pod=test-92du0z-23/big-deployment-0-5789f8949c-77nlk node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=70 TS=200 BA=97
The NR and TS scores look reasonable to me, but BA (BalancedAllocation) is not what I expect.
In fact, all pods from all the deployments we have request cpu=5m and mem=20MB: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/deployment.yaml#L4-L5 https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/daemonset.yaml#L34-L35
So assuming that the initial pods (system addons and daemons, e.g. coreDNS, csi, ...) occupy A mcpu and B MB of RAM, then basically with n additional test pods we get:
cpu_requested = A + n*5
ram_requested = B + n*20
According to the code computing the balanced-allocation score: https://github.com/kubernetes/kubernetes/blob/master/pkg/scheduler/framework/plugins/noderesources/balanced_allocation.go#L115-L117
with n test pods we should get
std = | ((A + 5*n)/node_cpu - (B + 20*n)/node_ram) / 2 |
For e2-medium machines I thought we have 1000 mcpu and 4000 MB RAM, which would give us:
std = | ((A + 5*n)/1000 - (B + 20*n)/4000) / 2 | = | ((4*A + 20*n)/4000 - (B + 20*n)/4000) / 2 | = | (4*A + 20*n - (B + 20*n)) / 8000 | = | (4*A - B) / 8000 |
so a constant, independent of the number of additional test pods.
Whereas when grepping through logs for that, I'm getting very different values of BA, e.g.:
wojtekt@wojtekt-l:~/Downloads$ cat kube-scheduler.log | grep LLL | grep "75b5" | grep BA
...
LLL: pod=kube-system/coredns-6456b64994-gbqm9 node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=87 TS=200 BA=94
...
LLL: pod=probes/ping-server-7f5864b8c6-ppvn9 node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=73 TS=200 BA=93
...
LLL: pod=test-92du0z-25/small-deployment-209-678f5b4577-5fql8 node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=77 TS=200 BA=95
...
LLL: pod=test-92du0z-23/big-deployment-0-5789f8949c-dkb7z node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=70 TS=180 BA=97
...
LLL: pod=test-92du0z-3/latency-deployment-53-77574d7dfb-h6k9m node=e2e-a6cf091dfb-bf35d-minion-group-1-75b5 - NR=66 TS=200 BA=98
...
so very different scores.
So the next step is to validate whether the assumption about node allocatable is actually correct. If so, there seems to be some bug in the scheduler...
OK - so the assumption above is incorrect - allocatable from the e2-medium node:
I0425 13:29:01.356114 14825 cluster.go:86] Name: e2e-2053-62db2-minion-group-5rzq, clusterIP: 10.40.0.24, externalIP: , isSchedulable: true, allocatable: v1.ResourceList{"attachable-volumes-gce-pd":resource.Quantity{i:resource.int64Amount{value:15, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"15", Format:"DecimalSI"}, "cpu":resource.Quantity{i:resource.int64Amount{value:2, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"2", Format:"DecimalSI"}, "ephemeral-storage":resource.Quantity{i:resource.int64Amount{value:43554137631, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"43554137631", Format:"DecimalSI"}, "hugepages-2Mi":resource.Quantity{i:resource.int64Amount{value:0, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"0", Format:"DecimalSI"}, "memory":resource.Quantity{i:resource.int64Amount{value:3866206208, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"", Format:"BinarySI"}, "pods":resource.Quantity{i:resource.int64Amount{value:110, scale:0}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"110", Format:"DecimalSI"}}
So basically:
So continuing the above:
HardEvictionThresholds:[{Signal:memory.available Operator:LessThan Value:{Quantity:250Mi Percentage:0}
which effectively reduces the machine size by 250MB
So what is happening here is that currently the "BalancedAllocation" score goes in the opposite direction to "NodeResourcesFit" on many nodes (potentially leading to scheduling multiple pods on the same node within a short period of time).
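To illustrate the direction conflict numerically, here's a rough sketch using the BalancedAllocation formula above and a least-allocated-style resource score. The allocatable values (2000 mcpu, ~3866 MB) come from the node dump above; the baseline "system pods" requests (900 mcpu / 600 MB) are purely hypothetical, chosen only to show the trend:

```go
package main

import (
	"fmt"
	"math"
)

const (
	allocCPU = 2000.0 // mcpu, from the e2-medium node dump above
	allocMem = 3866.0 // MB, from the e2-medium node dump above

	baseCPU = 900.0 // hypothetical system-pod CPU requests (mcpu)
	baseMem = 600.0 // hypothetical system-pod memory requests (MB)

	podCPU = 5.0  // mcpu requested per test pod
	podMem = 20.0 // MB requested per test pod
)

// scores returns a BalancedAllocation-like and a least-allocated-like score
// (both in 0..100) for a node running n test pods on top of the baseline.
func scores(n float64) (ba, la float64) {
	cpuFrac := (baseCPU + n*podCPU) / allocCPU
	memFrac := (baseMem + n*podMem) / allocMem

	std := math.Abs(cpuFrac-memFrac) / 2 // std of two values
	ba = (1 - std) * 100

	la = (1 - (cpuFrac+memFrac)/2) * 100 // prefer less-utilized nodes
	return ba, la
}

func main() {
	for _, n := range []float64{0, 20, 60, 100} {
		ba, la := scores(n)
		fmt.Printf("n=%3.0f  BalancedAllocation~%.1f  LeastAllocated~%.1f\n", n, ba, la)
	}
	// As the node fills up, the two fractions converge, so BalancedAllocation
	// goes up while the least-allocated score goes down - opposite directions.
}
```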
The simplest way to mitigate this is to adjust the allocatable CPU by setting kubeReserved/systemReserved (the full 2 CPUs aren't really available anyway).
OK - so the above moved us a long way (in fact we got down to 5s for the 100th percentile based on the last two runs of the 5k-node test).
That said, we're still high on the main "pod startup latency".
I took a quick look at the last run: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/1519723567455932416
I looked at O(10) pods with the largest startup_time and almost all of them didn't have any other pods starting on the same node at the same time. So the problem no longer seems to be scheduling-related.
Picking one of the pods, I'm seeing the following in kubelet logs:
I0428 19:10:25.026495 1680 kubelet.go:2060] "SyncLoop ADD" source="api" pods=[test-4i97xb-45/medium-deployment-20-7857898896-bddht]
...
E0428 19:10:26.534520 1680 projected.go:192] Error preparing data for projected volume kube-api-access-b4wph for pod test-4i97xb-45/medium-deployment-20-7857898896-bddht: [failed to fetch token: serviceaccounts "default" is forbidden: User "system:node:gce-scale-cluster-minion-group-ph7w" cannot create resource "serviceaccounts/token" in API group "" in the namespace "test-4i97xb-45": no relationship found between node 'gce-scale-cluster-minion-group-ph7w' and this object, failed to sync configmap cache: timed out waiting for the condition]
...
E0428 19:10:28.025197 1680 nestedpendingoperations.go:335] Operation for "{volumeName:kubernetes.io/configmap/87f89c3c-3837-4bca-a2d6-c5430cfc411a-configmap podName:87f89c3c-3837-4bca-a2d6-c5430cfc411a nodeName:}" failed. No retries permitted until 2022-04-28 19:10:29.02516469 +0000 UTC m=+7114.679135177 (durationBeforeRetry 1s). Error: MountVolume.SetUp failed for volume "configmap" (UniqueName: "kubernetes.io/configmap/87f89c3c-3837-4bca-a2d6-c5430cfc411a-configmap") pod "medium-deployment-20-7857898896-bddht" (UID: "87f89c3c-3837-4bca-a2d6-c5430cfc411a") : failed to sync configmap cache: timed out waiting for the condition
...
I0428 19:10:29.026588 1680 operation_generator.go:703] "MountVolume.SetUp succeeded for volume \"configmap\" (UniqueName: \"kubernetes.io/configmap/87f89c3c-3837-4bca-a2d6-c5430cfc411a-configmap\") pod \"medium-deployment-20-7857898896-bddht\" (UID: \"87f89c3c-3837-4bca-a2d6-c5430cfc411a\") " pod="test-4i97xb-45/medium-deployment-20-7857898896-bddht"
It took longer than expected later too, but that may partially be a consequence of some backoffs or similar.
Looking into other pods, e.g. medium-deployment-2-c99c86f6b-ch699, gives pretty much the same picture.
Initially I was finding pods scheduled around the same time, but the problem isn't limited to a single time window. As an example, for the pod below we're also getting the same:
I0428 18:20:22.921344 1686 kubelet.go:2060] "SyncLoop ADD" source="api" pods=[test-4i97xb-2/small-deployment-62-6db75944b4-jz2gh]
...
E0428 18:20:24.194978 1686 projected.go:192] Error preparing data for projected volume kube-api-access-gbjrx for pod test-4i97xb-2/small-deployment-62-6db75944b4-jz2gh: [failed to fetch token: serviceaccounts "default" is forbidden: User "system:node:gce-scale-cluster-minion-group-ds35" cannot create resource "serviceaccounts/token" in API group "" in the namespace "test-4i97xb-2": no relationship found between node 'gce-scale-cluster-minion-group-ds35' and this object, failed to sync configmap cache: timed out waiting for the condition]
...
I0428 18:20:27.421608 1686 operation_generator.go:703] "MountVolume.SetUp succeeded for volume \"secret\" (UniqueName: \"kubernetes.io/secret/7739fba6-34f5-46be-9f6c-79c9f8477a5d-secret\") pod \"small-deployment-62-6db75944b4-jz2gh\" (UID: \"7739fba6-34f5-46be-9f6c-79c9f8477a5d\") " pod="test-4i97xb-2/small-deployment-62-6db75944b4-jz2gh"
So it seems that latency in the node-authorizer is the biggest bottleneck for pod startup. I will try to look a bit into it.
With the PR decreasing the indexing threshold in the node-authorizer, this run: https://prow.k8s.io/view/gs/kubernetes-jenkins/pr-logs/pull/109067/pull-kubernetes-e2e-gce-scale-performance-manual/1521023813385457664
no longer seems to have those errors, but it didn't really move the needle... It now seems to mostly boil down to slow nodes on the container-runtime side...
@k8s-triage-robot: Closing this issue, marking it as "Not Planned".
/remove-lifecycle rotten /reopen
@wojtek-t: Reopened this issue.
/remove-lifecycle stale /lifecycle frozen
PodStartup from SaturationPhase significantly dropped recently:
This seems to be the diff between those two runs: https://github.com/kubernetes/kubernetes/compare/bded41a81...1700acb03
We should understand why that happened; if this is expected (and not a bug somewhere), it would potentially allow us to achieve our long-standing goal of getting rid of the latency phase from the density test completely.
@mm4tt @mborsz - FYI