I think the resource limits we have in place for the presubmit variants of these jobs are what we should be using
CPU limit utilization for pull-kubernetes-e2e-kind
CPU limit utilization for ci-kubernetes-kind-e2e.*
Memory limit utilization for pull-kubernetes-e2e-kind
Memory limit utilization for ci-kubernetes-kind-e2e.*
If we then find we're experiencing a lot of test flakiness (but fewer jobs in error state), we may want to consider whether it's time to look into the extra complexity of scheduling merge-blocking and release-blocking jobs to separate nodepools (or clusters).
Or, we could see if there's some way we can overcome whatever I/O limits we're hitting (ref: https://github.com/kubernetes/k8s.io/issues/1187)
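For anyone comparing numbers later, here's a rough way to spot-check what the kind presubmit pods actually request and are limited to on the build cluster. This is just a sketch: it assumes kubectl access to the k8s-infra-prow-build cluster, that test pods land in the test-pods namespace, and that prow's standard prow.k8s.io/job pod label is present.

```sh
# Spot-check CPU/memory requests and limits for pull-kubernetes-e2e-kind pods.
# Assumes: kubectl context pointed at the build cluster, prow's test-pods
# namespace, and the prow.k8s.io/job label prow puts on test pods.
kubectl -n test-pods get pods -l prow.k8s.io/job=pull-kubernetes-e2e-kind \
  -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
```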
.... ok, https://prow.k8s.io/?job=pull-kubernetes-e2e-kind made me reconsider setting cpu to 4
I matched the memory limit but set CPU to 7 instead of 4, based on the fact that integration and verify are scheduling: https://github.com/kubernetes/test-infra/pull/19081
As for integration/verify scheduling... they are exhibiting weird duplicate/close testgrid entries, but I think this is actually an unrelated issue, I'm seeing that across a bunch of testgrid jobs, going to open a separate issue (EDIT: opened https://github.com/kubernetes/test-infra/issues/19082 and whatever it is seems to have resolved)
/priority critical-urgent, since these are release-blocking jobs
@spiffxp: The label(s) priority/critical-urgent cannot be applied, because the repository doesn't have them
/priority critical-urgent
testgrid alerts have lit up like a christmas tree across multiple branches 😞
I'm not going to have much time to dig into this today.
After https://github.com/kubernetes/test-infra/pull/19081 merged at 2020-08-31 19:35 PT, CI jobs started scheduling, but consistently failing on one of a few different tests.
If this was purely due to load, I would have expected to see some green from the CI jobs during quieter hours. Based on what load looked like over the past 24h, I would have especially expected it between 10pm-12am PT. From https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&timeDomain=1d:
On top of this, asking for and getting 7 cpu is still basically equivalent to getting an entire n1-highmem-8 node, since there are no jobs scheduled to this cluster that ask for <1 cpu.
So I am suspicious of a regression or two that only kind is catching with any frequency. We may also be I/O bound in some way that is exacerbating this, but I would suggest taking a look at what's merged in kubernetes/kubernetes lately.
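For context on the "basically an entire node" point: an n1-highmem-8 has 8 vCPUs and 52GB of memory, and allocatable CPU is somewhat below 8 once system/kubelet reservations are subtracted, so a pod requesting 7 CPU leaves no room for any other ≥1 CPU job on that node. A quick way to eyeball that (sketch, assumes cluster access):

```sh
# Compare capacity vs. allocatable CPU on the build cluster's nodes; on
# n1-highmem-8 machines allocatable is a bit under the 8 vCPU capacity, so a
# 7 CPU request effectively claims the whole node.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,CPU_CAP:.status.capacity.cpu,CPU_ALLOC:.status.allocatable.cpu,MEM_ALLOC:.status.allocatable.memory'
```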
Recent merges don't explain the similar spike in failures on release branches going back to 1.17
Most of these jobs were changed to NOT request that much CPU?
FTR pull-kubernetes-e2e-gce runs on 1x n1 (control plane) and 3x n2 (workers), in addition to the prow pod running e2e, for 7+ CPU total, and the equivalent postsubmit jobs do the same. It also relies on dedicated disks and more RAM.
So while bin-packing these is nice, even for just local reproducibility, autoscaling vs boskos is a win, and a 7+ CPU request is no more expensive than a typical cloudy e2e...
/cc
kubelet event-related tests went haywire for all currently tested releases: https://testgrid.k8s.io/sig-release-1.17-blocking#kind-1.17-parallel https://testgrid.k8s.io/sig-release-1.18-blocking#kind-1.18-parallel https://testgrid.k8s.io/sig-release-1.19-blocking#kind-1.19-parallel https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel
seemingly only within the past day or so.
https://github.com/kubernetes-sigs/kind/commits/master: not a lot has changed here recently; in fact, nothing has changed in the past 3 days. We did start seeing the scheduling issues around the same time these tests flaked, though.
Maybe our CI node fixup daemonsets are not running? (e.g. the one that increases inotify watches?)
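If someone wants to rule that out quickly, something like the following should do it. Sketch only: the grep patterns are guesses at the daemonset names, and it assumes the fixups are applied via sysctl.

```sh
# Are the node fixup daemonsets present and fully scheduled?
kubectl get daemonsets -A | grep -i -e fixup -e sysctl
# On a node (via SSH or a privileged debug pod), check the inotify limits the
# fixup is supposed to raise.
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
```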
is it possible these changes made jobs that previously were serialized run in parallel:
@alvaroaleman ^
@liggitt crier just reports jobs; it doesn't do anything about their execution, that is plank's job. Do we have any indications that jobs that should be serialized are getting run in parallel?
Do we have any indications that jobs that should be serialized are getting run in parallel?
Not in particular, just looking at test-infra changes that merged around the timeframe things went sideways, and the locking changes stood out. I'm not familiar with how the CI components interact, so just wanted to check what the implications of those changes were, especially this comment:
In 00562873e0c597be9d186ebc3887c3fa0690400c we introduced PR-Level locking for presubmits to the crier github reporter in order to allow running it with multiple workers without having the workers race with each other when creating the GitHub comment.
Is it possible that lock in crier was unintentionally serializing some aspect of the CI queue?
Looking at the triage board and testgrids, scheduling failures appear to have started to affect jobs between ~12:25 ET and 1:25 ET, and the "Events should be sent by kubelets and the scheduler" failures ~3:30 ET
Good:
Bad:
Is it possible that lock in crier was unintentionally serializing some aspect of the CI queue?
No, it's a distinct binary. Furthermore, the locking is only done for presubmit jobs and the jobs here seem to be periodics. Looking at the examples, it seems the cluster is out of capacity: all of them are podPendingTimeout (which, if set too short, could be caused by scale-up taking too much time) or fail to schedule altogether.
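If someone with access wants to confirm the capacity theory, something along these lines would show whether pods are sitting Pending and whether the autoscaler is reacting (sketch; assumes prow's test-pods namespace on the build cluster):

```sh
# Pods stuck Pending are the ones that will eventually hit podPendingTimeout.
kubectl -n test-pods get pods --field-selector=status.phase=Pending
# Scheduler and cluster-autoscaler events usually say why: FailedScheduling vs.
# TriggeredScaleUp / NotTriggerScaleUp.
kubectl -n test-pods get events --sort-by=.lastTimestamp \
  | grep -i -E 'FailedScheduling|TriggeredScaleUp|NotTriggerScaleUp' | tail -n 20
```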
the locking is only done for presubmit jobs and the jobs here seem to be periodics.
presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?
presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?
Sure, if they run on the same cluster (which I guess they do?). Do you have access to that cluster? Are there any graphs that show cluster load?
@aojea may have identified the culprit; he noticed lots of logs in the kind kubelet logs of the form:
Sep 01 19:07:11 kind-worker kubelet[611]: E0901 19:07:11.320994 611 dns.go:125] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: volume-expand-4184-7003.svc.cluster.local svc.cluster.local cluster.local test-pods.svc.cluster.local us-central1-b.c.k8s-infra-prow-build.internal c.k8s-infra-prow-build.internal
cloud logging link shows "us-central1-b.c.k8s-infra-prow-build.internal" didn't show up before 2020-08-31 ~10:50am
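For background on why that matters: resolvers and the kubelet cap the DNS search line (historically 6 domains / 256 characters), and pods in the nested kind cluster inherit the build cluster's search paths (test-pods.svc.cluster.local plus the .internal suffixes) underneath kind's own cluster domains, so the extra zonal suffix (us-central1-b.c.k8s-infra-prow-build.internal) pushed things over the limit and search paths started getting silently omitted. A quick way to see it (sketch; the pod name is illustrative):

```sh
# The host node's search line; the zonal suffix appeared here after the
# metadata change.
grep ^search /etc/resolv.conf
# Inside a pod on the kind cluster, the kubelet layers the cluster's own search
# domains on top of what it inherited, which is what overflows the limit.
kubectl exec <some-test-pod> -- grep ^search /etc/resolv.conf
```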
Digging through audit logs, it turns out I fat-fingered setting VmDnsSetting=ZonalPreferred on the wrong GCP project's common instance metadata (I intended to set it on my personal project). This happened at 2020-08-31 10:37am PT, which lines up with when the jobs started failing.
I deleted it about 15 minutes ago. Checking a random node's /etc/resolv.conf showed the change had taken effect. It may take a bit for jobs to reflect this.
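For posterity, the cleanup amounts to removing the key from the project's common instance metadata; something like this (a sketch, not the exact command I ran):

```sh
# Remove the zonal DNS preference from project-wide instance metadata.
gcloud compute project-info remove-metadata \
  --project=k8s-infra-prow-build \
  --keys=VmDnsSetting
```

Existing VMs pick the change up as /etc/resolv.conf gets refreshed, which matches the node spot-check above; new test pods should then stop inheriting the zonal search path.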
I suspect we're about to go through this
https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind showed an immediate dramatic improvement
I'll hold this open until I see passes on the release-branch CI jobs
I see seas of green in the recent runs
https://github.com/kubernetes-sigs/kind/issues/1860 to track mitigating in the future
What happened:
Seeing a similar pattern for all other kind release-blocking jobs.
Some, but not all, of the failures are of the variety "There are no nodes that your pod can schedule to", according to spyglass.
I suspect what is happening is the following:
What you expected to happen:
I expected cluster autoscaling to spin up new nodes if none were available with sufficient capacity for these jobs.
How to reproduce it (as minimally and precisely as possible):
Please provide links to example occurrences, if any: e.g. for https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel&width=20
Anything else we need to know?:
EDIT: turns out the PR jobs were failing a lot too, though on a few specific test cases instead of hitting pod pending (ref: https://github.com/kubernetes/test-infra/issues/19080#issuecomment-684156938)
The increased PR / job traffic caused by v1.20 PRs getting merged is almost certainly exacerbating this