I think the resource limits we have in place for the presubmit variants of these jobs are what we should be using
CPU limit utilization for pull-kubernetes-e2e-kind
CPU limit utilization for ci-kubernetes-kind-e2e.*
Memory limit utilization for pull-kubernetes-e2e-kind
Memory limit utilization for ci-kubernetes-kind-e2e.*
If we then find we're experiencing a lot of test flakiness (but fewer jobs in error state), we may want to consider whether it's time to look into the extra complexity of scheduling merge-blocking and release-blocking jobs to separate nodepools (or clusters).
Or, we could see if there's some way we can overcome whatever I/O limits we're hitting (ref: https://github.com/kubernetes/k8s.io/issues/1187)
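For anyone comparing numbers later, here's a rough way to spot-check what the kind presubmit pods actually request and are limited to on the build cluster. This is just a sketch: it assumes kubectl access to the k8s-infra-prow-build cluster, that test pods land in the test-pods namespace, and that prow's standard prow.k8s.io/job pod label is present.

```sh
# Spot-check CPU/memory requests and limits for pull-kubernetes-e2e-kind pods.
# Assumes: kubectl context pointed at the build cluster, prow's test-pods
# namespace, and the prow.k8s.io/job label prow puts on test pods.
kubectl -n test-pods get pods -l prow.k8s.io/job=pull-kubernetes-e2e-kind \
  -o custom-columns='NAME:.metadata.name,CPU_REQ:.spec.containers[*].resources.requests.cpu,CPU_LIM:.spec.containers[*].resources.limits.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory,MEM_LIM:.spec.containers[*].resources.limits.memory'
```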
.... ok, https://prow.k8s.io/?job=pull-kubernetes-e2e-kind made me reconsider setting cpu to 4
I matched the memory limit but set CPU to 7 instead of 4, based on the fact that integration and verify are scheduling: https://github.com/kubernetes/test-infra/pull/19081
As for integration/verify scheduling... they are exhibiting weird duplicate/close testgrid entries, but I think this is actually an unrelated issue, I'm seeing that across a bunch of testgrid jobs, going to open a separate issue (EDIT: opened https://github.com/kubernetes/test-infra/issues/19082 and whatever it is seems to have resolved)
/priority critical-urgent, since these are release-blocking jobs
@spiffxp: The label(s) priority/critical-urgent cannot be applied, because the repository doesn't have them
/priority critical-urgent
testgrid alerts have lit up like a christmas tree across multiple branches 😞
I'm not going to have much time to dig into this today.
After https://github.com/kubernetes/test-infra/pull/19081 merged at 2020-08-31 19:35 PT, CI jobs started scheduling, but consistently failing on one of a few different tests.
If this was purely due to load, I would have expected to see some green from the CI jobs during quieter hours. Based on what load looked like over the past 24h, I would have especially expected it between 10pm-12am PT. From https://console.cloud.google.com/monitoring/dashboards/custom/10925237040785467832?project=k8s-infra-prow-build&timeDomain=1d:
On top of this, asking for and getting 7 cpu is still basically equivalent to getting an entire n1-highmem-8 node, since there are no jobs scheduled to this cluster that ask for <1 cpu.
So I am suspicious of a regression or two that only kind is catching with any frequency. We may also be I/O bound in some way that is exacerbating this, but I would suggest taking a look at what's merged in kubernetes/kubernetes lately.
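For context on the "basically an entire node" point: an n1-highmem-8 has 8 vCPUs and 52GB of memory, and allocatable CPU is somewhat below 8 once system/kubelet reservations are subtracted, so a pod requesting 7 CPU leaves no room for any other ≥1 CPU job on that node. A quick way to eyeball that (sketch, assumes cluster access):

```sh
# Compare capacity vs. allocatable CPU on the build cluster's nodes; on
# n1-highmem-8 machines allocatable is a bit under the 8 vCPU capacity, so a
# 7 CPU request effectively claims the whole node.
kubectl get nodes \
  -o custom-columns='NAME:.metadata.name,CPU_CAP:.status.capacity.cpu,CPU_ALLOC:.status.allocatable.cpu,MEM_ALLOC:.status.allocatable.memory'
```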
Recent merges don't explain the similar spike in failures on release branches going back to 1.17
Most of these jobs were changed to NOT request that much CPU?
FTR pull-kubernetes-e2e-gce runs on 1x n1 (control plane) and 3x n2 (workers), in addition to the prow pod running e2e, for 7+ CPU total, and the equivalent postsubmit jobs do the same. It also relies on dedicated disks and more RAM.
So while bin-packing these is nice, even for just local reproducibility, autoscaling vs boskos is a win, and a 7+ CPU request is no more expensive than a typical cloudy e2e...
/cc
kubelet event-related tests went haywire for all currently tested releases: https://testgrid.k8s.io/sig-release-1.17-blocking#kind-1.17-parallel https://testgrid.k8s.io/sig-release-1.18-blocking#kind-1.18-parallel https://testgrid.k8s.io/sig-release-1.19-blocking#kind-1.19-parallel https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel
seemingly only within the past day or so.
https://github.com/kubernetes-sigs/kind/commits/master: not a lot has changed here recently; in fact, nothing has changed in the past 3 days. We did start seeing the scheduling issues around the same time these tests flaked, though.
Maybe our CI node fixup daemonsets are not running? (e.g. the one that increases inotify watches?)
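If someone wants to rule that out quickly, something like the following should do it. Sketch only: the grep patterns are guesses at the daemonset names, and it assumes the fixups are applied via sysctl.

```sh
# Are the node fixup daemonsets present and fully scheduled?
kubectl get daemonsets -A | grep -i -e fixup -e sysctl
# On a node (via SSH or a privileged debug pod), check the inotify limits the
# fixup is supposed to raise.
sysctl fs.inotify.max_user_watches fs.inotify.max_user_instances
```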
is it possible these changes made jobs that previously were serialized run in parallel:
@alvaroaleman ^
@liggitt crier just reports jobs; it doesn't do anything about their execution, that is plank's job. Do we have any indications that jobs that should be serialized are getting run in parallel?
Do we have any indications that jobs that should be serialized are getting run in parallel?
Not in particular, just looking at test-infra changes that merged around the timeframe things went sideways, and the locking changes stood out. I'm not familiar with how the CI components interact, so just wanted to check what the implications of those changes were, especially this comment:
In 00562873e0c597be9d186ebc3887c3fa0690400c we introduced PR-Level locking for presubmits to the crier github reporter in order to allow running it with multiple workers without having the workers race with each other when creating the GitHub comment.
Is it possible that lock in crier was unintentionally serializing some aspect of the CI queue?
Looking at the triage board and testgrids, scheduling failures appear to have started to affect jobs between ~12:25 ET and 1:25 ET, and the "Events should be sent by kubelets and the scheduler" failures ~3:30 ET
Good:
Bad:
Is it possible that lock in crier was unintentionally serializing some aspect of the CI queue?
No, it's a distinct binary. Furthermore, the locking is only done for presubmit jobs and the jobs here seem to be periodics. Looking at the examples, it seems the cluster is out of capacity: all of them are podPendingTimeout (which, if set too short, could be caused by scale-up taking too much time) or fail to schedule altogether.
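If someone with access wants to confirm the capacity theory, something along these lines would show whether pods are sitting Pending and whether the autoscaler is reacting (sketch; assumes prow's test-pods namespace on the build cluster):

```sh
# Pods stuck Pending are the ones that will eventually hit podPendingTimeout.
kubectl -n test-pods get pods --field-selector=status.phase=Pending
# Scheduler and cluster-autoscaler events usually say why: FailedScheduling vs.
# TriggeredScaleUp / NotTriggerScaleUp.
kubectl -n test-pods get events --sort-by=.lastTimestamp \
  | grep -i -E 'FailedScheduling|TriggeredScaleUp|NotTriggerScaleUp' | tail -n 20
```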
the locking is only done for presubmit jobs and the jobs here seem to be periodics.
presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?
presubmits are also badly affected by this... if we're experiencing a scheduling issue, could something letting presubmits swamp nodes affect periodics?
Sure, if they run on the same cluster (which I guess they do?). Do you have access to that cluster? Are there any graphs that show cluster load?
@aojea may have identified the culprit; he noticed lots of logs in the kind kubelet logs of the form:
Sep 01 19:07:11 kind-worker kubelet[611]: E0901 19:07:11.320994 611 dns.go:125] Search Line limits were exceeded, some search paths have been omitted, the applied search line is: volume-expand-4184-7003.svc.cluster.local svc.cluster.local cluster.local test-pods.svc.cluster.local us-central1-b.c.k8s-infra-prow-build.internal c.k8s-infra-prow-build.internal
cloud logging link shows "us-central1-b.c.k8s-infra-prow-build.internal" didn't show up before 2020-08-31 ~10:50am
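For background on why that matters: resolvers and the kubelet cap the DNS search line (historically 6 domains / 256 characters), and pods in the nested kind cluster inherit the build cluster's search paths (test-pods.svc.cluster.local plus the .internal suffixes) underneath kind's own cluster domains, so the extra zonal suffix (us-central1-b.c.k8s-infra-prow-build.internal) pushed things over the limit and search paths started getting silently omitted. A quick way to see it (sketch; the pod name is illustrative):

```sh
# The host node's search line; the zonal suffix appeared here after the
# metadata change.
grep ^search /etc/resolv.conf
# Inside a pod on the kind cluster, the kubelet layers the cluster's own search
# domains on top of what it inherited, which is what overflows the limit.
kubectl exec <some-test-pod> -- grep ^search /etc/resolv.conf
```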
Digging through audit logs, it turns out I fat-fingered setting VmDnsSetting=ZonalPreferred on the wrong GCP project's common instance metadata (I intended to set it on my personal project). This happened at 2020-08-31 10:37am PT, which lines up with when the jobs started failing.
I deleted it about 15 minutes ago. Checking a random node's /etc/resolv.conf showed the change had taken effect. It may take a bit for jobs to reflect this.
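For posterity, the cleanup amounts to removing the key from the project's common instance metadata; something like this (a sketch, not the exact command I ran):

```sh
# Remove the zonal DNS preference from project-wide instance metadata.
gcloud compute project-info remove-metadata \
  --project=k8s-infra-prow-build \
  --keys=VmDnsSetting
```

Existing VMs pick the change up as /etc/resolv.conf gets refreshed, which matches the node spot-check above; new test pods should then stop inheriting the zonal search path.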
I suspect we're about to go through this
https://testgrid.k8s.io/presubmits-kubernetes-blocking#pull-kubernetes-e2e-kind showed an immediate dramatic improvement
I'll hold this open until I see passes on the release-branch CI jobs
I see seas of green in the recent runs
https://github.com/kubernetes-sigs/kind/issues/1860 to track mitigating in the future
What happened:
Seeing a similar pattern for all other kind release-blocking jobs.
Some, but not all, of the failures are of the variety "There are no nodes that your pod can schedule to", according to spyglass.
I suspect what is happening is the following:
What you expected to happen:
I expected cluster autoscaling to spin up new nodes if none were available with sufficient capacity for these jobs.
How to reproduce it (as minimally and precisely as possible):
Please provide links to example occurrences, if any: e.g. for https://testgrid.k8s.io/sig-release-master-blocking#kind-master-parallel&width=20
Anything else we need to know?:
EDIT: turns out the PR jobs were failing a lot too, though on a few specific test cases instead of hitting pod pending (ref: https://github.com/kubernetes/test-infra/issues/19080#issuecomment-684156938)
The increased PR / job traffic caused by v1.20 PRs getting merged is almost certainly exacerbating this