Closed: mm4tt closed this issue 5 years ago
/priority critical-urgent
@fejta, could you take a look or reassign? /assign @fejta
/sig testing
@mm4tt FYI the best place to escalate something like this is in #testing-ops on Slack, pinging @test-infra-oncall. I've shot a message over here
/milestone v1.14
/unassign @fejta (not available at the moment)
/assign @amwat (as 1.14 test-infra lead, and currently on-call per go.k8s.io/oncall)
There are other jobs behaving similarly to this one, i.e. they are scheduled and run when they shouldn't be.
could you list them?
Strangely https://prow.k8s.io/?job=ci-kubernetes-e2e-gke-large-performance-regional has one entry at Mar 03 00:01:39
https://prow.k8s.io/rerun?prowjob=931c4290-3d8a-11e9-9c9a-0a580a6c0e78
https://testgrid.k8s.io/sig-scalability-gke#gke-large-performance-regional
@BenTheElder I recently added more logging to horologium to discern why a job was triggered -- what are the logs saying there?
All of the entries in testgrid are showing the same pod ID on deck, 931c4290-3d8a-11e9-9c9a-0a580a6c0e78
(click "more info" on the "Test started ..." box)
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-large-performance-regional/161
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-large-performance-regional/162
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-large-performance-regional/163
https://prow.k8s.io/view/gcs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gke-large-performance-regional/164
the logs for horologium don't seem to show much so far (stackdriver export for the horologium pod, text:ci-kubernetes-e2e-gke-large-performance-regional, back to 3/2/19 2:00:00 AM)
plank for text:931c4290-3d8a-11e9-9c9a-0a580a6c0e78
🤔
Wait, so were there actually multiple versions of the test running at once?
Matt Matejczyk FYI the best place to escalate something like this is in #testing-ops on Slack, pinging @test-infra-oncall. I've shot a message over here
Thanks, @stevekuznetsov. Will keep that in mind for the future.
There are other jobs behaving similarly to this one, i.e. they are scheduled and run when they shouldn't be.
could you list them?
@BenTheElder, other examples
name: ci-kubernetes-e2e-gce-scale-performance config
Job should be run once Mon-Fri, but recently there are days when it's run twice or thrice:
name: ci-kubernetes-e2e-gke-large-performance config
Job is supposed to run once every Sunday, but yesterday it was launched twice:
There are probably a few more.
Were you able to figure out what is going on?
/cc
component: "plank"
job: "ci-kubernetes-e2e-gce-scale-performance"
level: "info"
msg: "Pod is missing, starting a new pod"
We probably hit the OOMKilled again?
E I0304 18:02:18.055] Call: gsutil -q -h Content-Type:application/json -h x-goog-if-generation-match:1551706405925783 cp /tmp/gsutil_h1mRnW gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/jobResultsCache.json
E I0304 18:02:19.881] process 693067 exited with code 0 after 0.0m
E I0304 18:02:19.884] Call: gsutil -q -h Content-Type:application/json cp /tmp/gsutil_7WyV5x gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/324/finished.json
E I0304 18:02:21.640] process 693245 exited with code 0 after 0.0m
E I0304 18:02:21.641] Call: gsutil -q -h Content-Type:text/plain -h 'Cache-Control:private, max-age=0, no-transform' cp /tmp/gsutil_tpumdl gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/latest-build.txt
E I0304 18:02:23.301] process 693423 exited with code 0 after 0.0m
E I0304 18:02:23.302] Call: gsutil -q cp -Z /workspace/build-log.txt gs://kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-scale-performance/324/build-log.txt
E I0304 18:02:25.317] process 693601 exited with code 0 after 0.0m
E + EXIT_VALUE=1
E + set +o xtrace
E Cloning into 'test-infra'...
E Activated service account credentials for: [pr-kubekins@kubernetes-jenkins-pull.iam.gserviceaccount.com]
E fatal: Not a git repository (or any of the parent directories): .git
@stevekuznetsov it seems the pod finished and exited properly? Seems like a bug in plank?
Horologium triggered the job properly afaik
also - no associated logs in sinker (so - who deleted the pod?)
@krzyzacy what was the behavior? Plank will create a Pod if one does not exist and the ProwJob is not marked in some completed state, can you try to determine via logs how the pod exited and what the state of the prowjob was at the time?
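For reference, a minimal Go sketch of the behavior described here (illustrative types and names only, not the actual plank source): if a ProwJob is not in a completed state and its pod cannot be found, plank starts a new pod and logs the "Pod is missing" message seen above.

```go
package sketch

import (
	"errors"
	"fmt"
)

// Illustrative stand-ins for the real ProwJob/pod client types.
type ProwJob struct {
	Name     string
	Complete bool // true once the job reached a terminal state
}

type PodClient interface {
	GetPod(name string) error    // returns ErrNotFound if the pod is gone
	CreatePod(name string) error // creates a pod named after the ProwJob
}

var ErrNotFound = errors.New("pod not found")

// syncPendingJob mirrors the described behavior: if the ProwJob is not in a
// completed state and its pod cannot be found, start a new pod (and log the
// "Pod is missing" message seen in the plank logs above).
func syncPendingJob(pj ProwJob, pods PodClient) error {
	if pj.Complete {
		return nil // finished jobs are left alone
	}
	err := pods.GetPod(pj.Name)
	switch {
	case errors.Is(err, ErrNotFound):
		fmt.Printf("Pod is missing, starting a new pod: %s\n", pj.Name)
		return pods.CreatePod(pj.Name)
	case err != nil:
		return err
	default:
		return nil // pod exists; nothing to do this sync
	}
}
```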
The pod exited with 1 (with E + EXIT_VALUE=1) I believe...
I think the problem occurs after https://github.com/kubernetes/test-infra/pull/11477? (Feb. 26 according to @mm4tt's screenshot)
And I think the prowjob was still in pending state as I don't see any other state transition log:
2019-03-04 00:01:53.000 PST
{"msg":"CreatePod({{ } {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[prow.k8s.io/id:bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 created-by-prow:true prow.k8s.io/type:periodic prow.k8s.io/job:ci-kubernetes-e2e-gce-scale-performance preset-e2e-scalability-common:t…
2019-03-04 00:01:53.000 PST
{"component":"plank","type":"periodic","from":"triggered","msg":"Transitioning states.","to":"pending","job":"ci-kubernetes-e2e-gce-scale-performance","level":"info","name":"bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78"}
2019-03-04 00:01:53.000 PST
{"msg":"ReplaceProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78, {{ProwJob prow.k8s.io/v1} {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 default /apis/prow.k8s.io/v1/namespaces/default/prowjobs/bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 bc8b31ae-3e53-11e9-898b-42010a80003a 189526855 1 2019-03-04 08:01:37 +0000 UTC <…
2019-03-04 00:01:54.000 PST
{"component":"plank","msg":"GetProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78)","client":"kube","level":"debug"}
2019-03-04 00:01:54.000 PST
{"component":"plank","msg":"ReplaceProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78, {{ProwJob prow.k8s.io/v1} {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 default /apis/prow.k8s.io/v1/namespaces/default/prowjobs/bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 bc8b31ae-3e53-11e9-898b-42010a80003a 189526926 1 2019-03-04 …
2019-03-04 05:33:53.000 PST
{"msg":"CreatePod({{ } {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[prow.k8s.io/type:periodic prow.k8s.io/job:ci-kubernetes-e2e-gce-scale-performance preset-e2e-scalability-common:true preset-k8s-ssh:true preset-service-account:true prow.k8s.io/id:bc8b0b…
2019-03-04 05:33:53.000 PST
{"msg":"Pod is missing, starting a new pod","job":"ci-kubernetes-e2e-gce-scale-performance","level":"info","name":"bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78","type":"periodic","component":"plank"}
2019-03-04 05:33:53.000 PST
{"component":"plank","msg":"ReplaceProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78, {{ProwJob prow.k8s.io/v1} {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 default /apis/prow.k8s.io/v1/namespaces/default/prowjobs/bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 bc8b31ae-3e53-11e9-898b-42010a80003a 189526934 1 2019-03-04 …
2019-03-04 05:33:53.000 PST
{"component":"plank","msg":"GetProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78)","client":"kube","level":"debug"}
2019-03-04 05:33:53.000 PST
{"component":"plank","msg":"ReplaceProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78, {{ProwJob prow.k8s.io/v1} {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 default /apis/prow.k8s.io/v1/namespaces/default/prowjobs/bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 bc8b31ae-3e53-11e9-898b-42010a80003a 189621384 1 2019-03-04 …
2019-03-04 10:02:54.000 PST
{"client":"kube","level":"debug","component":"plank","msg":"CreatePod({{ } {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 0 0001-01-01 00:00:00 +0000 UTC <nil> <nil> map[prow.k8s.io/id:bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 created-by-prow:true prow.k8s.io/type:periodic prow.k8s.io/job:ci-kubernetes-e2e-g…
2019-03-04 10:02:54.000 PST
{"msg":"Pod is missing, starting a new pod","job":"ci-kubernetes-e2e-gce-scale-performance","level":"info","name":"bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78","component":"plank","type":"periodic"}
2019-03-04 10:02:54.000 PST
{"component":"plank","msg":"ReplaceProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78, {{ProwJob prow.k8s.io/v1} {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 default /apis/prow.k8s.io/v1/namespaces/default/prowjobs/bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 bc8b31ae-3e53-11e9-898b-42010a80003a 189621384 1 2019-03-04 …
2019-03-04 10:02:55.000 PST
{"msg":"GetProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78)","client":"kube","level":"debug","component":"plank"}
2019-03-04 10:02:55.000 PST
{"component":"plank","msg":"ReplaceProwJob(bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78, {{ProwJob prow.k8s.io/v1} {bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 default /apis/prow.k8s.io/v1/namespaces/default/prowjobs/bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78 bc8b31ae-3e53-11e9-898b-42010a80003a 189698498 1 2019-03-04 …
Edit: just pasted the full node log here:
Mar 04 18:02:25 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:25Z" level=info msg="Finish piping stderr of container "edeb1523a687b7e5a80ca831b9760a1a6328be767b35a3197d6919752681fc2b""
Mar 04 18:02:25 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:25Z" level=info msg="Finish piping stdout of container "edeb1523a687b7e5a80ca831b9760a1a6328be767b35a3197d6919752681fc2b""
Mar 04 18:02:25 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:25Z" level=error msg="collecting metrics for edeb1523a687b7e5a80ca831b9760a1a6328be767b35a3197d6919752681fc2b" error="cgroups: cgroup deleted"
Mar 04 18:02:25 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:25Z" level=info msg="shim reaped" id=edeb1523a687b7e5a80ca831b9760a1a6328be767b35a3197d6919752681fc2b
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:26.074868 1260 kubelet.go:1883] SyncLoop (PLEG): "bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78_test-pods(2763ab6d-3e82-11e9-989e-42010a800133)", event: &pleg.PodLifecycleEvent{ID:"2763ab6d-3e82-11e9-989e-42010a800133", Type:"ContainerDied", Data:"edeb1523a687b7e5a80ca831b9760a1a6328be767b35a3197d6919752681fc2b"}
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:26Z" level=info msg="StopPodSandbox for "b1fee6999e358a3e062e04f40ff5be40ac9d89b96146ece5e0ff8c541b485a4e""
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:26Z" level=info msg="Container to stop "edeb1523a687b7e5a80ca831b9760a1a6328be767b35a3197d6919752681fc2b" is not running, current state "CONTAINER_EXITED""
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 systemd-networkd[350]: veth57837a20: Lost carrier
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 systemd-timesyncd[314]: Network configuration changed, trying to establish connection.
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 systemd-timesyncd[314]: Synchronized to time server 169.254.169.254:123 (169.254.169.254).
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 systemd-timesyncd[314]: Network configuration changed, trying to establish connection.
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 systemd-timesyncd[314]: Synchronized to time server 169.254.169.254:123 (169.254.169.254).
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:26Z" level=info msg="TearDown network for sandbox "b1fee6999e358a3e062e04f40ff5be40ac9d89b96146ece5e0ff8c541b485a4e" successfully"
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:26.228781 1260 reconciler.go:181] operationExecutor.UnmountVolume started for volume "ssh" (UniqueName: "kubernetes.io/secret/2763ab6d-3e82-11e9-989e-42010a800133-ssh") pod "2763ab6d-3e82-11e9-989e-42010a800133" (UID: "2763ab6d-3e82-11e9-989e-42010a800133")
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:26.228850 1260 reconciler.go:181] operationExecutor.UnmountVolume started for volume "service" (UniqueName: "kubernetes.io/secret/2763ab6d-3e82-11e9-989e-42010a800133-service") pod "2763ab6d-3e82-11e9-989e-42010a800133" (UID: "2763ab6d-3e82-11e9-989e-42010a800133")
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:26.246068 1260 operation_generator.go:688] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/2763ab6d-3e82-11e9-989e-42010a800133-ssh" (OuterVolumeSpecName: "ssh") pod "2763ab6d-3e82-11e9-989e-42010a800133" (UID: "2763ab6d-3e82-11e9-989e-42010a800133"). InnerVolumeSpecName "ssh". PluginName "kubernetes.io/secret", VolumeGidValue ""
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:26.247230 1260 operation_generator.go:688] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/2763ab6d-3e82-11e9-989e-42010a800133-service" (OuterVolumeSpecName: "service") pod "2763ab6d-3e82-11e9-989e-42010a800133" (UID: "2763ab6d-3e82-11e9-989e-42010a800133"). InnerVolumeSpecName "service". PluginName "kubernetes.io/secret", VolumeGidValue ""
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:26.329170 1260 reconciler.go:301] Volume detached for volume "service" (UniqueName: "kubernetes.io/secret/2763ab6d-3e82-11e9-989e-42010a800133-service") on node "gke-prow-containerd-pool-99179761-9sg5" DevicePath ""
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:26.329219 1260 reconciler.go:301] Volume detached for volume "ssh" (UniqueName: "kubernetes.io/secret/2763ab6d-3e82-11e9-989e-42010a800133-ssh") on node "gke-prow-containerd-pool-99179761-9sg5" DevicePath ""
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:26Z" level=info msg="shim reaped" id=b1fee6999e358a3e062e04f40ff5be40ac9d89b96146ece5e0ff8c541b485a4e
Mar 04 18:02:26 gke-prow-containerd-pool-99179761-9sg5 containerd[1141]: time="2019-03-04T18:02:26Z" level=info msg="StopPodSandbox for "b1fee6999e358a3e062e04f40ff5be40ac9d89b96146ece5e0ff8c541b485a4e" returns successfully"
Mar 04 18:02:27 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:27.075432 1260 kubelet.go:1883] SyncLoop (PLEG): "bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78_test-pods(2763ab6d-3e82-11e9-989e-42010a800133)", event: &pleg.PodLifecycleEvent{ID:"2763ab6d-3e82-11e9-989e-42010a800133", Type:"ContainerDied", Data:"b1fee6999e358a3e062e04f40ff5be40ac9d89b96146ece5e0ff8c541b485a4e"}
Mar 04 18:02:27 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: W0304 18:02:27.075574 1260 pod_container_deletor.go:75] Container "b1fee6999e358a3e062e04f40ff5be40ac9d89b96146ece5e0ff8c541b485a4e" not found in pod's containers
Mar 04 18:02:33 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:33.622347 1260 kubelet.go:1854] SyncLoop (DELETE, "api"): "bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78_test-pods(2763ab6d-3e82-11e9-989e-42010a800133)"
Mar 04 18:02:33 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:33.623922 1260 kubelet.go:1848] SyncLoop (REMOVE, "api"): "bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78_test-pods(2763ab6d-3e82-11e9-989e-42010a800133)"
Mar 04 18:02:33 gke-prow-containerd-pool-99179761-9sg5 kubelet[1260]: I0304 18:02:33.624053 1260 kubelet.go:2042] Failed to delete pod "bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78_test-pods(2763ab6d-3e82-11e9-989e-42010a800133)", err: pod not found
If the pod exited, are you seeing sinker clean it up? If not, do you have audit logging on? Do you know what deleted the Pod?
I don't see the clean up from sinker. How do I check the audit log?
(we don't have access to the master apiserver though...)
Then that's a no-go. Interesting that sinker did not delete the Pod.
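For context on why sinker's silence matters: sinker's cleanup rule, roughly, only touches pods whose ProwJob has already completed and has aged past a maximum pod age, so it should never be the thing deleting a pod for a still-pending job. A minimal sketch of that rule, with assumed field names (not the real sinker types):

```go
package sketch

import "time"

// Assumed, simplified view of a test pod for this sketch.
type podInfo struct {
	JobComplete bool      // the owning ProwJob reached a terminal state
	FinishedAt  time.Time // when the job finished
}

// shouldSinkerClean sketches the rule: only pods of completed jobs that have
// been finished for longer than maxPodAge are candidates for deletion, which
// is why a missing pod for a pending ProwJob points at some other deleter.
func shouldSinkerClean(p podInfo, now time.Time, maxPodAge time.Duration) bool {
	return p.JobComplete && now.Sub(p.FinishedAt) > maxPodAge
}
```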
The prow bump job (https://testgrid.k8s.io/sig-testing-prow#autobump-prow) also uses cron but didn't have this issue - so I'd narrow it down to these large scalability jobs... :thinking: really confused here...
@stevekuznetsov
2019-03-04 10:02:33.623 PST
k8s.io
delete
test-pods:bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78
system:serviceaccount:kube-system:pod-garbage-collector
{"@type":"type.googleapis.com/google.cloud.audit.AuditLog","authenticationInfo":{"principalEmail":"system:serviceaccount:kube-system:pod-garbage-collector"},"authorizationInfo":[{"granted":true,"permission":"io.k8s.core.v1.pods.delete","resource":"core/v1/namespaces/test-pods/pods/bc8b0b06-3e53-11e9…
2019-03-04 10:02:54.937 PST
k8s.io
create
test-pods:bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78
client
{"@type":"type.googleapis.com/google.cloud.audit.AuditLog","authenticationInfo":{"principalEmail":"client"},"authorizationInfo":[{"granted":true,"permission":"io.k8s.core.v1.pods.create","resource":"core/v1/namespaces/test-pods/pods/bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78"}],"methodName":"io.k8s.core.v…
2019-03-04 10:02:54.942 PST
k8s.io
create
test-pods:bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78:bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78
system:kube-scheduler
{"@type":"type.googleapis.com/google.cloud.audit.AuditLog","authenticationInfo":{"principalEmail":"system:kube-scheduler"},"authorizationInfo":[{"granted":true,"permission":"io.k8s.core.v1.pods.binding.create","resource":"core/v1/namespaces/test-pods/pods/bc8b0b06-3e53-11e9-9c9a-0a580a6c0e78/binding…
hummmmmm...
GKE has TERMINATED_POD_GC_THRESHOLD set at 1000 instead of the default 12500 (https://github.com/kubernetes/kubernetes/blob/1b28775db1290a772967d192a19a8ec447053cd5/pkg/controller/apis/config/v1alpha1/defaults.go#L215) (thanks to @Random-Liu for helping locate the issue)
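To make the failure mode concrete, a simplified sketch of that thresholding (not the real kube-controller-manager code): once the number of terminated pods exceeds the threshold, the pod GC deletes the oldest terminated pods, regardless of whether plank still references them.

```go
package sketch

import "sort"

// TerminatedPod is an illustrative stand-in for a pod in a terminal phase.
type TerminatedPod struct {
	Name       string
	FinishedAt int64 // unix seconds
}

// podsToGC returns the oldest terminated pods that would be deleted to bring
// the count back under the threshold (1000 on this GKE cluster per the
// comment above, 12500 by default).
func podsToGC(terminated []TerminatedPod, threshold int) []TerminatedPod {
	excess := len(terminated) - threshold
	if excess <= 0 {
		return nil
	}
	// Delete the oldest terminated pods first until we are under the threshold.
	sort.Slice(terminated, func(i, j int) bool {
		return terminated[i].FinishedAt < terminated[j].FinishedAt
	})
	return terminated[:excess]
}
```

With the threshold at 1000 and ~1500 pods sitting in test-pods (many of them Error/Completed, per the kubectl counts below), this GC kicks in well before sinker would, which lines up with the audit log entry showing pod-garbage-collector as the deleter.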
time to have multiple build clusters? :sob: :disappointed: :cry: :joy:
Also we have a ton of ImagePullBackOff pods:
senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl get po -n=test-pods | wc -l
1487
senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl get po -n=test-pods | grep ImagePullBackOff | wc -l
232
even better we have things like:
bb6ea232-1adc-11e9-8f0d-0a580a6c02f3 1/1 Running 0 45d
0272c8e4-3d0c-11e9-9c9a-0a580a6c0e78 1/2 Error 0 2d
e861282b-3c65-11e9-9c9a-0a580a6c0e78 0/1 ImagePullBackOff 0 3d
(proposed a discussion topic in tomorrow's sig-testing for this)
/sig scalability
senlu@senlu:~/work/src/k8s.io/test-infra/prow$ kubectl get po -n=test-pods | wc -l
764
With the last few fixes it should be fine for now - we still need to figure out how we bump/work around that limit, since we are (inevitably) going to have more and more jobs.
I think it's happening again
Last time it also started around Friday. Is it possible that we run more prow jobs over the weekend?
I'm almost 100% sure it's happening again.
@krzyzacy, could you check?
/priority critical-urgent
/shrug... plausibly code freeze is coming and we were having heavier testing loads yesterday... let me verify that's still the same issue
@krzyzacy y'all might want a pager to go off when you get close ;)
/assign
/milestone v1.15 Is this still a concern for us?
The root issue is still there, I suspect we'll be hitting this again when testing volume increases..
You can limit the maximum concurrency through plank to 1000, right?
/remove-priority critical-urgent This isn't at drop-everything priority, but we may hit this again this quarter
@cjwagner can we set the plank concurrency and make sure you don't hit this again?
That's global and not per cluster so it's a bit overkill but would at least stop you from getting evicted
@stevekuznetsov what happens when plank hits the pod limit? Stop creating new pods?
Yep, the controller will just not trigger new jobs and they will stay in Pending until there is room.
I can smell snowballing :-p
I'm fine with enabling it if others are. This would prevent us from hitting the concurrency limit, but it's not the perfect tool for the job. It could cause snowballing if our actual concurrency level is significantly higher due to executing on multiple build clusters, and we'd need to set it to 1000 (not something higher) if we want to guarantee that no individual build cluster exceeds 1000 pods.
We should just implement a per-cluster throttle then instead of a global one. Of course you'd get snowballing but you can't really help that. At least the snowball failure mode is very soft and you'd just run the jobs later
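A hypothetical sketch of what such a per-cluster throttle could look like; nothing like this existed in plank at the time, and all names below are invented for illustration:

```go
package sketch

// pendingByCluster maps a build cluster alias to the number of test pods
// currently running there (hypothetical bookkeeping, not a real plank field).
type pendingByCluster map[string]int

// canStartPod gates new pod creation on the target cluster's own count rather
// than a single global number, so each build cluster stays under its own
// pod-GC comfort zone (e.g. 1000 on GKE per the discussion above) without one
// busy cluster starving the others.
func canStartPod(pending pendingByCluster, cluster string, perClusterLimit int) bool {
	return pending[cluster] < perClusterLimit
}
```

The soft failure mode is the same as the global limit: jobs simply wait until their cluster has room.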
Folks, looks like this is still happening and heavily affecting our scalability CI tests
Currently we're facing a few major regressions (e.g. https://github.com/kubernetes/kubernetes/issues/75833, https://github.com/kubernetes/kubernetes/issues/76579) and not having working CI tests is really slowing us down. There is a big risk that if we don't debug the regressions it will block the 1.15 kubernetes release.
I understand that the issue may be hard to fix properly, but we'd really appreciate it if you could come up with some kind of temporary workaround to unblock us. Is there anything you could do?
We can run all scalability jobs on a separate build cluster as a bandaid for now, WDYT @cjwagner ?
@mm4tt @wojtek-t thoughts?
We can run all scalability jobs on a separate build cluster as a bandaid for now, WDYT @cjwagner ?
I think that's all we can do for now. Switching plank to use the informer framework might let it win the race with the pod GC sometimes, but it may not help at all and even if it did there would still be a race.
Adding another build cluster and migrating the necessary secrets is probably the fastest work around.
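For reference, the generic client-go informer pattern being alluded to; this is a standalone sketch (not plank's code) showing how pod deletions can be observed as events rather than discovered on the next resync:

```go
package sketch

import (
	"fmt"
	"time"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

// watchPodDeletions reacts to pod delete events as they happen instead of only
// noticing a missing pod during a later sync loop.
func watchPodDeletions(stop <-chan struct{}) error {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return err
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*v1.Pod); ok {
				// A deleted test pod could be flagged or requeued here instead
				// of being silently recreated on the next sync.
				fmt.Printf("pod deleted: %s/%s\n", pod.Namespace, pod.Name)
			}
		},
	})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	return nil
}
```

Even with event-driven visibility there is still a race against the pod GC, as noted above, which is why the separate build cluster is the more reliable short-term fix.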
Our scalability tests started behaving strangely over the weekend. The prow jobs running tests are scheduled when they shouldn't.
Example: name: ci-kubernetes-e2e-gke-large-performance-regional Config:
Test should be run once a week, but it has been scheduled 4 times over the last weekend.
There are other jobs behaving similarly to this one, i.e. they are scheduled and run when they shouldn't be.
This is wreaking havoc in our scalability tests. Due to quota issues, the tests share the same GCP projects. Now, because they're run when they shouldn't be, they've started interfering with each other, causing multiple tests to fail.