kubernetes / kubernetes

Production-Grade Container Scheduling and Management
https://kubernetes.io
Apache License 2.0

[Failing test][sig-scheduling][sig-network] ci-kubernetes-gce-conformance-latest-kubetest2 #105073

Closed: leonardpahlke closed this issue 3 years ago

leonardpahlke commented 3 years ago

Which jobs are failing:

Conformance - GCE - master - kubetest2

Which test(s) are failing:

Since when has it been failing:

15.09.2021 04:05 PDT

Testgrid link:

TestGrid link; failed job link (one of them)

Reason for failure:

Kubernetes e2e suite: BeforeSuite:

_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:74
Sep 16 14:48:48.290: Error waiting for all pods to be running and ready: 1 / 31 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
POD                      NODE PHASE   GRACE CONDITIONS
konnectivity-agent-f4ghk      Pending       [{Type:PodScheduled Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-09-16 14:38:48 +0000 UTC Reason:Unschedulable Message:0/4 nodes are available: 1 Insufficient cpu, 3 node(s) didn't match Pod's node affinity/selector.}]

_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:77

kubetest2: Test

exit status 255

Build log

load pubkey "/root/.ssh/google_compute_engine": invalid format
scp: /var/log/cluster-autoscaler.log*: No such file or directory
scp: /var/log/fluentd.log*: No such file or directory
scp: /var/log/kubelet.cov*: No such file or directory
scp: /var/log/startupscript.log*: No such file or directory
ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
Dumping logs from nodes locally to '/logs/artifacts/a50e119b-16f7-11ec-a0c8-aafcb65c973d/cluster-logs'
Detecting nodes in the cluster
Changing logfiles to be world-readable for download
Changing logfiles to be world-readable for download
Changing logfiles to be world-readable for download
... skipped 10 lines ...
load pubkey "/root/.ssh/google_compute_engine": invalid format
scp: /var/log/fluentd.log*: No such file or directory
scp: /var/log/node-problem-detector.log*: No such file or directory
scp: /var/log/kubelet.cov*: No such file or directory
scp: /var/log/startupscript.log*: No such file or directory
ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
scp: /var/log/fluentd.log*: No such file or directory
scp: /var/log/node-problem-detector.log*: No such file or directory
scp: /var/log/kubelet.cov*: No such file or directory
scp: /var/log/startupscript.log*: No such file or directory
ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
load pubkey "/root/.ssh/google_compute_engine": invalid format
scp: /var/log/fluentd.log*: No such file or directory
scp: /var/log/node-problem-detector.log*: No such file or directory
scp: /var/log/kubelet.cov*: No such file or directory
scp: /var/log/startupscript.log*: No such file or directory
ERROR: (gcloud.compute.scp) [/usr/bin/scp] exited with return code [1].
INSTANCE_GROUPS=kt2-a50e119b-16f7-minion-group
NODE_NAMES=kt2-a50e119b-16f7-minion-group-31c1 kt2-a50e119b-16f7-minion-group-fkgj kt2-a50e119b-16f7-minion-group-mhdq
Failures for kt2-a50e119b-16f7-minion-group (if any):
I0916 14:38:40.298102    2911 dumplogs.go:121] About to run: [/logs/artifacts/a50e119b-16f7-11ec-a0c8-aafcb65c973d/kubectl cluster-info dump]
I0916 14:38:40.298143    2911 local.go:42] ⚙️ /logs/artifacts/a50e119b-16f7-11ec-a0c8-aafcb65c973d/kubectl cluster-info dump
... skipped 1966 lines ...
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:13 +0000 UTC - event for kube-dns-autoscaler-6494c4c647-m2cp7: {default-scheduler } Scheduled: Successfully assigned kube-system/kube-dns-autoscaler-6494c4c647-m2cp7 to kt2-a50e119b-16f7-minion-group-fkgj
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:13 +0000 UTC - event for l7-default-backend-79858d8f86-8gwss: {default-scheduler } Scheduled: Successfully assigned kube-system/l7-default-backend-79858d8f86-8gwss to kt2-a50e119b-16f7-minion-group-fkgj
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:13 +0000 UTC - event for metrics-server-v0.5.0-74c9d6fd5b-kv5kl: {default-scheduler } Scheduled: Successfully assigned kube-system/metrics-server-v0.5.0-74c9d6fd5b-kv5kl to kt2-a50e119b-16f7-minion-group-fkgj
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:13 +0000 UTC - event for volume-snapshot-controller-0: {default-scheduler } Scheduled: Successfully assigned kube-system/volume-snapshot-controller-0 to kt2-a50e119b-16f7-minion-group-fkgj
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:14 +0000 UTC - event for kube-dns-autoscaler-6494c4c647-m2cp7: {kubelet kt2-a50e119b-16f7-minion-group-fkgj} Pulling: Pulling image "k8s.gcr.io/cpa/cluster-proportional-autoscaler:1.8.4"
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:14 +0000 UTC - event for metrics-server-v0.5.0-74c9d6fd5b-kv5kl: {kubelet kt2-a50e119b-16f7-minion-group-fkgj} FailedMount: MountVolume.SetUp failed for volume "metrics-server-config-volume" : failed to sync configmap cache: timed out waiting for the condition
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:15 +0000 UTC - event for coredns-755cd654d4-krbc2: {kubelet kt2-a50e119b-16f7-minion-group-fkgj} Pulling: Pulling image "k8s.gcr.io/coredns/coredns:v1.8.0"
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:15 +0000 UTC - event for volume-snapshot-controller-0: {kubelet kt2-a50e119b-16f7-minion-group-fkgj} Pulling: Pulling image "k8s.gcr.io/sig-storage/snapshot-controller:v4.0.0"
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:16 +0000 UTC - event for metrics-server-v0.5.0-74c9d6fd5b-kv5kl: {kubelet kt2-a50e119b-16f7-minion-group-fkgj} Pulling: Pulling image "k8s.gcr.io/metrics-server/metrics-server:v0.5.0"
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:17 +0000 UTC - event for event-exporter-v0.3.4-6c59d5574d-fqvn6: {kubelet kt2-a50e119b-16f7-minion-group-fkgj} Pulling: Pulling image "gke.gcr.io/event-exporter:v0.3.4-gke.0"
Sep 16 14:48:43.754: INFO: At 2021-09-16 14:37:17 +0000 UTC - event for fluentd-gcp-scaler-55f8dfc997-plb5j: {kubelet kt2-a50e119b-16f7-minion-group-fkgj} Pulling: Pulling image "k8s.gcr.io/fluentd-gcp-scaler:0.5.2"
... skipped 287 lines ...
Sep 16 14:48:48.106: INFO: Running kubectl logs on non-ready containers in kube-system
Sep 16 14:48:48.290: INFO: Logs of kube-system/konnectivity-agent-f4ghk:konnectivity-agent on node 
Sep 16 14:48:48.290: INFO:  : STARTLOG
ENDLOG for container kube-system:konnectivity-agent-f4ghk:konnectivity-agent
Sep 16 14:48:48.290: FAIL: Error waiting for all pods to be running and ready: 1 / 31 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
POD                      NODE PHASE   GRACE CONDITIONS
konnectivity-agent-f4ghk      Pending       [{Type:PodScheduled Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-09-16 14:38:48 +0000 UTC Reason:Unschedulable Message:0/4 nodes are available: 1 Insufficient cpu, 3 node(s) didn't match Pod's node affinity/selector.}]
Full Stack Trace
... skipped 9 lines ...
    _output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e_test.go:136 +0x19
testing.tRunner(0xc000603ba0, 0x6d2d340)
    /usr/local/go/src/testing/testing.go:1259 +0x102
created by testing.(*T).Run
    /usr/local/go/src/testing/testing.go:1306 +0x35a
Failure [604.684 seconds]
[BeforeSuite] BeforeSuite 
_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:74
  Sep 16 14:48:48.290: Error waiting for all pods to be running and ready: 1 / 31 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
  POD                      NODE PHASE   GRACE CONDITIONS
... skipped 16 lines ...
JUnit report was created: /logs/artifacts/a50e119b-16f7-11ec-a0c8-aafcb65c973d/junit_01.xml
{"msg":"Test Suite completed","total":346,"completed":0,"skipped":0,"failed":0}
Ran 346 of 0 Specs in 604.686 seconds
FAIL! -- 0 Passed | 346 Failed | 0 Pending | 0 Skipped
--- FAIL: TestE2E (606.72s)
FAIL
Ginkgo ran 1 suite in 10m6.81509859s
Test Suite Failed
F0916 14:48:48.317723   96964 ginkgo.go:205] failed to run ginkgo tester: exit status 1
I0916 14:48:48.320478    2911 down.go:29] GCE deployer starting Down()
... skipped 44 lines ...
Property "contexts.k8s-infra-e2e-boskos-054_kt2-a50e119b-16f7" unset.
Cleared config for k8s-infra-e2e-boskos-054_kt2-a50e119b-16f7 from /logs/artifacts/a50e119b-16f7-11ec-a0c8-aafcb65c973d/kubetest2-kubeconfig
Done
I0916 14:54:54.490491    2911 down.go:53] about to delete nodeport firewall rule
I0916 14:54:54.490593    2911 local.go:42] ⚙️ gcloud compute firewall-rules delete --project k8s-infra-e2e-boskos-054 kt2-a50e119b-16f7-minion-nodeports
ERROR: (gcloud.compute.firewall-rules.delete) Could not fetch resource:
 - The resource 'projects/k8s-infra-e2e-boskos-054/global/firewalls/kt2-a50e119b-16f7-minion-nodeports' was not found
W0916 14:54:55.434319    2911 firewall.go:62] failed to delete nodeports firewall rules: might be deleted already?
I0916 14:54:55.434359    2911 down.go:59] releasing boskos project
I0916 14:54:55.457842    2911 boskos.go:83] Boskos heartbeat func received signal to close

/sig scheduling /sig network

k8s-ci-robot commented 3 years ago

@leonardpahlke: This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
aojea commented 3 years ago

@amwat @cheftako I didn't dig much into these failures, but all 3 failures I saw are because the konnectivity-agent fails to be scheduled due to CPU constraints. Is it the environment, the konnectivity agent, or both?

_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:74
Sep 15 23:47:02.903: Error waiting for all pods to be running and ready: 1 / 31 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
POD                      NODE PHASE   GRACE CONDITIONS
konnectivity-agent-l5xdt      Pending       [{Type:PodScheduled Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-09-15 23:36:24 +0000 UTC Reason:Unschedulable Message:0/4 nodes are available: 1 Insufficient cpu, 3 node(s) didn't match Pod's node affinity/selector.}]

_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:77
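For context on the "0/4 nodes are available" line, the scheduler tries the pod against every node, records why each infeasible node was rejected, and aggregates the counts into that message. A minimal sketch of the aggregation (illustrative only, not the real kube-scheduler code; the node data and field names are made up to mirror this cluster's 1 master + 3 minions):

```python
# Illustrative sketch of how the scheduler aggregates per-node filter
# failures into an "Unschedulable" message. NOT the real kube-scheduler
# code; node data and field names are invented for this example.

AFFINITY = "node(s) didn't match Pod's node affinity/selector"
CPU = "Insufficient cpu"

def filter_nodes(pod, nodes):
    """Return (feasible_nodes, failure_reason_counts)."""
    feasible, reasons = [], {}
    for node in nodes:
        # DaemonSet pods carry node affinity pinning them to one node.
        if not (node["labels"].items() >= pod["node_selector"].items()):
            reasons[AFFINITY] = reasons.get(AFFINITY, 0) + 1
        elif node["free_cpu_m"] < pod["cpu_request_m"]:
            reasons[CPU] = reasons.get(CPU, 0) + 1
        else:
            feasible.append(node)
    return feasible, reasons

pod = {"node_selector": {"role": "master"}, "cpu_request_m": 100}
nodes = [
    {"labels": {"role": "master"}, "free_cpu_m": 50},    # matches, but no CPU left
    {"labels": {"role": "minion"}, "free_cpu_m": 2000},  # selector mismatch
    {"labels": {"role": "minion"}, "free_cpu_m": 2000},
    {"labels": {"role": "minion"}, "free_cpu_m": 2000},
]
feasible, reasons = filter_nodes(pod, nodes)
msg = "%d/%d nodes are available: %s." % (
    len(feasible), len(nodes),
    ", ".join("%d %s" % (n, r) for r, n in sorted(reasons.items())),
)
print(msg)
# 0/4 nodes are available: 1 Insufficient cpu, 3 node(s) didn't match Pod's node affinity/selector.
```

The one node that matches the pod's affinity (the master) fails on CPU, which is exactly the failure pattern in the logs above.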
aojea commented 3 years ago

/remove sig-network /remove-sig network

bridgetkromhout commented 3 years ago

/remove-sig network

amwat commented 3 years ago

I'm not sure if the environment changed recently. Looking at the commit range when it started to fail:

https://github.com/kubernetes/kubernetes/compare/4c014e5ca...1c1d2e4ed https://github.com/kubernetes/kubernetes/pull/102592 seems suspect

cc @pacoxu @cheftako

pacoxu commented 3 years ago

Sorry for that. Let me check.

#102592 adds a toleration for konnectivity-agent, so it may be scheduled onto a node with a NoExecute taint.

_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:74
Sep 15 23:47:02.903: Error waiting for all pods to be running and ready: 1 / 31 pods in namespace "kube-system" are NOT in RUNNING and READY state in 10m0s
POD                      NODE PHASE   GRACE CONDITIONS
konnectivity-agent-l5xdt      Pending       [{Type:PodScheduled Status:False LastProbeTime:0001-01-01 00:00:00 +0000 UTC LastTransitionTime:2021-09-15 23:36:24 +0000 UTC Reason:Unschedulable Message:0/4 nodes are available: 1 Insufficient cpu, 3 node(s) didn't match Pod's node affinity/selector.}]

_output/dockerized/go/src/k8s.io/kubernetes/test/e2e/e2e.go:77

The master node has NoSchedule taints:

                "taints": [
                    {
                        "key": "node-role.kubernetes.io/master",
                        "effect": "NoSchedule"
                    },
                    {
                        "key": "node.kubernetes.io/unschedulable",
                        "effect": "NoSchedule"
                    }
                ]

Should we remove the toleration for NoSchedule?
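On that question: a node with a NoSchedule or NoExecute taint repels any pod that does not tolerate that effect, so a NoSchedule toleration is precisely what turns the tainted master into a scheduling candidate. A simplified sketch of the matching (illustrative only; it ignores operator/value semantics and is not the actual k8s.io helper code):

```python
# Simplified taint/toleration matching: a pod can land on a node only if
# every NoSchedule/NoExecute taint on it is tolerated. Not the real
# Kubernetes helper code; operator/value matching is omitted.

def tolerates(toleration, taint):
    key_ok = toleration.get("key") in (None, taint["key"])           # no key => any key
    effect_ok = toleration.get("effect") in (None, taint["effect"])  # no effect => any effect
    return key_ok and effect_ok

def schedulable(taints, tolerations):
    return all(
        any(tolerates(tol, taint) for tol in tolerations)
        for taint in taints
        if taint["effect"] in ("NoSchedule", "NoExecute")
    )

# The master's taints, from the dump above:
master_taints = [
    {"key": "node-role.kubernetes.io/master", "effect": "NoSchedule"},
    {"key": "node.kubernetes.io/unschedulable", "effect": "NoSchedule"},
]

# Tolerating only NoExecute leaves the master repelling the pod:
no_execute_only = [{"operator": "Exists", "effect": "NoExecute"}]
print(schedulable(master_taints, no_execute_only))  # False

# Adding a NoSchedule toleration makes the tainted master a target:
with_no_schedule = no_execute_only + [{"operator": "Exists", "effect": "NoSchedule"}]
print(schedulable(master_taints, with_no_schedule))  # True
```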

pacoxu commented 3 years ago

#102592 fixes #102582, which only asked for a NoExecute toleration.

I followed the kube-proxy and node-local-dns DaemonSet toleration settings and added both NoExecute and NoSchedule.

I opened #105084 to remove the NoSchedule toleration.
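For illustration only (a hypothetical fragment, not quoted from the actual konnectivity-agent manifest), the tolerations described above would have this shape, in the same style as the taints dump earlier:

```json
"tolerations": [
    {"operator": "Exists", "effect": "NoExecute"},
    {"operator": "Exists", "effect": "NoSchedule"}
]
```

Removing the NoSchedule entry lets the agent respect the master's NoSchedule taints again while keeping the NoExecute toleration that #102582 asked for.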

pacoxu commented 3 years ago

The CI is green now.