Found the error log of one example failure, in the initupload container:
{"component":"initupload","error":"failed to upload to blob storage: failed to upload to blob storage: encountered errors during upload: [writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout upload error: writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]","file":"prow/cmd/initupload/main.go:45","func":"main.main","level":"fatal","msg":"Failed to initialize job","severity":"fatal","time":"2020-06-28T06:47:28Z"}
For this test: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/4071/kubeflow-pipeline-backend-test/1277128159166205952
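For reference, logs like the one above can be pulled straight from the Prow job pod; a minimal sketch, assuming a hypothetical pod name and the test-pods namespace:

```sh
# initupload runs as an init container of the Prow job pod
kubectl -n test-pods logs <prow-job-pod> -c initupload
# pod events often explain the 255 exit code as well
kubectl -n test-pods describe pod <prow-job-pod>
```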
Found another failure where the test container succeeded, but the sidecar container failed to upload status with the same kind of error message:
{ "func": "main.main", "file": "prow/cmd/sidecar/main.go:43", "level": "error", "component": "sidecar", "error": "failed to upload to GCS: failed to upload to blob storage: encountered errors during upload: [writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]", "msg": "Failed to report job status" }
Taking a deeper look at all the test pods, I found that all of the pods that failed with exit code 255, like the two examples above, ran on the same node. I suspect the node became unhealthy.
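To check that correlation, something like the following works; a minimal sketch, with the node name as a placeholder:

```sh
# list all test pods scheduled on the suspect node
kubectl -n test-pods get pods -o wide --field-selector spec.nodeName=<node>
```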
/cc @jlewi
I got my tests in https://github.com/kubeflow/pipelines/pull/4071 passing after cordoning the suspected unhealthy node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb.
I deleted the node and brought up a new one; let's see whether that recovers all the tests.
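A sketch of the equivalent commands; the zone is an assumption:

```sh
# keep new pods off the node, evict running ones, then remove it
kubectl cordon gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb
kubectl drain gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb --ignore-daemonsets --delete-local-data
kubectl delete node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb
# on GKE, deleting the backing VM lets the instance group create a fresh one
gcloud compute instances delete gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb --zone <zone>
```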
Presubmit tests remain functional today, so it looks like the fix I made yesterday was correct.
I'll try to dig into the details of how this can be prevented if it happens again. /cc @jlewi in case you know more about this.
I'm seeing this again: I found one node that is NotReady, so I deleted it again. Here's its status:
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal NodeNotReady 58m (x4 over 157m) kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeNotReady
Normal NodeReady 58m (x4 over 157m) kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeReady
Warning TaskHung 51m kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 INFO: task dockerd:1422 blocked for more than 300 seconds.
Normal Starting 47m kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Starting kubelet.
Normal NodeHasSufficientMemory 47m (x2 over 47m) kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 47m (x2 over 47m) kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 47m (x2 over 47m) kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeHasSufficientPID
Warning Rebooted 47m kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 has been rebooted, boot id: ee4e1ff0-b3f4-4d63-8bfb-fba676800cd4
Normal NodeAllocatableEnforced 47m kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Updated Node Allocatable limit across pods
Normal Starting 47m kube-proxy, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Starting kube-proxy.
Normal NodeNotReady 44m (x2 over 47m) kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeNotReady
Normal NodeReady 44m (x2 over 47m) kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeReady
Warning OOMKilling 29m kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 Memory cgroup out of memory: Kill process 4012 (k8s_metadata) score 30 or sacrifice child
Killed process 4012 (k8s_metadata) total-vm:1067420kB, anon-rss:24488kB, file-rss:0kB, shmem-rss:0kB
Warning TaskHung 2m19s kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 INFO: task dockerd:1421 blocked for more than 300 seconds.
Warning TaskHung 2m14s kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 INFO: task dockerd:1424 blocked for more than 300 seconds.
Warning TaskHung 2m9s kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 INFO: task dockerd:1700 blocked for more than 300 seconds.
Warning TaskHung 2m5s kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 INFO: task dockerd:1894 blocked for more than 300 seconds.
Warning TaskHung 2m1s kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 INFO: task dockerd:1962 blocked for more than 300 seconds.
Warning TaskHung 117s kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 INFO: task dockerd:2096 blocked for more than 300 seconds.
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
CorruptDockerOverlay2 False Mon, 03 Aug 2020 14:39:35 +0800 Mon, 03 Aug 2020 13:54:02 +0800 NoCorruptDockerOverlay2 docker overlay2 is functioning properly
FrequentUnregisterNetDevice False Mon, 03 Aug 2020 14:39:35 +0800 Mon, 03 Aug 2020 13:54:02 +0800 NoFrequentUnregisterNetDevice node is functioning properly
KernelDeadlock False Mon, 03 Aug 2020 14:39:35 +0800 Mon, 03 Aug 2020 13:54:02 +0800 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Mon, 03 Aug 2020 14:39:35 +0800 Mon, 03 Aug 2020 13:54:02 +0800 FilesystemIsNotReadOnly Filesystem is not read-only
FrequentKubeletRestart False Mon, 03 Aug 2020 14:39:35 +0800 Mon, 03 Aug 2020 13:54:02 +0800 NoFrequentKubeletRestart kubelet is functioning properly
FrequentDockerRestart False Mon, 03 Aug 2020 14:39:35 +0800 Mon, 03 Aug 2020 13:54:02 +0800 NoFrequentDockerRestart docker is functioning properly
FrequentContainerdRestart False Mon, 03 Aug 2020 14:39:35 +0800 Mon, 03 Aug 2020 13:54:02 +0800 NoFrequentContainerdRestart containerd is functioning properly
NetworkUnavailable False Sun, 02 Aug 2020 23:43:28 +0800 Sun, 02 Aug 2020 23:43:28 +0800 RouteCreated RouteController created a route
MemoryPressure Unknown Mon, 03 Aug 2020 14:39:27 +0800 Mon, 03 Aug 2020 14:40:46 +0800 NodeStatusUnknown Kubelet stopped posting node status.
DiskPressure Unknown Mon, 03 Aug 2020 14:39:27 +0800 Mon, 03 Aug 2020 14:40:46 +0800 NodeStatusUnknown Kubelet stopped posting node status.
PIDPressure Unknown Mon, 03 Aug 2020 14:39:27 +0800 Mon, 03 Aug 2020 14:40:46 +0800 NodeStatusUnknown Kubelet stopped posting node status.
Ready Unknown Mon, 03 Aug 2020 14:39:27 +0800 Mon, 03 Aug 2020 14:40:46 +0800 NodeStatusUnknown Kubelet stopped posting node status.
It looks like the kubelet failed, so the node stopped posting status and the checks above went Unknown.
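A quick way to catch a node in this state before tests land on it; a minimal one-liner sketch:

```sh
# print nodes whose Ready condition is anything other than True
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' | grep -v 'True$'
```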
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
/lifecycle frozen
This happened a few more times; so far it can be resolved by restarting the affected nodes.
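If restarting keeps working, it may be worth letting GKE do it automatically via node auto-repair; a sketch, where the pool and cluster names are inferred from the node names above and the zone is an assumption:

```sh
# recreate unhealthy nodes without manual intervention
gcloud container node-pools update temp-cpu-pool \
  --cluster kubeflow-testing --project kubeflow-ci \
  --zone <zone> --enable-autorepair
```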
Closing this issue. No activity for more than a year.
/close
@rimolive: Closing this issue.
Upstream issue reported: https://github.com/kubernetes/test-infra/issues/18067. However, it turned out that kubernetes/test-infra only maintains the central Prow instance and doesn't support Kubeflow's CI. Kubeflow test pods are launched in the "test-pods" namespace of the "kubeflow-testing" cluster in the "kubeflow-ci" project.
Reference: https://github.com/kubeflow/testing#debugging-failed-tests
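For anyone picking this up, a minimal sketch for getting at those test pods (the zone is an assumption):

```sh
gcloud container clusters get-credentials kubeflow-testing --project kubeflow-ci --zone <zone>
kubectl -n test-pods get pods -o wide
```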