kubeflow / pipelines

Machine Learning Pipelines for Kubeflow
https://www.kubeflow.org/docs/components/pipelines/
Apache License 2.0

[Testing] presubmit tests flaky without log #4095

Closed. Bobgy closed this issue 4 months ago.

Bobgy commented 4 years ago

Upstream issue reported: https://github.com/kubernetes/test-infra/issues/18067.

However, it turned out that kubernetes/test-infra only maintains the central Prow instance and does not support Kubeflow's CI. Kubeflow test pods are launched in the "kubeflow-testing" cluster of the "kubeflow-ci" project, in the "test-pods" namespace.

Reference: https://github.com/kubeflow/testing#debugging-failed-tests
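
For anyone hitting this later, a minimal sketch of getting at those CI pods (the cluster, project, and namespace names come from the comment above; the zone is an assumption that needs to be looked up, and access requires Kubeflow CI permissions):

```bash
# Fetch credentials for the CI cluster that runs the Prow test pods.
# ZONE is a placeholder; the cluster's actual zone is not recorded in this issue.
gcloud container clusters get-credentials kubeflow-testing \
  --project kubeflow-ci --zone "$ZONE"

# List the test pods that Prow launched.
kubectl get pods -n test-pods
```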

Bobgy commented 4 years ago

Found the error log of one example failure, in the initupload container:

{"component":"initupload","error":"failed to upload to blob storage: failed to upload to blob storage: encountered errors during upload: [writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout upload error: writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json\u0026prettyPrint=false\u0026projection=full\u0026uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]","file":"prow/cmd/initupload/main.go:45","func":"main.main","level":"fatal","msg":"Failed to initialize job","severity":"fatal","time":"2020-06-28T06:47:28Z"}

For this test: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/4071/kubeflow-pipeline-backend-test/1277128159166205952
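
As a rough sketch, the initupload log above can also be pulled straight from the failed test pod in the CI cluster (the pod name below is hypothetical; find the real one with `kubectl get pods -n test-pods`):

```bash
# Print the initupload (init) container's log for a failed Prow test pod.
kubectl logs <prow-test-pod> -n test-pods -c initupload
```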

Bobgy commented 4 years ago

Found another failure where the test container succeeded, but the sidecar container failed to upload the job status, with the same kind of error message.

{ "func": "main.main", "file": "prow/cmd/sidecar/main.go:43", "level": "error", "component": "sidecar", "error": "failed to upload to GCS: failed to upload to blob storage: encountered errors during upload: [writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout writer close error: Post https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?alt=json&prettyPrint=false&projection=full&uploadType=multipart: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: net/http: TLS handshake timeout]", "msg": "Failed to report job status" }

For this test: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/4071/kubeflow-pipeline-backend-test/1277132940752457729
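
Since prow.k8s.io/view/gcs/... URLs mirror GCS paths, the uploaded artifacts for a run can also be listed with gsutil (a sketch; because the sidecar's upload failed here, the artifacts for this particular run may be incomplete):

```bash
# The gs:// path mirrors the prow.k8s.io/view/gcs/... link above.
gsutil ls gs://kubernetes-jenkins/pr-logs/pull/kubeflow_pipelines/4071/kubeflow-pipeline-backend-test/1277132940752457729/
```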

Bobgy commented 4 years ago

Taking a deeper look at all the test pods, I found that every failed pod with exit code 255, like the two examples above, ran on the same node. I suspect the node became unhealthy.

Test project URL (requires Kubeflow CI admin permission)
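
A rough way to confirm that suspicion (assuming access to the test-pods namespace) is to list failed pods together with the node they ran on and their exit codes, and check whether they cluster on a single node:

```bash
# Failed test pods with their node and container exit codes.
kubectl get pods -n test-pods --field-selector=status.phase=Failed \
  -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,EXIT:.status.containerStatuses[*].state.terminated.exitCode'
```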

Bobgy commented 4 years ago

/cc @jlewi

Bobgy commented 4 years ago

I got my tests in https://github.com/kubeflow/pipelines/pull/4071 passing after cordoning the suspected unhealthy node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb.

I deleted the node and brought up a new one; let's see whether that recovers all the tests.
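
For reference, a sketch of that remediation with kubectl/gcloud (the node name comes from the comment above; the zone is an assumption):

```bash
# Stop new test pods from being scheduled on the suspect node.
kubectl cordon gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb

# Evict the pods still running there (DaemonSet pods are left in place).
kubectl drain gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb --ignore-daemonsets

# Deleting the VM lets the node pool's managed instance group bring up a replacement.
# ZONE is a placeholder; the node pool's zone is not recorded in this issue.
gcloud compute instances delete gke-kubeflow-testing-temp-cpu-pool-3b385d9f-vxsb \
  --project kubeflow-ci --zone "$ZONE"
```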

Bobgy commented 4 years ago

Presubmit tests remain functional today, so it looks like the fix I made yesterday was correct.

I'll try to dig into the details of how this can be prevented if it happens again. /cc @jlewi in case you know more about this.

Bobgy commented 4 years ago

I'm seeing this again. I found that one node is not ready, so I deleted it again.

Here's its status:

Events:
  Type     Reason                   Age                 From                                                              Message
  ----     ------                   ----                ----                                                              -------
  Normal   NodeNotReady             58m (x4 over 157m)  kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeNotReady
  Normal   NodeReady                58m (x4 over 157m)  kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeReady
  Warning  TaskHung                 51m                 kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7  INFO: task dockerd:1422 blocked for more than 300 seconds.
  Normal   Starting                 47m                 kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Starting kubelet.
  Normal   NodeHasSufficientMemory  47m (x2 over 47m)   kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeHasSufficientMemory
  Normal   NodeHasNoDiskPressure    47m (x2 over 47m)   kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeHasNoDiskPressure
  Normal   NodeHasSufficientPID     47m (x2 over 47m)   kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeHasSufficientPID
  Warning  Rebooted                 47m                 kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 has been rebooted, boot id: ee4e1ff0-b3f4-4d63-8bfb-fba676800cd4
  Normal   NodeAllocatableEnforced  47m                 kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Updated Node Allocatable limit across pods
  Normal   Starting                 47m                 kube-proxy, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7      Starting kube-proxy.
  Normal   NodeNotReady             44m (x2 over 47m)   kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeNotReady
  Normal   NodeReady                44m (x2 over 47m)   kubelet, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7         Node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7 status is now: NodeReady
  Warning  OOMKilling               29m                 kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7  Memory cgroup out of memory: Kill process 4012 (k8s_metadata) score 30 or sacrifice child
Killed process 4012 (k8s_metadata) total-vm:1067420kB, anon-rss:24488kB, file-rss:0kB, shmem-rss:0kB
  Warning  TaskHung  2m19s  kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7  INFO: task dockerd:1421 blocked for more than 300 seconds.
  Warning  TaskHung  2m14s  kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7  INFO: task dockerd:1424 blocked for more than 300 seconds.
  Warning  TaskHung  2m9s   kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7  INFO: task dockerd:1700 blocked for more than 300 seconds.
  Warning  TaskHung  2m5s   kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7  INFO: task dockerd:1894 blocked for more than 300 seconds.
  Warning  TaskHung  2m1s   kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7  INFO: task dockerd:1962 blocked for more than 300 seconds.
  Warning  TaskHung  117s   kernel-monitor, gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7  INFO: task dockerd:2096 blocked for more than 300 seconds.

Conditions:
  Type                          Status    LastHeartbeatTime                 LastTransitionTime                Reason                          Message
  ----                          ------    -----------------                 ------------------                ------                          -------
  CorruptDockerOverlay2         False     Mon, 03 Aug 2020 14:39:35 +0800   Mon, 03 Aug 2020 13:54:02 +0800   NoCorruptDockerOverlay2         docker overlay2 is functioning properly
  FrequentUnregisterNetDevice   False     Mon, 03 Aug 2020 14:39:35 +0800   Mon, 03 Aug 2020 13:54:02 +0800   NoFrequentUnregisterNetDevice   node is functioning properly
  KernelDeadlock                False     Mon, 03 Aug 2020 14:39:35 +0800   Mon, 03 Aug 2020 13:54:02 +0800   KernelHasNoDeadlock             kernel has no deadlock
  ReadonlyFilesystem            False     Mon, 03 Aug 2020 14:39:35 +0800   Mon, 03 Aug 2020 13:54:02 +0800   FilesystemIsNotReadOnly         Filesystem is not read-only
  FrequentKubeletRestart        False     Mon, 03 Aug 2020 14:39:35 +0800   Mon, 03 Aug 2020 13:54:02 +0800   NoFrequentKubeletRestart        kubelet is functioning properly
  FrequentDockerRestart         False     Mon, 03 Aug 2020 14:39:35 +0800   Mon, 03 Aug 2020 13:54:02 +0800   NoFrequentDockerRestart         docker is functioning properly
  FrequentContainerdRestart     False     Mon, 03 Aug 2020 14:39:35 +0800   Mon, 03 Aug 2020 13:54:02 +0800   NoFrequentContainerdRestart     containerd is functioning properly
  NetworkUnavailable            False     Sun, 02 Aug 2020 23:43:28 +0800   Sun, 02 Aug 2020 23:43:28 +0800   RouteCreated                    RouteController created a route
  MemoryPressure                Unknown   Mon, 03 Aug 2020 14:39:27 +0800   Mon, 03 Aug 2020 14:40:46 +0800   NodeStatusUnknown               Kubelet stopped posting node status.
  DiskPressure                  Unknown   Mon, 03 Aug 2020 14:39:27 +0800   Mon, 03 Aug 2020 14:40:46 +0800   NodeStatusUnknown               Kubelet stopped posting node status.
  PIDPressure                   Unknown   Mon, 03 Aug 2020 14:39:27 +0800   Mon, 03 Aug 2020 14:40:46 +0800   NodeStatusUnknown               Kubelet stopped posting node status.
  Ready                         Unknown   Mon, 03 Aug 2020 14:39:27 +0800   Mon, 03 Aug 2020 14:40:46 +0800   NodeStatusUnknown               Kubelet stopped posting node status.

It looks like the kubelet failed, so some of the node status checks have been failing.
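
A quick way to spot this on other nodes (a sketch, assuming kubectl access to the CI cluster) is to print each node's Ready condition and then describe the unhealthy one, as above:

```bash
# Each node with its Ready condition status; unhealthy nodes show "False" or "Unknown".
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}'

# Full event and condition detail for the suspect node.
kubectl describe node gke-kubeflow-testing-temp-cpu-pool-3b385d9f-6rg7
```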

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Bobgy commented 3 years ago

/lifecycle frozen

Bobgy commented 3 years ago

This has happened a few more times; so far it has been resolved by restarting the nodes.
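
As a sketch of that workaround (NODE and ZONE are placeholders): "restarting" a node here means resetting the underlying GCE VM, or deleting it so the node pool recreates it, as was done earlier in this thread.

```bash
# Reboot the VM backing the unhealthy node.
gcloud compute instances reset "$NODE" --project kubeflow-ci --zone "$ZONE"

# Or delete it and let the node pool's managed instance group replace it.
gcloud compute instances delete "$NODE" --project kubeflow-ci --zone "$ZONE"
```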

rimolive commented 4 months ago

Closing this issue. No activity for more than a year.

/close

google-oss-prow[bot] commented 4 months ago

@rimolive: Closing this issue.

In response to [this](https://github.com/kubeflow/pipelines/issues/4095#issuecomment-2031949726):

> Closing this issue. No activity for more than a year.
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.