nuclearcat opened this issue 2 years ago
The wait.py script is going to be removed; all the scheduling will be done as part of the new pipeline, and this will include LAVA too.
Probably this can be done as a tool to troubleshoot current k8s issues until it gets replaced? I believe it is possible to do with minor code modifications. And the collected information (failure cases) might be handy for future pipeline development.
I think we should be using standard Kubernetes tools to gather the logs.
On Tue, Aug 09, 2022 at 03:29:15AM -0700, Denys Fedoryshchenko wrote:
> I suggest we add to wait.py a periodic fetch of 'describe pod', as it might contain events if the pod failed to schedule or a node failed to scale.
We probably want to deploy fluentd or something similar to collect logs here.
> Also I suspect some kernel builds are failing because the build lands on a node that already has too long a build queue; maybe it's worth logging the current state of the specific context where the task is scheduled, e.g. how many tasks are running, pending, failed.
Kubernetes is generally very tolerant of long queues, however yes, it'd be good to take account of how backlogged things are to better take advantage of uneven cluster sizes - I do have some code to enumerate the clusters we can see, which could be used to pick the least busy one when submitting jobs.
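For illustration only, a rough sketch of what "pick the least busy cluster" could look like with the Kubernetes Python client; the kubeconfig contexts, the `default` namespace and the function names are assumptions, not existing KernelCI code:

```python
# Sketch: choose the kubeconfig context with the smallest scheduling backlog
# before submitting a build job. Names and namespace are placeholders.
from kubernetes import client, config

def pending_pod_count(context_name, namespace="default"):
    """Count pods still waiting to be scheduled in one cluster."""
    api = client.CoreV1Api(
        api_client=config.new_client_from_config(context=context_name))
    pods = api.list_namespaced_pod(
        namespace, field_selector="status.phase=Pending")
    return len(pods.items)

def least_busy_context(namespace="default"):
    """Enumerate the contexts we can see and pick the least backlogged one."""
    contexts, _active = config.list_kube_config_contexts()
    return min((ctx["name"] for ctx in contexts),
               key=lambda name: pending_pod_count(name, namespace))
```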
> We probably want to deploy fluentd or something similar to collect logs here.
I noticed an interesting issue that appears before any logs can be collected: I launched a build task, the pod got scheduled, the logs were empty, but kubectl describe pod XXX showed some interesting events:
Type Reason Age From Message
Warning FailedScheduling 5m27s default-scheduler 0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient cpu.
Warning FailedScheduling 4m19s (x1 over 5m27s) default-scheduler 0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient cpu.
Warning FailedScheduling 4m14s default-scheduler 0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient cpu.
Normal Scheduled 4m6s default-scheduler Successfully assigned default/build-j16-arm64-clang-15-gki-defconfig-jfg99 to aks-spotpool0-15384011-vmss0001y4
Normal TriggeredScaleUp 5m20s cluster-autoscaler pod triggered scale-up: [{aks-spotpool0-15384011-vmss 2->3 (max: 3)}]
Normal Pulling 4m6s kubelet Pulling image "kernelci/staging-clang-15:kselftest-kernelci"
Normal Pulled 3m37s kubelet Successfully pulled image "kernelci/staging-clang-15:kselftest-kernelci" in 28.518747761s
Normal Created 118s (x5 over 3m27s) kubelet Created container kernelci-core-setup
Normal Started 118s (x5 over 3m27s) kubelet Started container kernelci-core-setup
Normal Pulled 118s (x4 over 3m27s) kubelet Container image "kernelci/staging-clang-15:kselftest-kernelci" already present on machine
Warning BackOff 90s (x9 over 3m21s) kubelet Back-off restarting failed container
I am not sure if Kubernetes can stream pod events to fluentd. The worst part is that when the job timed out in Jenkins, the pod just disappeared from the list at that moment and it was too late to run "describe". So we likely have to loop/poll "describe pod" after pod launch to catch events.
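As a rough sketch of that loop/poll idea (not actual wait.py code; the pod name, namespace and polling interval are placeholders), the events that `describe pod` shows can be fetched periodically through the API so they are captured before the pod disappears:

```python
# Sketch: periodically fetch the events attached to a pod, similar to what
# "kubectl describe pod" prints, and log any new ones. Placeholders throughout.
import time
from kubernetes import client, config

def poll_pod_events(pod_name, namespace="default", interval=30, rounds=20):
    config.load_kube_config()
    core = client.CoreV1Api()
    seen = set()
    for _ in range(rounds):
        events = core.list_namespaced_event(
            namespace, field_selector=f"involvedObject.name={pod_name}")
        for ev in events.items:
            key = (ev.reason, ev.message)
            if key not in seen:
                seen.add(key)
                print(f"{ev.type} {ev.reason}: {ev.message}")
        time.sleep(interval)
```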
fluentd collects that sort of stuff AIUI - also note here that there's a "Back-off restarting failed container". That suggests that the container is running but crashing. My guess would be something like exceeding resource limits either set on the job or on the node as a whole (eg, trying to use more disk space than is available).
The "Failed scheduling" is normal for autoscaled clusters, the autoscaler uses pending unschedulable jobs to trigger scaling.
It's not before logs can be collected BTW - what happens with Kubernetes jobs is that the logs are at the pod level, the pods are created dynamically from the jobs, and when a job completes (even if in failure) the job will clean up the pod. This means that the pod might be gone by the time you go look at the job unless you use something like fluentd, which grabs logs as they appear.
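For reference, a minimal sketch of grabbing a job's pod logs while the pod still exists, relying on the `job-name` label that the job controller puts on the pods it creates; the job name and namespace are placeholders:

```python
# Sketch: find the pod(s) behind a Kubernetes job via the job-name label and
# print their logs before the job cleans them up. Placeholders throughout.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def job_pod_logs(job_name, namespace="default"):
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(
        namespace, label_selector=f"job-name={job_name}")
    for pod in pods.items:
        try:
            print(core.read_namespaced_pod_log(pod.metadata.name, namespace))
        except ApiException as exc:
            # The pod may already be gone by the time we look at the job.
            print(f"could not read logs for {pod.metadata.name}: {exc}")
```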
We have some number of jobs failing due to k8s issues, and at the moment we have only pod logs, which are often empty. For example:
I suggest we add to wait.py a periodic fetch of 'describe pod', as it might contain events if the pod failed to schedule or a node failed to scale.
Also I suspect some kernel builds are failing because the build lands on a node that already has too long a build queue; maybe it's worth logging the current state of the specific context where the task is scheduled, e.g. how many tasks are running, pending, failed.
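As a hedged sketch of what logging that state could look like (the namespace and function name are made up, not existing code), a one-shot count of running / pending / failed pods in the context a task is scheduled into:

```python
# Sketch: summarise pod phases in the current context so the backlog can be
# logged next to each submitted task. Namespace is a placeholder.
from collections import Counter
from kubernetes import client, config

def pod_phase_summary(namespace="default"):
    config.load_kube_config()
    core = client.CoreV1Api()
    phases = Counter(pod.status.phase
                     for pod in core.list_namespaced_pod(namespace).items)
    return {phase: phases[phase] for phase in ("Running", "Pending", "Failed")}
```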