nuclearcat opened this issue 2 years ago
The wait.py script is going to be removed; all the scheduling will be done as part of the new pipeline, and this will include LAVA too.
Probably this can be done as a tool to troubleshoot current k8s issues until it gets replaced? I believe it is possible to do with minor code modifications. And the collected information (failure cases) might be handy for future pipeline development.
I think we should be using standard Kubernetes tools to gather the logs.
On Tue, Aug 09, 2022 at 03:29:15AM -0700, Denys Fedoryshchenko wrote:
> I suggest we add to wait.py a periodic fetch of 'describe pod', as it might contain events if the pod failed to schedule or a node failed to scale.
We probably want to deploy fluentd or something similar to collect logs here.
> Also I suspect some kernel builds are failing because the build lands on a node that already has too long a build queue; maybe it's worth logging the current state of the specific context where the task is scheduled, e.g. how many tasks are running, pending, failed.
Kubernetes is generally very tolerant of long queues, however yes, it'd be good to take account of how backlogged things are to better take advantage of uneven cluster sizes - I do have some code to enumerate the clusters we can see, which could be used to pick the least busy one when submitting jobs.
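For illustration only, a rough sketch of what "pick the least busy cluster" could look like with the Kubernetes Python client; the kubeconfig contexts, the `default` namespace and the function names are assumptions, not existing KernelCI code:

```python
# Sketch: choose the kubeconfig context with the smallest scheduling backlog
# before submitting a build job. Names and namespace are placeholders.
from kubernetes import client, config

def pending_pod_count(context_name, namespace="default"):
    """Count pods still waiting to be scheduled in one cluster."""
    api = client.CoreV1Api(
        api_client=config.new_client_from_config(context=context_name))
    pods = api.list_namespaced_pod(
        namespace, field_selector="status.phase=Pending")
    return len(pods.items)

def least_busy_context(namespace="default"):
    """Enumerate the contexts we can see and pick the least backlogged one."""
    contexts, _active = config.list_kube_config_contexts()
    return min((ctx["name"] for ctx in contexts),
               key=lambda name: pending_pod_count(name, namespace))
```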
> We probably want to deploy fluentd or something similar to collect logs here.
I noticed an interesting issue that appears before any logs can be collected: I launched a build task, the pod got scheduled, the logs were empty, but kubectl describe pod XXX showed some interesting events:
Type Reason Age From Message
Warning FailedScheduling 5m27s default-scheduler 0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient cpu.
Warning FailedScheduling 4m19s (x1 over 5m27s) default-scheduler 0/3 nodes are available: 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient cpu.
Warning FailedScheduling 4m14s default-scheduler 0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate, 1 node(s) had taint {node.kubernetes.io/unreachable: }, that the pod didn't tolerate, 2 Insufficient cpu.
Normal Scheduled 4m6s default-scheduler Successfully assigned default/build-j16-arm64-clang-15-gki-defconfig-jfg99 to aks-spotpool0-15384011-vmss0001y4
Normal TriggeredScaleUp 5m20s cluster-autoscaler pod triggered scale-up: [{aks-spotpool0-15384011-vmss 2->3 (max: 3)}]
Normal Pulling 4m6s kubelet Pulling image "kernelci/staging-clang-15:kselftest-kernelci"
Normal Pulled 3m37s kubelet Successfully pulled image "kernelci/staging-clang-15:kselftest-kernelci" in 28.518747761s
Normal Created 118s (x5 over 3m27s) kubelet Created container kernelci-core-setup
Normal Started 118s (x5 over 3m27s) kubelet Started container kernelci-core-setup
Normal Pulled 118s (x4 over 3m27s) kubelet Container image "kernelci/staging-clang-15:kselftest-kernelci" already present on machine
Warning BackOff 90s (x9 over 3m21s) kubelet Back-off restarting failed container
I am not sure if Kubernetes can stream pod events to fluentd. The worst part is that when the job timed out in Jenkins, the pod just disappeared from the list at that moment and it was too late to run "describe". So we likely have to loop/poll "describe pod" after pod launch to catch events.
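As a rough sketch of that loop/poll idea (not actual wait.py code; the pod name, namespace and polling interval are placeholders), the events that `describe pod` shows can be fetched periodically through the API so they are captured before the pod disappears:

```python
# Sketch: periodically fetch the events attached to a pod, similar to what
# "kubectl describe pod" prints, and log any new ones. Placeholders throughout.
import time
from kubernetes import client, config

def poll_pod_events(pod_name, namespace="default", interval=30, rounds=20):
    config.load_kube_config()
    core = client.CoreV1Api()
    seen = set()
    for _ in range(rounds):
        events = core.list_namespaced_event(
            namespace, field_selector=f"involvedObject.name={pod_name}")
        for ev in events.items:
            key = (ev.reason, ev.message)
            if key not in seen:
                seen.add(key)
                print(f"{ev.type} {ev.reason}: {ev.message}")
        time.sleep(interval)
```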
fluentd collects that sort of stuff AIUI - also note here that there's a "Back-off restarting failed container". That suggests that the container is running but crashing. My guess would be something like exceeding resource limits either set on the job or on the node as a whole (eg, trying to use more disk space than is available).
The "Failed scheduling" is normal for autoscaled clusters, the autoscaler uses pending unschedulable jobs to trigger scaling.
It's not before logs can be collected BTW - what happens with Kubernetes jobs is that the logs are at the pod level, the pods are created dynamically from the jobs, and when a job completes (even if in failure) the job will clean up the pod. This means that the pod might be gone by the time you go look at the job unless you use something like fluentd, which grabs logs as they appear.
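For reference, a minimal sketch of grabbing a job's pod logs while the pod still exists, relying on the `job-name` label that the job controller puts on the pods it creates; the job name and namespace are placeholders:

```python
# Sketch: find the pod(s) behind a Kubernetes job via the job-name label and
# print their logs before the job cleans them up. Placeholders throughout.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def job_pod_logs(job_name, namespace="default"):
    config.load_kube_config()
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(
        namespace, label_selector=f"job-name={job_name}")
    for pod in pods.items:
        try:
            print(core.read_namespaced_pod_log(pod.metadata.name, namespace))
        except ApiException as exc:
            # The pod may already be gone by the time we look at the job.
            print(f"could not read logs for {pod.metadata.name}: {exc}")
```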
We have some number of jobs failing due to k8s issues, and at the moment we have only pod logs, which are often empty. For example:
I suggest we add to wait.py a periodic fetch of 'describe pod', as it might contain events if the pod failed to schedule or a node failed to scale.
Also I suspect some kernel builds are failing because the build lands on a node that already has too long a build queue; maybe it's worth logging the current state of the specific context where the task is scheduled, e.g. how many tasks are running, pending, failed.
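As a hedged sketch of what logging that state could look like (the namespace and function name are made up, not existing code), a one-shot count of running / pending / failed pods in the context a task is scheduled into:

```python
# Sketch: summarise pod phases in the current context so the backlog can be
# logged next to each submitted task. Namespace is a placeholder.
from collections import Counter
from kubernetes import client, config

def pod_phase_summary(namespace="default"):
    config.load_kube_config()
    core = client.CoreV1Api()
    phases = Counter(pod.status.phase
                     for pod in core.list_namespaced_pod(namespace).items)
    return {phase: phases[phase] for phase in ("Running", "Pending", "Failed")}
```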