
Service disruption on the production cluster, 11/19/2024 #823

Open larsks opened 1 day ago

larsks commented 1 day ago

We had an unplanned service disruption on the production cluster this Tuesday, 11/19. I wanted to document what happened because I think there are several important lessons we can learn from this incident.

This all began with an investigation into some issues that Naved reported with the billing tools, in which pods were sometimes not labelled correctly by the NVIDIA GPU operator. In the course of looking at the cluster, we noticed that node wrk-98, which has a GPU, was reporting that it had 0 allocatable GPUs:

$ k get node wrk-98 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'
0
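
For context, allocatable GPU counts can be surveyed across all nodes in a single query. This is a sketch in the same jsonpath style as above, not the exact command we ran at the time (nodes without GPUs print an empty second column):

$ k get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.nvidia\.com/gpu}{"\n"}{end}'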

Additionally, several pods associated with the GPU operator that should have been running on that node were missing. For example, on a healthy node, we see:

$ k -n nvidia-gpu-operator get pod --field-selector spec.nodeName=wrk-97
NAME                                                  READY   STATUS      RESTARTS   AGE
gpu-feature-discovery-xfzhm                           1/1     Running     0          6d7h
nvidia-container-toolkit-daemonset-plvfk              1/1     Running     0          6d7h
nvidia-cuda-validator-b2rvj                           0/1     Completed   0          6d7h
nvidia-dcgm-exporter-bhzlg                            1/1     Running     0          6d7h
nvidia-dcgm-w8rxd                                     1/1     Running     0          6d7h
nvidia-device-plugin-daemonset-dpbrq                  1/1     Running     0          6d7h
nvidia-device-plugin-validator-zd4jl                  0/1     Completed   0          6d7h
nvidia-driver-daemonset-415.92.202407191425-0-m6hkg   2/2     Running     0          12d
nvidia-mig-manager-dhfzw                              1/1     Running     0          6d7h
nvidia-node-status-exporter-schzg                     1/1     Running     2          13d
nvidia-operator-validator-q7clp                       1/1     Running     0          6d7h
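
A quick way to spot a node that is missing operator pods is to count nvidia-gpu-operator pods per node. This is a sketch rather than the exact command we used (the node name is column 7 of the wide output):

$ k -n nvidia-gpu-operator get pods -o wide --no-headers | awk '{print $7}' | sort | uniq -c

A healthy GPU node shows roughly the eleven pods listed above; wrk-98 was missing most of them.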

It was at first not obvious why these pods weren't running on wrk-98. Taking the gpu-feature-discovery pod as an example, the associated DaemonSet has the following nodeSelector:

$ k -n nvidia-gpu-operator get daemonset gpu-feature-discovery -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
{"nvidia.com/gpu.deploy.gpu-feature-discovery":"true"}

And node wrk-98 already carried the matching label:

$ k get node wrk-98 -o jsonpath='{.metadata.labels.nvidia\.com/gpu\.deploy\.gpu-feature-discovery}{"\n"}'
true
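
As a cross-check (again a sketch rather than a command captured at the time), a label selector lists every node the daemonset should land on, and wrk-98 was among them:

$ k get nodes -l nvidia.com/gpu.deploy.gpu-feature-discovery=true -o name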

Further investigation revealed that there was a taint on the node:

spec:
  taints:
  - effect: NoSchedule
    key: ai4dd-a100-reserved
    value: "true"

This was preventing any pods without the corresponding toleration from scheduling on that node, which in turn was preventing the NVIDIA GPU operator from spawning the necessary pods on this node. That impacted both the labelling provided by the feature discovery operator and the ability to allocate GPUs to jobs on this node.
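
For reference, the taint and the daemonset's tolerations can be compared directly; roughly (a sketch in the same style as the queries above):

$ k get node wrk-98 -o jsonpath='{.spec.taints}{"\n"}'
$ k -n nvidia-gpu-operator get daemonset gpu-feature-discovery -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'

The operator-managed daemonsets had no toleration for the ai4dd-a100-reserved key, so they could not schedule onto the node.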

Dylan reported that the taint stemmed from a node reservation on 10/29:

...this is an artifact of the demo day, I thought I removed all the taints but apparently not... That taint can be safely removed.

...that would have been tainted as of Oct 29

I removed the taint from the node, which had the unexpected effect of putting the node into the NotReady,SchedulingDisabled state. It turned out that there was a pending MachineConfigPool update that had been blocked from progressing by the node taint:

$ k get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5f8feddd8e671945fd63b78f009bba95   True      False      False      3              3                   3                     0                      617d
worker   rendered-worker-5b8d0f7031ced0ee5a3b0fc2df99db52   False     True       False      44             42                  44                    0                      617d

When the taint was removed, the MCP update was able to proceed, causing the node to get cordoned and drained, which impacted a number of user workloads that were running on wrk-98.
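
In hindsight, the pending update was also visible on the node itself via the machine-config annotations, and checking those before touching the taint would have flagged the risk. A sketch of that check and of the taint removal (the exact invocations were not captured at the time):

$ k get node wrk-98 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'
$ k taint node wrk-98 ai4dd-a100-reserved-

A currentConfig that differs from desiredConfig means the machine-config daemon still intends to update (and reboot) the node as soon as it can be drained.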

Lessons learned

  1. Pending/blocked MachineConfigPool updates are dangerous. We should raise an alert if a MachineConfigPool is pending for more than some amount of time. Maybe 30 minutes? (See the sketch after this list.)

  2. When we are deploying new or updated MachineConfig resources, we need to ensure that those changes are successfully applied to the cluster. MachineConfig updates are one of the few changes that involve node reboots, so we need to ensure they complete within a scheduled maintenance window.

  3. Changes can have unexpected consequences, particularly if the cluster is unhealthy. It's important to validate the health of the cluster before making even trivial changes.
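
As a starting point for the alert suggested in item 1, the relevant signal is the Updated condition on each MachineConfigPool; a pool reporting Updated=False for more than ~30 minutes outside a maintenance window should page someone. This is a rough sketch of the check only (how it gets wired into alerting, e.g. a PrometheusRule, is left open):

$ k get mcp -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.conditions[?(@.type=="Updated")].status}{"\n"}{end}'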

schwesig commented 1 day ago

Thank you @larsks for the detailed report /CC @schwesig