We had an unplanned service disruption on the production cluster this Tuesday, 11/19. I wanted to document what happened because I think there are several important lessons we can learn from this incident.
This all began with an investigation into some issues that Naved reported with the billing tools, in which pods were sometimes not labelled correctly by the NVIDIA GPU operator. In the course of looking at the cluster, we noticed that node wrk-98, which has a GPU, was reporting that it had 0 allocatable GPUs:
$ k get node wrk-98 -o jsonpath='{.status.allocatable.nvidia\.com/gpu}{"\n"}'
0
Additionally, several pods associated with the GPU operator that should have been running on that node were missing. For comparison, on a healthy node we see the full set of operator pods in that node's pod listing.
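One way to compare is to list the operator's pods by node with a field selector (wrk-97 below is just an illustrative stand-in for a healthy GPU node):
$ k -n nvidia-gpu-operator get pods -o wide --field-selector spec.nodeName=wrk-97   # wrk-97 is a placeholder for a healthy node
$ k -n nvidia-gpu-operator get pods -o wide --field-selector spec.nodeName=wrk-98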
It was at first not obvious why these pods weren't running on wrk-98. Taking the gpu-feature-discovery pod as an example, the associated DaemonSet has the following nodeSelector:
$ k -n nvidia-gpu-operator get daemonset gpu-feature-discovery -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}'
{"nvidia.com/gpu.deploy.gpu-feature-discovery":"true"}
And node wrk-98 already had the matching label applied:
$ k get node wrk-98 -o jsonpath='{.metadata.labels.nvidia\.com/gpu\.deploy\.gpu-feature-discovery}{"\n"}'
true
Further investigation revealed that there was a taint on the node:
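The taints on a node can be inspected with the same jsonpath approach used above:
$ k get node wrk-98 -o jsonpath='{.spec.taints}{"\n"}'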
This was preventing any pods without the corresponding toleration from scheduling on that node, which in turn was preventing the NVIDIA GPU operator from spawning the necessary pods on this node. That impacted both the labelling provided by the feature discovery component and the ability to allocate GPUs to jobs on this node.
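The other half of that equation is the toleration list on the operator's DaemonSets, which can be checked the same way; if the taint's key is not tolerated there, the pods will never schedule onto the tainted node:
$ k -n nvidia-gpu-operator get daemonset gpu-feature-discovery -o jsonpath='{.spec.template.spec.tolerations}{"\n"}'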
Dylan reported that the taint stemmed from a node reservation on 10/29:
...this is an artifact of the demo day, I though I removed all the taints but
apparently not.. That taint can be safely removed.
...that would have been tainted as of Oct 29
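For the record, removing a taint is just kubectl taint with a trailing minus on the key (the key shown here is a placeholder, not the actual reservation taint):
$ k taint node wrk-98 demo-day/reservation-   # placeholder key; the trailing '-' removes the taint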
I removed the taint from the node, which had the unexpected effect of putting the node into the NotReady,SchedulingDisabled state. It turned out there was a pending MachineConfigPool update that had been blocked from progressing by the node taint:
$ k get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-5f8feddd8e671945fd63b78f009bba95   True      False      False      3              3                   3                     0                      617d
worker   rendered-worker-5b8d0f7031ced0ee5a3b0fc2df99db52   False     True       False      44             42                  44                    0                      617d
When the taint was removed, the MCP update was able to proceed, and the node was cordoned and drained, which impacted a number of user workloads that were running on wrk-98.
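In hindsight, there was also a per-node signal: the machine-config-daemon records the current and desired rendered config as node annotations (assuming the standard machineconfiguration.openshift.io annotations), and a node that still has a config change, and therefore a reboot, ahead of it will show two different values:
$ k get node wrk-98 -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}{"\n"}{.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig}{"\n"}'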
Lessons learned
Pending/blocked MachineConfigPool updates are dangerous. We should raise an alert if a MachineConfigPool update has been pending for more than some amount of time, maybe 30 minutes; a rough sketch of such a check is included below.
When we are deploying new or updated MachineConfig resources, we need to ensure that those changes are successfully applied to the cluster. MachineConfig updates are one of the few changes that involve node reboots, so we need to ensure they complete within a scheduled maintenance window.
Changes can have unexpected consequences, particularly if the cluster is unhealthy. It's important to validate the health of the cluster before making even trivial changes.
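To make the first point concrete, here is a minimal sketch of such a check (it assumes jq is available and uses an arbitrary 30-minute threshold); run periodically, it prints a line for any pool that has been stuck in Updating for too long, which could feed an alert:
$ k get mcp -o json | jq -r --argjson max 1800 '
    .items[]
    | . as $pool
    | (.status.conditions[]? | select(.type == "Updating" and .status == "True")) as $cond
    | select((now - ($cond.lastTransitionTime | fromdateiso8601)) > $max)
    | "\($pool.metadata.name): Updating since \($cond.lastTransitionTime)"'
For the second point, a deploy pipeline can gate the end of the maintenance window on the pools actually converging, for example:
$ k wait mcp --all --for=condition=Updated --timeout=60m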