Closed: naved001 closed this issue 2 months ago
Not sure if anything was done, but wrk-98 now has all the labels, while wrk-97 is now missing them.
➜ ~ oc get node -o yaml wrk-98 | grep -iE 'nvidia.com/gpu(:|.product:|.machine:)'
nvidia.com/gpu.machine: ThinkSystem-SD650-N-V2
nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
nvidia.com/gpu: "4"
nvidia.com/gpu: "4"
➜ ~ oc get node -o yaml wrk-97 | grep -iE 'nvidia.com/gpu(:|.product:|.machine:)'
nvidia.com/gpu: "4"
nvidia.com/gpu: "4"
It would be nice to fix the issue of these nodes missing the labels, since we'll hopefully be using them for scheduling.
For the purposes of billing, I think a short-term solution would be to hard-code the names of the nodes that have the Lenovo GPUs. @jtriley is it reasonable to assume that the nodes wrk-91 to wrk-101 will always be the Lenovo nodes? We have also been moving nodes in and out of OpenShift, so I want to make sure that these node names will not be reassigned to a different type of node. The billing code will still rely on getting the labels first, and only if the labels are missing will it fall back to the hard-coded node names to figure out the GPU type.
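A minimal sketch of that label-first, hostname-fallback lookup in shell, assuming the label keys shown above; the wrk-91..wrk-101 range and the A100 product string come from this thread, and the actual billing code may look different.

# Sketch only: prefer the nvidia.com/gpu.product label, fall back to the
# assumed Lenovo node-name range wrk-91..wrk-101 when the label is missing.
node=wrk-97
product=$(oc get node "$node" -o jsonpath='{.metadata.labels.nvidia\.com/gpu\.product}')
if [ -z "$product" ]; then
  case "$node" in
    wrk-9[1-9]|wrk-10[01]) product="NVIDIA-A100-SXM4-40GB" ;;  # Lenovo ThinkSystem SD650-N V2
    *) product="unknown" ;;
  esac
fi
echo "$node: $product"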
Yes, the hostnames are tied to each host's MAC address via DHCP. That sounds like a reasonable approach to me - driver crashes will happen from time to time, so having a fallback is a good idea.
@naved001 It turns out this time around I was able to restore the labels simply by deleting/relaunching the discovery deployment pods running on wrk-97 that were in the openshift-nfd and nvidia-gpu-operator namespaces. I checked the nvidia-driver-daemonset pod that's running on that host and I'm able to interact with nvidia-smi, which suggests the driver is fine and doesn't require a reboot. Still not clear what the root cause is for losing the labels.
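For reference, a rough sketch of the commands involved in restarting and verifying those pods (the app=gpu-feature-discovery label selector and the daemonset pod name are assumptions and may differ in this cluster):

# List the discovery pods scheduled on wrk-97 in each namespace
oc get pods -n openshift-nfd -o wide --field-selector spec.nodeName=wrk-97
oc get pods -n nvidia-gpu-operator -o wide --field-selector spec.nodeName=wrk-97
# Delete them so their daemonsets recreate them and re-apply the labels
oc delete pod -n openshift-nfd --field-selector spec.nodeName=wrk-97
oc delete pod -n nvidia-gpu-operator -l app=gpu-feature-discovery --field-selector spec.nodeName=wrk-97
# Confirm the driver still responds from the driver pod (substitute the actual pod name)
oc exec -n nvidia-gpu-operator nvidia-driver-daemonset-xxxxx -- nvidia-smi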
The machine and product labels are missing:
You expect to see:
Similar to https://github.com/nerc-project/operations/issues/558 but this time it's a different node.