nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Missing GPU label from wrk-98 #669

Closed by naved001 2 months ago

naved001 commented 4 months ago

The machine and product labels are missing:

naved@computer ~ % oc get node -o yaml wrk-98 | grep -iE 'nvidia.com/gpu(:|.product:|.machine:)'
    nvidia.com/gpu: "4"
    nvidia.com/gpu: "4"

The expected output should also include:

    nvidia.com/gpu.machine: ThinkSystem-SD650-N-V2
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB

Similar to https://github.com/nerc-project/operations/issues/558 but this time it's a different node.
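
The manual check above could be generalized to every GPU node at once. The following is only a minimal sketch, assuming the node list comes from `oc get nodes -o json` and that a node counts as a GPU node when `nvidia.com/gpu` appears in its `status.capacity`; the function names are hypothetical and not part of any existing tooling.

```python
import json

# Labels that were missing on wrk-98 in this issue.
REQUIRED_LABELS = ("nvidia.com/gpu.machine", "nvidia.com/gpu.product")


def missing_gpu_labels(node):
    """Return the required GPU labels that a single node object lacks.

    Nodes that do not advertise nvidia.com/gpu capacity are skipped
    (an empty list is returned for them).
    """
    capacity = node.get("status", {}).get("capacity", {})
    if "nvidia.com/gpu" not in capacity:
        return []
    labels = node.get("metadata", {}).get("labels", {})
    return [name for name in REQUIRED_LABELS if name not in labels]


def nodes_missing_gpu_labels(node_list_json):
    """Map node name -> missing labels for a JSON node list (hypothetical helper)."""
    result = {}
    for node in json.loads(node_list_json)["items"]:
        missing = missing_gpu_labels(node)
        if missing:
            result[node["metadata"]["name"]] = missing
    return result
```

Passing the output of `oc get nodes -o json` to `nodes_missing_gpu_labels` would list each GPU node that has lost one or both labels, which might be easier than grepping node by node.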

naved001 commented 2 months ago

Not sure if anything was done, but wrk-98 now has all the labels, while wrk-97 is now missing them.

➜  ~ oc get node -o yaml wrk-98 | grep -iE 'nvidia.com/gpu(:|.product:|.machine:)'
    nvidia.com/gpu.machine: ThinkSystem-SD650-N-V2
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
    nvidia.com/gpu: "4"
    nvidia.com/gpu: "4"
➜  ~ oc get node -o yaml wrk-97 | grep -iE 'nvidia.com/gpu(:|.product:|.machine:)'
    nvidia.com/gpu: "4"
    nvidia.com/gpu: "4"

It would be nice to fix the missing labels on these nodes, since we hope to use the labels for scheduling.

For the purposes of billing, I think a short-term solution would be to hard-code the names of the nodes that have the Lenovo GPUs. @jtriley is it reasonable to assume that the nodes wrk-91 to wrk-101 will always be the Lenovo nodes? We have also been moving nodes in and out of OpenShift, so I want to make sure that these node names will not be reassigned to a different type of node. The billing code will still read the labels first and will fall back to the hard-coded node names to determine the GPU type only if the labels are missing.
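
The label-first fallback described above might look roughly like the sketch below. This is not the actual billing code: the function and constant names are hypothetical, and it assumes that every node from wrk-91 to wrk-101 carries the same GPU product that wrk-98 reports.

```python
import re

PRODUCT_LABEL = "nvidia.com/gpu.product"

# Assumption from this thread: wrk-91 through wrk-101 are the Lenovo nodes.
LENOVO_NODE_NUMBERS = range(91, 102)
# Assumption: all Lenovo nodes have the product reported by wrk-98.
LENOVO_GPU_PRODUCT = "NVIDIA-A100-SXM4-40GB"


def gpu_type(node_name, labels):
    """Return the GPU product for a node, preferring the node label.

    Falls back to the hard-coded Lenovo node names only when the label
    is missing; returns None if neither source identifies the GPU.
    """
    product = labels.get(PRODUCT_LABEL)
    if product:
        return product
    match = re.fullmatch(r"wrk-(\d+)", node_name)
    if match and int(match.group(1)) in LENOVO_NODE_NUMBERS:
        return LENOVO_GPU_PRODUCT
    return None
```

For example, `gpu_type("wrk-97", {})` would return `NVIDIA-A100-SXM4-40GB` even while the label is missing, whereas a node outside the hard-coded range with no label would return `None` so the caller can flag it.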

jtriley commented 2 months ago

Yes, the hostnames are tied to each host's MAC address via DHCP. That sounds like a reasonable approach to me. Driver crashes will happen from time to time, so having a fallback is a good idea.

jtriley commented 2 months ago

@naved001 It turns out that this time I was able to restore the labels simply by deleting and relaunching the discovery deployment pods running on wrk-97 in the openshift-nfd and nvidia-gpu-operator namespaces. I checked the nvidia-driver-daemonset pod running on that host and I'm able to interact with nvidia-smi, which suggests the driver is fine and doesn't require a reboot. It's still not clear what the root cause of losing the labels is.