nerc-project / operations

Issues related to the operation of the NERC OpenShift environment

Missing GPU node labels on wrk-91 #558

Closed · naved001 closed this issue 1 month ago

naved001 commented 2 months ago

The node labels nvidia.com/gpu.product and nvidia.com/gpu.machine are missing from wrk-91, which is a Lenovo system with A100s. It does have some other NVIDIA labels, such as the GPU count.

naved@computer ~ % oc get node wrk-91 -o yaml |grep "nvidia.com/gpu:"
    nvidia.com/gpu: "4"
    nvidia.com/gpu: "4"
naved@computer ~ % oc get node wrk-91 -o yaml |grep "nvidia.com/gpu.machine:"
naved@computer ~ % oc get node wrk-91 -o yaml |grep "nvidia.com/gpu.product:"
naved@computer ~ %

Other nodes are correctly labeled, as far as I can tell:

naved@computer ~ % oc get node wrk-92 -o yaml |grep "nvidia.com/gpu.product:"
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
naved@computer ~ % oc get node wrk-89 -o yaml |grep "nvidia.com/gpu.product:"
    nvidia.com/gpu.product: Tesla-V100-PCIE-32GB
naved@computer ~ % oc get node wrk-93 -o yaml |grep "nvidia.com/gpu.product:"
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
naved@computer ~ % oc get node wrk-94 -o yaml |grep "nvidia.com/gpu.product:"
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB

This affects the billing process, as we use those labels to determine the type of GPU a pod ran on.
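
For reference, a quick way to compare the GPU labels across nodes is to request them as columns with oc get -L (a minimal sketch; the exact label set present on each node may differ):

$ oc get nodes -L nvidia.com/gpu.product -L nvidia.com/gpu.machine -L nvidia.com/gpu.count

A node that is missing the labels, like wrk-91 here, shows up with blank columns, which makes it easy to spot.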

joachimweyl commented 2 months ago

@jtriley or @aabaris is this as easy a fix as it sounds?

jtriley commented 2 months ago

Looking at the pods in the nvidia-gpu-operator namespace, we have some that are crash-looping and failing, most notably the nvidia-container-toolkit-daemonset pod running on that node:

nvidia-container-toolkit-daemonset-jb795 driver-validation Unable to determine the device handle for GPU0000:CA:00.0: Unknown Error
nvidia-container-toolkit-daemonset-jb795 driver-validation command failed, retrying after 5 seconds
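
For anyone reproducing this check, something along these lines lists the operator pods on that node and pulls the validation logs (a sketch; driver-validation is assumed to be the container name, as the messages above suggest):

$ oc get pods -n nvidia-gpu-operator -o wide --field-selector spec.nodeName=wrk-91
$ oc logs -n nvidia-gpu-operator nvidia-container-toolkit-daemonset-jb795 -c driver-validation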

I would suggest this node needs to be drained and rebooted before trying other measures.
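Roughly the sequence that would involve, sketched here for reference (note that adding --delete-emptydir-data to the drain wipes any emptyDir volumes on the node, which is why it comes up again below):

$ oc adm cordon wrk-91
$ oc adm drain wrk-91 --ignore-daemonsets
# reboot the host, then:
$ oc adm uncordon wrk-91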

joachimweyl commented 1 month ago

@jtriley is draining and rebooting this node something you can do once the freeze is over?

joachimweyl commented 1 month ago

@jtriley, bringing this back up now that the freeze is over.

jtriley commented 1 month ago

Trying to drain the host yields:

cannot delete Pods with local storage (use --delete-emptydir-data to override): ece440spring2024-619f12/autograder-deployment-5d45468fb6-mtphf, hosting-of-medical-image-analysis-platform-dcb83b/cube1-pfdcm-fff65c56f-2bb5m, 

Typically we pass --delete-emptydir-data if these are OpenShift/system pods, since we are fairly confident they'll come back cleanly. However, the following are not OpenShift/system pods:

ece440spring2024-619f12/autograder-deployment-5d45468fb6-mtphf, hosting-of-medical-image-analysis-platform-dcb83b/cube1-pfdcm-fff65c56f-2bb5m

We'll need to reach out to these users for confirmation that we can delete them safely.
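
One way to enumerate which pods on the node actually use emptyDir volumes, and therefore which users to contact, is a field-selector query plus a jq filter (a sketch, not the exact command used here):

$ oc get pods -A --field-selector spec.nodeName=wrk-91 -o json \
    | jq -r '.items[] | select(any(.spec.volumes[]?; has("emptyDir"))) | .metadata.namespace + "/" + .metadata.name'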

joachimweyl commented 1 month ago

@msdisme this looks like the Spring 2024 ECE 440 class; can we delete this pod?

dystewart commented 1 month ago

ece440spring2024-619f12/autograder-deployment-5d45468fb6-mtphf has been deleted

You can see a list of the still-failing pods on that node here

2 are nvidia operator related and one looks to be related to a research project: pod
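
For anyone following along, the failing pods on that node can be listed with a field selector and a status filter, e.g.:

$ oc get pods -A -o wide --field-selector spec.nodeName=wrk-91 | grep -vE 'Running|Completed'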

jtriley commented 1 month ago

The last remaining user workload I can see is this one:

cannot delete Pods with local storage (use --delete-emptydir-data to override): hosting-of-medical-image-analysis-platform-dcb83b/cube1-pfdcm-fff65c56f-2bb5m

That pod appears to be using a 1G emptyDir volume mounted at /home/dicom. We should reach out and see if they want to back that up before we bounce the host, since the data will be lost at that point.
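
One way to confirm the volume and its mount path before bouncing the host is a quick grep over the pod spec, similar to how the labels were checked above:

$ oc get pod -n hosting-of-medical-image-analysis-platform-dcb83b cube1-pfdcm-fff65c56f-2bb5m -o yaml \
    | grep -E -B2 -A2 'emptyDir|mountPath'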

joachimweyl commented 1 month ago

@Milstein please add to this issue any update you receive from Rudolph about shutting down this pod.

Milstein commented 1 month ago

@jtriley: I confirm this worker node can be rebooted. They are only using that emptyDir mount for temporary data.

jtriley commented 1 month ago

The wrk-91 node has been rebooted and this issue appears to be resolved now:


$ oc get node -o yaml wrk-91 | grep -iE 'nvidia.com/gpu(:|.product:|.machine:)'
    nvidia.com/gpu.machine: ThinkSystem-SD650-N-V2
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
    nvidia.com/gpu: "4"
    nvidia.com/gpu: "4"