@jtriley or @aabaris, is this as easy a fix as it sounds?
Looking at the pods in the nvidia-gpu-operator namespace, we have some that are in CrashLoopBackOff and failing, most notably the nvidia-container-toolkit-daemonset pod running on that node:
```
nvidia-container-toolkit-daemonset-jb795 driver-validation Unable to determine the device handle for GPU0000:CA:00.0: Unknown Error
nvidia-container-toolkit-daemonset-jb795 driver-validation command failed, retrying after 5 seconds
```
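For reference, a minimal sketch of how to inspect this yourself (pod and container names taken from the log lines above):

```sh
# List pods in the GPU operator namespace to spot the crashlooping ones
oc get pods -n nvidia-gpu-operator

# Tail the driver-validation container logs on the failing daemonset pod
oc logs nvidia-container-toolkit-daemonset-jb795 -n nvidia-gpu-operator \
  -c driver-validation --tail=50
```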
I would suggest this node needs to be drained and rebooted before trying other measures.
@jtriley, is draining and rebooting this node something you can do once the freeze is over?
@jtriley, bringing this back up now that the freeze is over.
Trying to drain the host yields:
```
cannot delete Pods with local storage (use --delete-emptydir-data to override): ece440spring2024-619f12/autograder-deployment-5d45468fb6-mtphf, hosting-of-medical-image-analysis-platform-dcb83b/cube1-pfdcm-fff65c56f-2bb5m,
```
We typically pass `--delete-emptydir-data` when these are OpenShift/system pods, since we're fairly confident they'll come back cleanly. However, the following are not OpenShift/system pods:

```
ece440spring2024-619f12/autograder-deployment-5d45468fb6-mtphf
hosting-of-medical-image-analysis-platform-dcb83b/cube1-pfdcm-fff65c56f-2bb5m
```
We'll need to reach out to these users for confirmation that we can delete them safely.
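For context, once we have confirmation, the drain/reboot sequence would look something like this (a sketch using standard oc commands; the reboot step is one common OpenShift pattern and should be adjusted to local practice):

```sh
# Cordon and drain the node; --delete-emptydir-data evicts pods backed by emptyDir volumes
oc adm drain wrk-91 --ignore-daemonsets --delete-emptydir-data

# Reboot the node from a debug pod (assumes direct node access via oc debug)
oc debug node/wrk-91 -- chroot /host systemctl reboot

# Once the node is back, mark it schedulable again
oc adm uncordon wrk-91
```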
@msdisme this looks like the Spring 2024 ECE 440 class; can we delete this pod?
The last remaining user workload I can see is this one:
```
cannot delete Pods with local storage (use --delete-emptydir-data to override): hosting-of-medical-image-analysis-platform-dcb83b/cube1-pfdcm-fff65c56f-2bb5m
```
That pod appears to be using a 1G emptyDir volume mounted at /home/dicom. We should reach out and see if they care to back that up before we bounce the host, given that the data will be lost at that point.
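A quick way to confirm the volume details (pod and namespace from the error above; the grep is just illustrative):

```sh
# Dump the pod spec and look at the emptyDir stanza and its sizeLimit
oc get pod cube1-pfdcm-fff65c56f-2bb5m \
  -n hosting-of-medical-image-analysis-platform-dcb83b \
  -o yaml | grep -B 2 -A 3 emptyDir
```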
@Milstein please add to this issue any update you receive from Rudolph about shutting down this pod.
@jtriley: I confirm this worker node can be rebooted. They are using that emptyDir mount for temp data only.
The wrk-91 node has been rebooted and this issue appears to be resolved now:
```
$ oc get node -o yaml wrk-91 | grep -iE 'nvidia.com/gpu(:|.product:|.machine:)'
    nvidia.com/gpu.machine: ThinkSystem-SD650-N-V2
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
    nvidia.com/gpu: "4"
    nvidia.com/gpu: "4"
```
The node labels `nvidia.com/gpu.product` and `nvidia.com/gpu.machine` are missing from wrk-91, which is a Lenovo system with the A100s. It does have some other NVIDIA labels, like the count of GPUs. Other nodes are correctly labeled as far as I can tell.

This affects the billing process, as we use those labels to determine the type of GPU a pod was run on.
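As a quick fleet-wide check, something like the following would surface any node missing these labels (a sketch; `-L` prints label columns, and `nvidia.com/gpu.count` is the GPU-count label mentioned above):

```sh
# Print every node with its GPU labels; nodes missing a label show a blank column
oc get nodes -L nvidia.com/gpu.product -L nvidia.com/gpu.machine -L nvidia.com/gpu.count
```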