Open Damowerko opened 1 year ago
You can try to add a shutdown command to the set-compute-mode container. This container must wait and run `nvidia-smi -c 0` when it receives a SIGINT.
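A minimal sketch of what such a shutdown command might look like, assuming the container's entrypoint is a POSIX shell script. The function names and the `NVIDIA_SMI` override variable are hypothetical and not part of NOS; only `nvidia-smi -c 0` comes from the suggestion above.

```shell
#!/bin/sh
# Sketch only: NVIDIA_SMI is a hypothetical override (not part of NOS) that
# lets the command be swapped out for a dry run.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Put the GPUs back into the default compute mode (0 = DEFAULT).
restore_default_mode() {
    "$NVIDIA_SMI" -c 0
}

# Block until the container is told to stop, then restore the mode and exit.
wait_and_restore() {
    trap 'restore_default_mode; exit 0' INT TERM
    # Sleep in the background and wait on it so the trap fires promptly.
    while :; do
        sleep 3600 &
        wait "$!"
    done
}
```

Ending the script with a call to `wait_and_restore` would make it the container's main process. The sketch traps TERM as well as INT because Kubernetes sends SIGTERM by default when it stops a container.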
When I am able, I will add a preStop hook to the container and test whether this resolves the issue.
Have you seen this MR? https://github.com/NVIDIA/k8s-device-plugin/pull/490 Maybe you can use the MPS daemon from NVIDIA.
@Baenimyr Good that the device plugin supports MPS now. The problem is that it does not scale dynamically. Of course, NOS could use the NVIDIA plugin now. However, with the NVIDIA DRA driver on the horizon, it does not make sense for me personally to use NOS.
In my use-case I am often enabling and disabling NOS on individual nodes by adding/removing the label `nos.nebuly.com/gpu-partitioning=mps`. After labeling the node, NOS will change the GPU mode to exclusive. However, after removing the label, the GPU remains in exclusive mode.
Expected behavior: NOS should revert the GPU mode to whatever it was when it started or to default.
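To reproduce the behavior described above, the label toggling and a compute-mode check could be sketched as follows. The function names and the `KUBECTL` / `NVIDIA_SMI` override variables are hypothetical; the node name is passed as an argument, and the mode check is assumed to run on the GPU node itself.

```shell
#!/bin/sh
# Sketch only: KUBECTL and NVIDIA_SMI are hypothetical overrides for dry runs.
KUBECTL="${KUBECTL:-kubectl}"
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"
LABEL_KEY="nos.nebuly.com/gpu-partitioning"

# Enable MPS partitioning on a node by adding the label.
enable_mps() {
    "$KUBECTL" label node "$1" "${LABEL_KEY}=mps"
}

# Disable it again; a trailing "-" after the key removes a label in kubectl.
disable_mps() {
    "$KUBECTL" label node "$1" "${LABEL_KEY}-"
}

# Print the index and compute mode of each GPU on the current machine.
show_compute_mode() {
    "$NVIDIA_SMI" --query-gpu=index,compute_mode --format=csv,noheader
}
```

Running `show_compute_mode` before `enable_mps`, after it, and again after `disable_mps` should show whether the mode is reverted once the label is gone.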
Workaround: Change back to default mode (or whatever mode you want) after removing the label. Do this for all GPUs. For example, to change the mode on GPU 0 back to default use the following.
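A minimal sketch of such a reset, assuming the standard `nvidia-smi` flags `-i` (select a GPU by index) and `-c` (set the compute mode). The function names and the `NVIDIA_SMI` override variable are hypothetical, and changing the compute mode generally requires root on the node.

```shell
#!/bin/sh
# Sketch only: NVIDIA_SMI is a hypothetical override so the commands can be
# dry-run without touching a real GPU.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Reset a single GPU, identified by its index, to the default compute mode.
reset_gpu() {
    "$NVIDIA_SMI" -i "$1" -c DEFAULT
}

# Reset every GPU: without -i, nvidia-smi targets all GPUs on the machine.
reset_all_gpus() {
    "$NVIDIA_SMI" -c DEFAULT
}
```

For GPU 0 this amounts to calling `reset_gpu 0`; `reset_all_gpus` covers the "do this for all GPUs" case in one step.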