nebuly-ai / nos

Module to automatically maximize the utilization of GPU resources in a Kubernetes cluster through real-time dynamic partitioning and elastic quotas - Effortless optimization at its finest!
https://www.nebuly.com/
Apache License 2.0

NOS MPS leaves GPUs on node in exclusive mode #27

Open Damowerko opened 1 year ago

Damowerko commented 1 year ago

In my use case I often enable and disable NOS on individual nodes by adding or removing the label nos.nebuly.com/gpu-partitioning=mps. After labeling the node, NOS changes the GPU compute mode to exclusive. However, after removing the label, the GPU remains in exclusive mode.
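For reference, the toggling looks roughly like this (the node name is a placeholder; the trailing dash removes the label):

# Enable MPS partitioning on a node
kubectl label node <node-name> nos.nebuly.com/gpu-partitioning=mps
# Disable it again
kubectl label node <node-name> nos.nebuly.com/gpu-partitioning-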

Expected behavior: NOS should revert the GPU compute mode to whatever it was before partitioning was enabled, or at least to the default mode.

Workaround: Change the mode back to default (or whatever mode you want) after removing the label. Do this for every GPU on the node. For example, to change the mode on GPU 0 back to default, use the following.

nvidia-smi -i 0 -c 0
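
Since the mode has to be reset on every GPU, here is a rough sketch of the same workaround applied to all GPUs on the node (assuming nvidia-smi is available and run with sufficient privileges):

# Reset every GPU on this node to DEFAULT compute mode (0)
for i in $(nvidia-smi --query-gpu=index --format=csv,noheader); do
  nvidia-smi -i "$i" -c 0
done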
Baenimyr commented 10 months ago

You can try adding a shutdown handler to the set-compute-mode container: the container would wait and run nvidia-smi -c 0 when it receives a SIGINT.
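Something like this rough sketch (the wrapper script and the exclusive-mode step are assumptions about what the set-compute-mode container does, not actual NOS code):

#!/bin/sh
# Hypothetical wrapper entrypoint for the set-compute-mode container.
restore() {
  # Put all GPUs back into DEFAULT compute mode before exiting.
  nvidia-smi -c 0
  exit 0
}
trap restore INT TERM

# Assumed: the container normally sets EXCLUSIVE_PROCESS mode (3).
nvidia-smi -c 3

# Keep the container alive so the trap can fire on shutdown.
while true; do sleep 1; done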

Damowerko commented 9 months ago

When I am able, I will add a preStop hook to the container and test whether this resolves the issue.

Baenimyr commented 9 months ago

Have you seen this MR? https://github.com/NVIDIA/k8s-device-plugin/pull/490 Maybe you can use the MPS daemon from NVIDIA.

Damowerko commented 8 months ago

@Baenimyr It is good that the device plugin supports MPS now. The problem is that it does not scale dynamically. Of course, NOS could use the NVIDIA plugin now, but with the NVIDIA DRA driver on the horizon, it does not make sense for me personally to keep using NOS.