aneutron opened this issue 2 years ago
Hi @aneutron! MIG wasn't available when the Nvidia driver was first developed, and I'll be honest and say it hasn't seen a lot of investment as we don't have too many users making feature requests.
That being said, let's dig into this idea a bit...
Enable MIG by following the NVIDIA guide. Relaunch the Nomad agent. ... See the GPU with MIG enabled disappear from the available resources (as it's no longer usable)
When a Nomad agent is restarted, the workloads are left running. Ignoring Nomad for a moment, if we enable MIG while a workload has the device mounted, what's the expected behavior there? If workloads are expected to stay running, then do we need to be updating the agent's fingerprint of the GPU with the MIG option whether or not we restart Nomad?
Are there security implications to using MIG (above and beyond the usual security implications of exposing the GPU to workloads)? What does this look like outside of Docker with our `exec` driver (or worse, the `raw_exec` driver)?
As an aside, as of the 1.2.0 release which should be coming this week or so, the Nvidia device driver is externalized rather than being bundled with Nomad (see https://github.com/hashicorp/nomad/pull/10796). I'm going to move this over to https://github.com/hashicorp/nomad-device-nvidia/issues, which is where all the Nvidia GPU device driver work is done these days.
Hey @tgross,
Sorry for putting the issue on the wrong project. I'm not an expert on all MIG / CUDA matters myself, but I can perhaps offer some points to help reasoning about this issue:
(I'm basing most of this on what I understood from NVIDIA's MIG documentation.)
Normally, you cannot enable MIG on a device while that device is being used by any process whatsoever. And once MIG instances are created, they cannot be removed while any CUDA workload is running on them. So if there is an active CUDA workload, the operator would have to either drain the node or stop the job before enabling MIG / creating MIG instances.
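To make the constraint concrete, here is a minimal Go sketch of how a device plugin could refuse a MIG-mode toggle while the GPU is still allocated. Both `deviceInUse` and the ID strings are hypothetical illustrations, not Nomad's actual plugin API:

```go
package main

import "fmt"

// deviceInUse reports whether any active allocation still references the
// given GPU UUID. Hypothetical helper; Nomad's real device plugin API differs.
func deviceInUse(gpuUUID string, allocatedIDs []string) bool {
	for _, id := range allocatedIDs {
		if id == gpuUUID {
			return true
		}
	}
	return false
}

// canToggleMIG mirrors the driver-level rule: MIG mode can only change
// when no process (here: no Nomad allocation) is using the GPU.
func canToggleMIG(gpuUUID string, allocatedIDs []string) bool {
	return !deviceInUse(gpuUUID, allocatedIDs)
}

func main() {
	allocs := []string{"GPU-aaaa", "GPU-bbbb"}
	fmt.Println(canToggleMIG("GPU-aaaa", allocs)) // false: still allocated
	fmt.Println(canToggleMIG("GPU-cccc", allocs)) // true: free to toggle
}
```

Either way, the operator has to quiesce the device first.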
That, however, is not sufficient: we could have jobs that don't actively use CUDA, and those could have the GPU or MIG instance yanked out from under them.
To avoid problems, we could perhaps remove the GPU from the fingerprint when MIG is enabled and mark all jobs that depend on it as "failed" (or something similar), forcing Nomad to reschedule them (I'm not well versed in Nomad, sorry if I'm being incoherent or irrelevant).
Maybe an agent-local periodic check that detects any changes in the available devices and marks the affected jobs as failed?
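That periodic check boils down to diffing two fingerprints. A small Go sketch of the set comparison (the function name and the ID formats are made up for illustration; this is not Nomad's fingerprint API):

```go
package main

import "fmt"

// diffDevices compares the previous and current device fingerprints and
// returns the IDs that disappeared and the IDs that are new. A periodic
// fingerprint loop could use the "removed" set to mark dependent
// allocations for rescheduling.
func diffDevices(prev, curr []string) (removed, added []string) {
	prevSet := make(map[string]bool, len(prev))
	for _, id := range prev {
		prevSet[id] = true
	}
	currSet := make(map[string]bool, len(curr))
	for _, id := range curr {
		currSet[id] = true
		if !prevSet[id] {
			added = append(added, id)
		}
	}
	for _, id := range prev {
		if !currSet[id] {
			removed = append(removed, id)
		}
	}
	return removed, added
}

func main() {
	prev := []string{"GPU-3"}
	curr := []string{"MIG-GPU-3/1/0", "MIG-GPU-3/2/0"} // after enabling MIG
	removed, added := diffDevices(prev, curr)
	fmt.Println(removed) // [GPU-3]
	fmt.Println(added)   // [MIG-GPU-3/1/0 MIG-GPU-3/2/0]
}
```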
As for the security implications: as far as I can tell, MIG instances are physically separated (different compute slices and even different memory lanes, if I understood correctly; only the boundaries are set by software), so from an inter-job interference standpoint they are quite isolated.
Now, if we are talking about limiting access to these devices, I can't give you a clear picture straight away. For Docker workloads the solution is pretty simple, as you can use the NVIDIA runtime's facilities to pass only one or more MIG instances to a workload. For `exec` and `raw_exec` workloads, however, it's a bit more nuanced:
NVIDIA has thought of it, and the documentation mentions that cgroups can be used to finely control which workload accesses which GPUs and MIG instances. It looks very doable; I haven't tested it, however, since it's not my particular workload.
But since Nomad already has quite a bit of support for cgroups, I imagine (perhaps naively) that integrating the same measures for MIG wouldn't be trivial, but shouldn't be impossible either.
Sorry for the stale bump, but with the cost/scarcity of A100s/H100s this is starting to become an issue: it's getting harder to avoid needing k8s or cloud containers to assign more than one job to a single GPU.
e.g. (https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html) - Last Updated: 2023-06-10
I appreciate that my comment doesn't add any value or support to helping this become a reality, but I would greatly appreciate this being given another look, as it's extremely wasteful to dedicate a full GPU to a task that needs only 1-10 GB.
This is also one of the bigger blockers for us, and I've decided to take a stab at it, at least to experiment internally, but I'd appreciate some forward guidance on whether this is something that could eventually be upstreamed. One thing this PR doesn't cover at all is enabling/disabling MIG while workloads are running, or changing the MIG mode dynamically before fingerprinting. I think that's an extremely rare occurrence that would complicate the initial implementation a lot, and the value this adds on its own is already very big for us. We are going to start slowly rolling this out internally to see if there are any edge cases we haven't fixed, but if anyone wants to take a look and let me know whether it's the right direction, that would be super helpful!
Hey @isidentical, thanks for the PR. I just noticed that this slipped through the cracks and nobody has reviewed it yet. I'll bump it on our internal channels so somebody takes a look soon. Sorry for the delay!
One thing this PR doesn't cover at all is enabling/disabling MIG while workloads are running, or changing the MIG mode dynamically before fingerprinting. I think that's an extremely rare occurrence that would complicate the initial implementation a lot,
I think treating the device as modal, GPU or MIG, is fine as long as we clearly document that behavior. Any operator capable of altering a node's configuration should be capable of draining it. It seems like we can consider "graceful migration" a future enhancement.
Nomad version
Nomad v1.2.0-dev (6d35e2fb58663aa2ad8b8f47459eff342901e72a)
(The nightly build that fixed hashicorp/nomad#11342)
Operating system and Environment details
Issue
Hello Again !
First of all, thank you very much for the amazing response time and the quick fix for the preemption problem in hashicorp/nomad#11342. While testing Nomad, I wondered about its compatibility with NVIDIA's Multi-Instance GPU (MIG) feature.
In a nutshell, it allows us to physically partition a big GPU into more bite-sized GPUs. That can be immensely useful for numerous use cases (e.g. hosting multiple jobs on the same chonkster of a GPU).
When MIG is enabled on a GPU, you cannot use the GPU itself as a resource (i.e. you can only use the MIG instances created on it). For example, in my setup we have 4 A100 GPUs, of which GPU 3 has MIG enabled. I went ahead and created 2 half-GPUs (basically). This is the `nvidia-smi -L` output:

(output omitted)

Yet when I run `nomad node status {my_node_id}`, this is the output of the GPU resources part:

(output omitted)

Now this is problematic for two reasons:
Now, MIG instances are usable almost as fully-fledged GPUs (more specifically, as CUDA devices), so they're perfectly compatible with Docker workloads using the NVIDIA runtime (e.g. instead of --gpus='"device=0"' you'd use --gpus='"device=0:0"' to reference the first MIG instance on the first GPU).
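To illustrate that addressing scheme, a small Go sketch that builds the value for Docker's `--gpus` option (`gpusFlag` is a made-up helper; only the `device=<gpu>` and `device=<gpu>:<mig>` forms come from the NVIDIA runtime's documented syntax):

```go
package main

import "fmt"

// gpusFlag builds the value for docker's --gpus option. A whole GPU is
// referenced by index ("device=0"); with MIG enabled, an instance is
// addressed as <gpu>:<mig> ("device=0:0"). Pass mig < 0 for a whole GPU.
func gpusFlag(gpu, mig int) string {
	if mig < 0 {
		return fmt.Sprintf(`"device=%d"`, gpu)
	}
	return fmt.Sprintf(`"device=%d:%d"`, gpu, mig)
}

func main() {
	fmt.Println("--gpus=" + gpusFlag(0, -1)) // --gpus="device=0"
	fmt.Println("--gpus=" + gpusFlag(0, 0))  // --gpus="device=0:0"
}
```

MIG instances can also be addressed by their full MIG UUID, which avoids index ambiguity across reboots.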
Reproduction steps
Expected Result
Actual Result
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)