aneutron opened this issue 2 years ago
Hi @aneutron! MIG wasn't available when the Nvidia driver was first developed, and I'll be honest and say it hasn't seen a lot of investment as we don't have too many users making feature requests.
That being said, let's dig into this idea a bit...
Enable MIG by following the NVIDIA guide. Relaunch the Nomad agent. ... See the GPU with MIG enabled disappear from the available resources (as it's no longer usable)
When a Nomad agent is restarted, the workloads are left running. Ignoring Nomad for a moment, if we enable MIG while a workload has the device mounted, what's the expected behavior there? If workloads are expected to stay running, then do we need to be updating the agent's fingerprint of the GPU with the MIG option whether or not we restart Nomad?
Are there security implications to using MIG (above and beyond the usual security implications of exposing the GPU to workloads)? What does this look like outside of Docker with our `exec` driver (or worse, the `raw_exec` driver)?
As an aside, as of the 1.2.0 release which should be coming this week or so, the Nvidia device driver is externalized rather than being bundled with Nomad (see https://github.com/hashicorp/nomad/pull/10796). I'm going to move this over to https://github.com/hashicorp/nomad-device-nvidia/issues, which is where all the Nvidia GPU device driver work is done these days.
Hey @tgross,
Sorry for putting the issue on the wrong project. I'm not an expert on all MIG / CUDA matters myself, but I can perhaps offer some points to help reasoning about this issue:
(I'm basing most of this on what I understood from NVIDIA's MIG documentation.)
Normally, you cannot enable MIG on a device while that device is being used by any process whatsoever. And once MIG instances are created, they cannot be removed while any CUDA workload is running on them. So if there is an active CUDA workload, the operator would have to either drain the node or stop the job before enabling MIG / creating MIG instances.
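To make the constraint concrete, here is a minimal Go sketch of how a device plugin could refuse a MIG-mode toggle while the GPU is still allocated. Both `deviceInUse` and the ID strings are hypothetical illustrations, not Nomad's actual plugin API:

```go
package main

import "fmt"

// deviceInUse reports whether any active allocation still references the
// given GPU UUID. Hypothetical helper; Nomad's real device plugin API differs.
func deviceInUse(gpuUUID string, allocatedIDs []string) bool {
	for _, id := range allocatedIDs {
		if id == gpuUUID {
			return true
		}
	}
	return false
}

// canToggleMIG mirrors the driver-level rule: MIG mode can only change
// when no process (here: no Nomad allocation) is using the GPU.
func canToggleMIG(gpuUUID string, allocatedIDs []string) bool {
	return !deviceInUse(gpuUUID, allocatedIDs)
}

func main() {
	allocs := []string{"GPU-aaaa", "GPU-bbbb"}
	fmt.Println(canToggleMIG("GPU-aaaa", allocs)) // false: still allocated
	fmt.Println(canToggleMIG("GPU-cccc", allocs)) // true: free to toggle
}
```

Either way, the operator has to quiesce the device first.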
That, however, is not sufficient: we could have jobs that don't actively use CUDA, and those could have the GPU or MIG instance yanked out from under them.
To avoid problems, we could perhaps remove the GPU from the fingerprint when MIG is enabled and mark all jobs that depend on it as "failed" (or something similar), forcing Nomad to reschedule them (I'm not well versed in Nomad, sorry if I'm being incoherent or irrelevant).
Maybe an agent-local periodic check that detects any changes in the available devices and marks the affected jobs as failed?
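That periodic check boils down to diffing two fingerprints. A small Go sketch of the set comparison (the function name and the ID formats are made up for illustration; this is not Nomad's fingerprint API):

```go
package main

import "fmt"

// diffDevices compares the previous and current device fingerprints and
// returns the IDs that disappeared and the IDs that are new. A periodic
// fingerprint loop could use the "removed" set to mark dependent
// allocations for rescheduling.
func diffDevices(prev, curr []string) (removed, added []string) {
	prevSet := make(map[string]bool, len(prev))
	for _, id := range prev {
		prevSet[id] = true
	}
	currSet := make(map[string]bool, len(curr))
	for _, id := range curr {
		currSet[id] = true
		if !prevSet[id] {
			added = append(added, id)
		}
	}
	for _, id := range prev {
		if !currSet[id] {
			removed = append(removed, id)
		}
	}
	return removed, added
}

func main() {
	prev := []string{"GPU-3"}
	curr := []string{"MIG-GPU-3/1/0", "MIG-GPU-3/2/0"} // after enabling MIG
	removed, added := diffDevices(prev, curr)
	fmt.Println(removed) // [GPU-3]
	fmt.Println(added)   // [MIG-GPU-3/1/0 MIG-GPU-3/2/0]
}
```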
As for the security implications: as far as I can tell, MIG instances are physically separated (different compute slices and even different memory lanes, if I understood correctly; only the boundaries are set by software), so from an inter-job interference standpoint they are quite isolated.
Now, if we are talking about limiting access to these devices, I can't give you a clear picture straight away. For Docker workloads the solution is pretty simple, as you can use the NVIDIA runtime's facilities to pass only one or more MIG instances to a workload. For `exec` and `raw_exec` workloads, however, it's a bit more nuanced:
NVIDIA has thought of it, and the documentation mentions that cgroups can be used to finely control which workload accesses which GPUs and MIG instances. It looks very doable; I haven't tested it, however, since it's not my particular workload.
But since Nomad already has quite a bit of support for cgroups, I imagine (perhaps naively) that integrating the same measures for MIG wouldn't be trivial, but shouldn't be impossible either.
Sorry for the stale bump, but with the cost/scarcity of A100s/H100s this is starting to become an issue: it's getting harder to avoid needing k8s or cloud containers to assign more than one job to a single GPU.
e.g. (https://docs.nvidia.com/datacenter/cloud-native/kubernetes/mig-k8s.html) - Last Updated: 2023-06-10
I appreciate that my comment doesn't add any value or support to helping this become a reality, but I would greatly appreciate this being given another look, as it's extremely wasteful to dedicate a full GPU to a task that needs only 1-10 GB.
This is also one of the bigger blockers for us, and I've decided to take a stab at it, at least to experiment internally, but I'd appreciate some forward guidance on whether this is something that could eventually be upstreamed. One thing this PR doesn't cover at all is enabling/disabling MIG while workloads are running, or changing the MIG mode dynamically before fingerprinting. I think that's an extremely rare occurrence that would complicate the initial implementation a lot, and the value this adds on its own is already very big for us. We are going to start slowly rolling this out internally to see if there are any edge cases we haven't fixed, but if anyone wants to take a look and let me know whether it's the right direction, that would be super helpful!
Hey @isidentical, thanks for the PR. I just noticed that this slipped through the cracks and nobody has reviewed it yet. I'll bump it on our internal channels so somebody takes a look soon. Sorry for the delay!
One thing this PR doesn't cover at all is enabling/disabling MIG while workloads are running, or changing the MIG mode dynamically before fingerprinting. I think that's an extremely rare occurrence that would complicate the initial implementation a lot,
I think treating the device as modal, GPU or MIG, is fine as long as we clearly document that behavior. Any operator capable of altering a node's configuration should be capable of draining it. It seems like we can consider "graceful migration" a future enhancement.
Nomad version
Nomad v1.2.0-dev (6d35e2fb58663aa2ad8b8f47459eff342901e72a)
(The nightly build that fixed hashicorp/nomad#11342)
Operating system and Environment details
Issue
Hello Again !
First of all, thank you very much for the amazing response time and the quick fix for the preemption problem in hashicorp/nomad#11342. While testing Nomad, I wondered about its compatibility with NVIDIA's Multi-Instance GPU (MIG) feature.
In a nutshell, it allows us to physically partition a big GPU into more bite-sized GPUs. That can be immensely useful for numerous use cases (e.g. hosting multiple jobs on the same chonkster of a GPU).
When MIG is enabled on a GPU, you cannot use the GPU itself as a resource (i.e. you can only use the MIG instances created on it). For example, in my setup we have 4 A100 GPUs, of which GPU 3 has MIG enabled. I went ahead and created 2 half-GPUs (basically). This is the `nvidia-smi -L` output:

(output omitted)

Yet when I run `nomad node status {my_node_id}`, this is the output of the GPU resources part:

(output omitted)

Now this is problematic for two reasons:
Now, MIG instances are usable almost as fully-fledged GPUs (more specifically, as CUDA devices), so they're perfectly compatible with Docker workloads using the NVIDIA runtime (e.g. instead of --gpus='"device=0"' you'd use --gpus='"device=0:0"' to reference the first MIG instance on the first GPU).
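To illustrate that addressing scheme, a small Go sketch that builds the value for Docker's `--gpus` option (`gpusFlag` is a made-up helper; only the `device=<gpu>` and `device=<gpu>:<mig>` forms come from the NVIDIA runtime's documented syntax):

```go
package main

import "fmt"

// gpusFlag builds the value for docker's --gpus option. A whole GPU is
// referenced by index ("device=0"); with MIG enabled, an instance is
// addressed as <gpu>:<mig> ("device=0:0"). Pass mig < 0 for a whole GPU.
func gpusFlag(gpu, mig int) string {
	if mig < 0 {
		return fmt.Sprintf(`"device=%d"`, gpu)
	}
	return fmt.Sprintf(`"device=%d:%d"`, gpu, mig)
}

func main() {
	fmt.Println("--gpus=" + gpusFlag(0, -1)) // --gpus="device=0"
	fmt.Println("--gpus=" + gpusFlag(0, 0))  // --gpus="device=0:0"
}
```

MIG instances can also be addressed by their full MIG UUID, which avoids index ambiguity across reboots.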
Reproduction steps
Expected Result
Actual Result
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)