hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

`resource.constraints` with non-exclusive device access #23703

Closed csicar closed 2 months ago

csicar commented 2 months ago

Proposal

Give resource constraints / affinities the ability to be non-exclusive.

Use-cases

Let's say I'd like to use a GPU with Docker containers. My resources section might look like this:

resources {
    …
    device "nvidia/gpu" {
        constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "2 GiB"
        }
    }
}

I basically want to say that my job (container) requires at least 2 GiB of free VRAM. Instead, at the moment, Nomad interprets this as "give the job exclusive access to an NVIDIA GPU with at least 2 GiB of memory", even if multiple jobs could be scheduled on a single GPU.
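To make the request concrete, the stanza below uses a hypothetical `exclusive` attribute that does not exist in Nomad today; it is only a sketch of how non-exclusive device access could be expressed:

resources {
    device "nvidia/gpu" {
        # Hypothetical attribute (not part of the current jobspec):
        # allow other allocations to be placed on the same physical GPU.
        exclusive = false

        constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "2 GiB"
        }
    }
}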

Attempted Solutions

tgross commented 2 months ago

Hi @csicar! For this to happen we'd need to model internal-to-device resources in the same way that we model those resources for whole hosts. Our current API for devices (ref Device Plugins) only supports whole devices. Fractional devices seem unlikely to make sense for devices other than GPUs, and they would be a fairly large lift to support in the scheduler.
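For context, the current device block only accepts whole-device counts, so even a minimal request like the one below claims the entire GPU for the allocation (a sketch of the existing whole-device API):

resources {
    device "nvidia/gpu" {
        # count must be a whole number of devices; there is no way to
        # request a fraction of a GPU or to share one between allocations.
        count = 1
    }
}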

Some workarounds for you:

I had a chat with folks on the team about this problem, and as a result of that discussion I'm going to close this issue out as something we won't move forward on.

csicar commented 2 months ago

Thank you for the quick response, for considering this, and for giving me clarity about what Nomad will do.

I don't think the proposed workarounds are sufficient to solve the problem. The main reason is that dynamically allocating GPU tasks and letting the CUDA engine schedule them as needed is essential for effectively utilizing these expensive GPUs, so statically slicing GPUs won't cut it.

I think that for our use case, using custom labels and doing the allocation manually prior to scheduling may be the workaround we end up with.
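A minimal sketch of that kind of workaround, assuming an external allocator decides which jobs may share a GPU and tags clients with a custom node meta attribute (the `gpu_pool` name is made up for this example). The client config advertises the label, and the job constrains on it instead of using a device block:

# Client agent configuration: advertise an operator-managed label.
client {
    meta {
        gpu_pool = "shared-a100"
    }
}

# Jobspec (group or task level): place the task on a node from that pool.
# GPU memory is then partitioned outside of Nomad by the external allocator.
constraint {
    attribute = "${meta.gpu_pool}"
    value     = "shared-a100"
}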