hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

`resource.constraints` with non-exclusive device access #23703

Closed csicar closed 2 months ago

csicar commented 2 months ago

Proposal

Give resource constraints / affinities the ability to be non-exclusive.

Use-cases

Let's say I'd like to use a GPU with Docker containers. My resources section might look like this:

resources {
    …
    device "nvidia/gpu" {
        constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "2 GiB"
        }
    }
}

I basically want to say that my job (container) requires at least 2 GiB of free VRAM. Instead, at the moment, Nomad interprets this as "give the job exclusive access to an NVIDIA GPU with at least 2 GiB of memory", even if multiple jobs could be scheduled on a single GPU.
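To make the request concrete, the stanza below uses a hypothetical `exclusive` attribute that does not exist in Nomad today; it is only a sketch of how non-exclusive device access could be expressed:

resources {
    device "nvidia/gpu" {
        # Hypothetical attribute (not part of the current jobspec):
        # allow other allocations to be placed on the same physical GPU.
        exclusive = false

        constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "2 GiB"
        }
    }
}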

Attempted Solutions

tgross commented 2 months ago

Hi @csicar! For this to happen we'd need to model internal-to-device resources in the same way that we model those resources for whole hosts. Our current API for devices (ref Device Plugins) only supports whole devices. Fractional devices seem unlikely to make sense for devices other than GPUs, and they would be a fairly large lift to support in the scheduler.
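For context, the current device block only accepts whole-device counts, so even a minimal request like the one below claims the entire GPU for the allocation (a sketch of the existing whole-device API):

resources {
    device "nvidia/gpu" {
        # count must be a whole number of devices; there is no way to
        # request a fraction of a GPU or to share one between allocations.
        count = 1
    }
}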

Some workarounds for you:

I had a chat with folks on the team about this problem, and as a result of that discussion I'm going to close this issue out as something we won't move forward on.

csicar commented 2 months ago

Thank you for the quick response, for considering this, and for giving me clarity about what Nomad will do.

I don't think the proposed workarounds are sufficient to solve the problem. The main reason is that dynamically allocating GPU tasks and letting the CUDA engine schedule them as needed is essential for effectively utilizing these expensive GPUs, so statically slicing GPUs won't cut it.

I think that for our use case, using custom labels and doing the allocation manually prior to scheduling may be the workaround we end up with.
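A minimal sketch of that kind of workaround, assuming an external allocator decides which jobs may share a GPU and tags clients with a custom node meta attribute (the `gpu_pool` name is made up for this example). The client config advertises the label, and the job constrains on it instead of using a device block:

# Client agent configuration: advertise an operator-managed label.
client {
    meta {
        gpu_pool = "shared-a100"
    }
}

# Jobspec (group or task level): place the task on a node from that pool.
# GPU memory is then partitioned outside of Nomad by the external allocator.
constraint {
    attribute = "${meta.gpu_pool}"
    value     = "shared-a100"
}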