hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Feature request: allow multiple jobs to share a single GPU #6708

Open kcajf opened 5 years ago

kcajf commented 5 years ago

Currently it seems like when a GPU is allocated to a job, that GPU is reserved exclusively by that job for the duration of the job. This is a real problem, since on large GPUs (e.g. a 32GB Tesla) you often want to run several smaller processes side-by-side, each using a subset of the GPU memory and compute. I found a previous reference to this issue here: https://groups.google.com/forum/#!topic/nomad-tool/x5fYGt7bWdk, but it looks like nothing came of it.

Being able to schedule based on fine-grained GPU resources (even if those limits are not enforced, and are just used for scheduling / indicatively) would be a very valuable feature.
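For illustration, here is roughly what this would mean in jobspec terms. The first block is what exists today: the device stanza reserves whole cards exclusively, and the closest thing to fine-grained control is constraining on device attributes such as memory. The second block is a purely hypothetical sketch of the requested feature; the memory field inside the device block is an invented, indicative-only reservation and is not valid Nomad syntax.

# What exists today: request one whole GPU, optionally filtered by attributes.
resources {
  device "nvidia/gpu" {
    count = 1
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "8 GiB"
    }
  }
}

# Hypothetical sketch of the requested feature (NOT valid Nomad syntax):
# an indicative, non-enforced slice of a card used only for scheduling.
resources {
  device "nvidia/gpu" {
    count  = 1
    memory = "8 GiB"   # invented field: lets ~4 such tasks share a 32 GB card
  }
}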

D4GGe commented 4 years ago

Is this being worked on? It is a very important feature for our use case.

chinmay185 commented 3 years ago

@D4GGe We had a similar requirement. We solved it by running the job via the raw exec driver in Nomad (which launches docker-compose). In the docker-compose template, we run the desired number of containers (via the scale parameter). This way, we bypass the issue. One caveat: when running via raw exec, you need to handle SIGTERM yourself.
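A minimal sketch of that workaround, assuming the raw_exec driver is enabled on the client, docker-compose is installed there, a compose file already exists at /opt/myapp/docker-compose.yml, and gpu-worker is the service to scale (all of these names and paths are placeholders):

task "gpu-compose" {
  driver = "raw_exec"

  config {
    command = "/bin/bash"
    args    = ["${NOMAD_TASK_DIR}/run.sh"]
  }

  template {
    destination = "local/run.sh"
    perms       = "755"
    data        = <<EOF
#!/usr/bin/env bash
COMPOSE_FILE="/opt/myapp/docker-compose.yml"   # placeholder path

# Tear the compose stack down when Nomad signals the task to stop.
trap 'docker-compose -f "$COMPOSE_FILE" down' SIGTERM SIGINT

# Run several replicas of the GPU service side by side on the same card.
docker-compose -f "$COMPOSE_FILE" up --scale gpu-worker=3 &
wait $!
EOF
  }
}

If docker-compose down needs more than the default stop window, the task-level kill_timeout can be raised accordingly.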

kenoma commented 3 years ago

+1 for this feature; we need it in production for machine learning tasks.

montekristo1946 commented 3 years ago

We have a single GPU that we want to use across different tasks. Please implement GPU sharing.

kenoma commented 3 years ago

Any updates on this?

appland-streaming commented 2 years ago

Bump. We currently run with an ugly modification to the GPU driver; we would love to be able to do this!

RajashreeRavi commented 2 years ago

We'd also like to make use of this feature. @appland-streaming how did you work around this constraint?

johnnyplaydrums commented 2 years ago

@chinmay185 we're considering using the same workaround (raw exec driver with docker-compose). Has it fared well over the years? Any gotchas to be aware of?

imcom commented 2 years ago

This does not look like a difficult issue to resolve ... hard to believe it took 3 years and got nowhere ...

imcom commented 2 years ago

Hmmm I see why this is "difficult" to move forward ... this part ...

func (d *deviceAllocator) AssignDevice(ask *structs.RequestedDevice) (out *structs.AllocatedDeviceResource, score float64, err error) {
    // Try to hot path
    if len(d.Devices) == 0 {
        return nil, 0.0, fmt.Errorf("no devices available")
    }
    if ask.Count == 0 {
        return nil, 0.0, fmt.Errorf("invalid request of zero devices")
    }

It is so tightly coupled with all the other device types ... there is no special handling for GPUs ... and I guess the need to schedule multiple tasks on a single GPU is not strong enough to justify a relatively big change to this API ...

chinmay185 commented 2 years ago

@chinmay185 we're considering using the same workaround (raw exec driver with docker-compose). Has it fared well over the years? Any gotchas to be aware of?

Yes, it's worked very well. The only thing you need to be aware of and handle is the stop signal (SIGTERM): we run docker-compose down in the script when we receive SIGTERM. Apart from that, it has worked pretty well and has been in prod for more than 2 years now.

johnnyplaydrums commented 2 years ago

@tgross would be amazing if we didn't have to use raw_exec to work around this. Thank you for looking into it! Let me know if I can be of any assistance 🙏

imcom commented 2 years ago

I am planning to just remove the ask.Count < 0 check as a temporary workaround; not sure whether it will work or not... (Using docker-compose would cause extra trouble in my case, as we are running WebSocket servers that use the GPU, not pure computation workloads.) And then perhaps develop a more appropriate way to get around the GPU count issue. Brainstorming is more than welcome!

tgross commented 2 years ago

Thank you for looking into it! Let me know if I can be of any assistance

For clarity, I've only marked it as needs roadmapping because I noticed the issue wasn't classified correctly.

As noted, we don't have a good way of doing this kind of thing without breaking backwards compatibility with the existing devices API. A lot of implementation work would need to land over in the https://github.com/hashicorp/nomad-device-nvidia driver. We'd be happy to review PRs for this sort of thing but I'm going to be honest and say (in case it wasn't obvious) that this isn't a path that's highly prioritized for us right now.

imcom commented 2 years ago

A lot of implementation work would need to land over in the https://github.com/hashicorp/nomad-device-nvidia driver

Why is that? @tgross, would you please shed some light on your comment? I took a look at the device plugin part, and I believe that allowing multiple jobs to share a single GPU should not involve the driver side, no? The allocation failure happens in the scheduler, although I have had no luck so far tweaking the code in scheduler/device.go ...

tgross commented 2 years ago

Scheduling would need to get updated first, sure. But wouldn't the plugin need to know how to reserve a portion of the GPU as well? (i.e. handle changes to the Reserve API)

imcom commented 2 years ago

Scheduling would need to get updated first, sure. But wouldn't the plugin need to know how to reserve a portion of the GPU as well? (i.e. handle changes to the Reserve API)

Oh yeah, that's actually more involved than I initially expected. I guess in the short term we don't care much about the reservation itself, but would rather let the applications try their luck.

On the other hand, structs/device.go also checks for collisions in the allocation ... this really would take a tremendous amount of work to make happen ...

imcom commented 2 years ago

Wait ... the Reserve API assigns devices to the task environment. For a GPU, IMHO it does not seem feasible to reserve a portion of it anyway; Nvidia does not support that, IIUC. For a server with one GPU card, there is only one device ID (the GPU UUID) eligible to be assigned to the docker driver. What we need is simply to reuse this ID for multiple containers, like Docker Swarm does.

tgross commented 2 years ago

For a GPU, IMHO it does not seem feasible to reserve a portion of it anyway; Nvidia does not support that, IIUC.

I'm going to admit I'm not a GPGPU guru. But if you can't reserve portions of the card anyways, what does multiple "reservations" actually get you? You could add client node metadata saying "hey there's a card over here" and then use a constraint to make sure jobs that need the GPU land on that node, and then mount it as a device. Or am I missing something?
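A rough sketch of that suggestion, assuming a single GPU node and the same /dev/nvidia* device files used later in this thread (names and values are placeholders; note that imcom reports mixed results with this approach further down):

# Client config on the GPU node: advertise the card via node metadata.
client {
  meta {
    gpu = "nvidia"
  }
}

# Jobspec (job or group level): pin GPU tasks to that node and mount the
# device files directly, bypassing the exclusive "nvidia/gpu" device stanza.
constraint {
  attribute = "${meta.gpu}"
  value     = "nvidia"
}

task "gpu-task" {
  driver = "docker"

  config {
    image = "gpu/myapp:demo"   # placeholder image
    devices = [
      { host_path = "/dev/nvidia0" },
      { host_path = "/dev/nvidiactl" },
      { host_path = "/dev/nvidia-uvm" }
    ]
  }
}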

johnnyplaydrums commented 2 years ago

I'm going to admit I'm not a GPGPU guru. But if you can't reserve portions of the card anyways, what does multiple "reservations" actually get you? You could add client node metadata saying "hey there's a card over here" and then use a constraint to make sure jobs that need the GPU land on that node, and then mount it as a device. Or am I missing something?

@tgross Wow, that is a great idea and sounds like exactly the type of workaround we've been looking for! It's much better than using raw_exec with docker/docker-compose. I didn't realize this was possible; the ability to mount the GPU as a device is the step I was missing.

To confirm, mounting the GPU as a device will ensure that the GPU is visible in our GPU job containers and will enable us to schedule multiple GPU jobs on the same node, avoiding the 1-gpu-per-task constraint that we hit when using the device "nvidia/gpu" stanza with the nvidia plugin. Is my understanding correct?

imcom commented 2 years ago

@tgross Thanks for your reply. I actually tried the approach of mounting the GPU as a device, but it did not work out ... @johnnyplaydrums FYI.

I am not sure whether I did it correctly, but my previous attempt gave an error saying the driver is not supported, or something like that. What I did was mount all the nvidia-prefixed files under /dev into the container. Not sure what I missed. On the other hand, with the raw docker CLI we would need to specify --gpus rather than --device.

Have you had any luck with this approach? @johnnyplaydrums

tgross commented 2 years ago

To confirm, mounting the GPU as a device will ensure that the GPU is visible in our GPU job containers and will enable us to schedule multiple GPU jobs on the same node, avoiding the 1-gpu-per-task constraint that we hit when using the device "nvidia/gpu" stanza with the nvidia plugin. Is my understanding correct?

That should work unless I've missed something. Unfortunately I don't have an Nvidia GPU handy (or a good development setup for a cloud machine with one at the moment). @imcom seems to have run into some issues; it might help if you share your jobspec @imcom

imcom commented 2 years ago

@tgross I reverted my previous jobspec so I do not have it. I will try to replicate the issue tomorrow and then share my spec. It would be really nice if the mount option would work.

imcom commented 2 years ago

I got this from our Vulkan-based app:

FATAL: failed to create instance (ERROR_INCOMPATIBLE_DRIVER)
FATAL: failed to create instance (ERROR_INCOMPATIBLE_DRIVER)
FATAL: failed to create instance (ERROR_INCOMPATIBLE_DRIVER)

And my job spec is as follows:

job "myapp" {
  datacenters = ["dc1"]
  type = "service"

  group "myapp" {
    count = 3
    network {
        port "http" {}
        port "grpc" {}
    }

    task "myapp_proxy" {
      lifecycle {
        hook = "poststart"
        sidecar = true
      }

      env {
        DEBUG="true"
      }

      driver = "docker"
      config {
        image = "myapp:demo"
        ports = ["http"]
        args = [
          "serve",
          "--config",
          "/proxy.yaml",
          "--port",
          "${NOMAD_PORT_http}",
          "--export-metrics",
          "--conn-timeout=0",
          "-t"
        ]
      }
      resources {
        cpu = 10000
        memory = 2000
      }
    }

    task "myapp" {
      driver = "docker"

      env {
        NVIDIA_DRIVER_CAPABILITIES = "all"
      }

      config {
        image = "gpu/myapp:demo"
        command = "mygpuapp"
        ports = ["grpc"]
        devices = [
            {
                host_path = "/dev/nvidia0"
            },
            {
                host_path = "/dev/nvidiactl"
            },
            {
                host_path = "/dev/nvidia-modeset"
            },
            {
                host_path = "/dev/nvidia-uvm"
            },
            {
                host_path = "/dev/nvidia-uvm-tools"
            }
        ]
        args = [
          "--headless",
          "--ipc_listen_addr",
          "0.0.0.0:${NOMAD_PORT_grpc}"
        ]
      }

      resources {
        cpu = 10000
        memory = 12000
      }
    }
  }
}

Honestly speaking, I do not know whether my docker devices config is correct or not. Any help would be greatly appreciated!

alexgornov commented 1 year ago

Is there any progress on this issue?

jrasell commented 1 year ago

Hi @alexgornov, there is no update, unfortunately. If/when there is, a member of the Nomad team will comment on the issue.

Eyald6 commented 1 year ago

You can use a privileged container, so it will be able to access all GPUs on the host, if that is helpful.
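For reference, a minimal sketch of that approach, assuming the client's docker plugin is configured to allow privileged containers (the image name is a placeholder):

# Client config: the docker plugin must explicitly allow privileged containers.
plugin "docker" {
  config {
    allow_privileged = true
  }
}

# Task config: a privileged container can access all host devices, GPUs included.
task "gpu-task" {
  driver = "docker"

  config {
    image      = "gpu/myapp:demo"
    privileged = true
  }
}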

illyakaynov commented 1 year ago

Would really love to have this feature. Currently we need to hack it in and assign GPUs manually, which is not so nice. This missing feature is the only reason I am considering switching to k8s.

ocharles commented 1 year ago

Could we perhaps get a workaround in the meantime that would allow us to simply overprovision? Essentially, I can imagine telling Nomad: "I need a GPU, but don't worry about fairly scheduling tasks across it." I could then use node affinity to manually distribute my workload. In reality, we actually just have a single machine with an A100 that we want to run all CUDA-compatible services on.

albertoperdomo2 commented 1 year ago

What worked for me, just as a workaround, is to not specify the GPU via Nomad's resources stanza and instead configure Docker to run all containers with the GPU when available. The jobs that run on that machine can be controlled using node_pool, so I make sure that only jobs needing the GPU land there. Basically, my /etc/docker/daemon.json looks like this:

{
  "live-restore": true,
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "default-runtime": "nvidia"
}

Of course, you need to install the nvidia-container-runtime beforehand.
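A per-job variant of the same idea, assuming the nvidia runtime registered above and the Nomad docker driver's runtime option: only the GPU jobs select the nvidia runtime, and NVIDIA_VISIBLE_DEVICES controls which cards (or all of them) those containers see. The image name is a placeholder.

task "gpu-task" {
  driver = "docker"

  env {
    # Expose all cards (or a specific GPU UUID) to this container.
    NVIDIA_VISIBLE_DEVICES     = "all"
    NVIDIA_DRIVER_CAPABILITIES = "all"
  }

  config {
    image   = "gpu/myapp:demo"   # placeholder image
    runtime = "nvidia"           # the runtime registered in daemon.json above
  }
}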

leoagafonov commented 11 months ago

Hi, Is there any progress on this issue?