Open · kcajf opened this issue 5 years ago
Is this being worked on? It is a very important feature for our use case.
@D4GGe We had a similar requirement. We solved it by running the job via the raw_exec driver in Nomad (with docker-compose). In the docker-compose template, we run the desired number of containers (via the scale
param). This way, we bypass this issue. Note that when running via raw_exec, you need to worry about handling SIGTERM yourself.
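For reference, a rough sketch of the Nomad side of this setup; the job name, paths, and wrapper script are placeholders, and raw_exec has to be enabled on the client:

job "gpu-workers" {
  datacenters = ["dc1"]

  group "workers" {
    task "compose" {
      driver = "raw_exec"

      config {
        # Placeholder wrapper script: it runs `docker-compose up --scale worker=N`
        # and calls `docker-compose down` when it receives SIGTERM.
        command = "/opt/gpu-workers/run.sh"
      }
    }
  }
}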
+1 for this feature; we need it in production for machine learning tasks.
We have one GPU that we want to use across different tasks. Please implement GPU reuse functionality.
Any updates on this?
Bump. We currently run with an ugly mod to the GPU driver; would love to be able to do this natively!
We'd also like to make use of this feature. @appland-streaming how did you work around this constraint?
@chinmay185 we're considering using your same workaround (raw exec driver with docker-compose). Has it fared well over the years? any gotchas to be aware of?
This does not look like a difficult issue to resolve ... hard to believe it took 3 years and got nowhere ...
Hmmm I see why this is "difficult" to move forward ... this part ...
func (d *deviceAllocator) AssignDevice(ask *structs.RequestedDevice) (out *structs.AllocatedDeviceResource, score float64, err error) {
	// Try to hot path
	if len(d.Devices) == 0 {
		return nil, 0.0, fmt.Errorf("no devices available")
	}
	if ask.Count == 0 {
		return nil, 0.0, fmt.Errorf("invalid request of zero devices")
	}
It is so tightly coupled with all the other device types ... there is no special handling for GPUs ... and I guess the need to schedule multiple tasks on a single GPU is not strong enough to justify a relatively big change to this API ...
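For reference, this is the user-facing side of that API: a device "nvidia/gpu" request in a task's resources block, which the allocator above satisfies by handing whole device instances exclusively to that task (a minimal, hypothetical example):

job "gpu-task" {
  datacenters = ["dc1"]

  group "gpu" {
    task "train" {
      driver = "docker"

      config {
        image = "example/train:latest"
      }

      resources {
        cpu    = 1000
        memory = 1024

        # The whole device instance is reserved for this task, so a second
        # task asking for a GPU cannot share the same card.
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}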
@chinmay185 we're considering using your same workaround (raw exec driver with docker-compose). Has it fared well over the years? any gotchas to be aware of?
Yes, it's worked very well. The only thing you need to be aware of/handle is the exit signal (SIGTERM). We run docker-compose down
in the script when we receive SIGTERM. Apart from that, it has worked pretty well. It's been in prod for more than 2 years now.
@tgross it would be amazing if we didn't have to use raw_exec to work around this. Thank you for looking into it! Let me know if I can be of any assistance 🙏
I am planning to just remove the ask.Count < 0 check as a temporary workaround; not sure whether it will work or not... (Using docker-compose would cause extra trouble in my case, as we are running websocket servers with GPUs, not pure computation workloads...) And then perhaps develop an appropriate way to get around the GPU count issue. Brainstorming is more than welcome!
Thank you for looking into it! Let me know if I can be of any assistance
For clarity, I've only marked it as needs roadmapping because I noticed the issue wasn't classified correctly.
As noted, we don't have a good way of doing this kind of thing without breaking backwards compatibility with the existing devices API. A lot of implementation work would need to land over in the https://github.com/hashicorp/nomad-device-nvidia driver. We'd be happy to review PRs for this sort of thing but I'm going to be honest and say (in case it wasn't obvious) that this isn't a path that's highly prioritized for us right now.
A lot of implementation work would need to land over in the https://github.com/hashicorp/nomad-device-nvidia driver
Why is this? @tgross, would you please shed some light on your comment? I took a look at the device plugin part, and I believe allowing multiple jobs to share a single GPU should not involve the driver part, no? The allocation failure happens in the scheduler, although I've had no luck so far tweaking the code in scheduler/device.go
...
Scheduling would need to get updated first, sure. But wouldn't the plugin need to know how to reserve a portion of the GPU as well? (i.e. handle changes to the Reserve API)
Scheduling would need to get updated first, sure. But wouldn't the plugin need to know how to reserve a portion of the GPU as well? (i.e. handle changes to the Reserve API)
Oh yeah, that's actually more involved than I initially expected. I guess in the short term we don't care much about the reservation, and would rather let the applications try their luck on this.
On the other hand ... structs/device.go
also checks for collisions for the allocation ... this would truly involve tremendous work to make happen ...
Wait ... the Reserve API assigns devices to the task environment. For a GPU, IMHO it does not seem feasible to reserve a portion of it anyway; Nvidia does not support that, IIUC. For a server with one GPU card, there is only one device ID (GPU UUID) eligible to be assigned to the docker driver. What we need is just to reuse this ID for multiple containers, like Docker Swarm
would do.
IMHO it does not seem feasible to reserve a portion of it anyway; Nvidia does not support that, IIUC.
I'm going to admit I'm not a GPGPU guru. But if you can't reserve portions of the card anyways, what does multiple "reservations" actually get you? You could add client node metadata saying "hey there's a card over here" and then use a constraint
to make sure jobs that need the GPU land on that node, and then mount it as a device. Or am I missing something?
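A rough sketch of that workaround; the meta key, image, and device paths are placeholders, and the follow-up comments below report mixed results with plain device mounts:

# Client config on the GPU node (placeholder meta key):
client {
  meta {
    gpu = "true"
  }
}

# Jobspec: constrain to that node and mount the card as an ordinary device.
job "gpu-app" {
  datacenters = ["dc1"]

  group "app" {
    constraint {
      attribute = "${meta.gpu}"
      value     = "true"
    }

    task "app" {
      driver = "docker"

      config {
        image = "example/gpu-app:latest"

        devices = [
          { host_path = "/dev/nvidia0" },
          { host_path = "/dev/nvidiactl" },
          { host_path = "/dev/nvidia-uvm" }
        ]
      }
    }
  }
}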
I'm going to admit I'm not a GPGPU guru. But if you can't reserve portions of the card anyways, what does multiple "reservations" actually get you? You could add client node metadata saying "hey there's a card over here" and then use a
constraint
to make sure jobs that need the GPU land on that node, and then mount it as a device. Or am I missing something?
@tgross Wow that is a great idea and sounds like exactly the type of workaround we've been looking for! And is much better than using raw_exec with docker/docker-compose. I didn't realize this was possible, the ability to mount the GPU as a device is the step I was missing.
To confirm, mounting the GPU as a device will ensure that the GPU is visible in our GPU job containers and will enable us to schedule multiple GPU jobs on the same node, avoiding the 1-gpu-per-task constraint that we hit when using the device "nvidia/gpu"
stanza with the nvidia plugin. Is my understanding correct?
@tgross Thanks for your reply. I actually tried the approach of mounting the GPU as a device, but it did not work out ... @johnnyplaydrums FYI.
I am not sure whether I did it correctly, but my previous attempt gave me an error saying the driver is not supported, or something like that. What I did was mount all the nvidia-prefixed
files under /dev
into the container. Not sure what I missed. On the other hand, when using the raw docker CLI we need to specify --gpus
instead, not --device
Did you have any luck with this approach, @johnnyplaydrums?
To confirm, mounting the GPU as a device will ensure that the GPU is visible in our GPU job containers and will enable us to schedule multiple GPU jobs on the same node, avoiding the 1-gpu-per-task constraint that we hit when using the
device "nvidia/gpu"
stanza with the nvidia plugin. Is my understanding correct?
That should work unless I've missed something. Unfortunately I don't have an Nvidia GPU handy (or a good development setup for a cloud machine with one at the moment). @imcom seems to have run into some issues; it might help if you share your jobspec, @imcom
@tgross I reverted my previous jobspec so I do not have it. I will try to replicate the issue tomorrow and then share my spec. It would be really nice if the mount option would work.
I got this from our vulkan-based app
FATAL: failed to create instance (ERROR_INCOMPATIBLE_DRIVER)
FATAL: failed to create instance (ERROR_INCOMPATIBLE_DRIVER)
FATAL: failed to create instance (ERROR_INCOMPATIBLE_DRIVER)
And my job spec is as follows:
job "myapp" {
datacenters = ["dc1"]
type = "service"
group "myapp" {
count = 3
network {
port "http" {}
port "grpc" {}
}
task "myapp_proxy" {
lifecycle {
hook = "poststart"
sidecar = true
}
env {
DEBUG="true"
}
driver = "docker"
config {
image = "myapp:demo"
ports = ["http"]
args = [
"serve",
"--config",
"/proxy.yaml",
"--port",
"${NOMAD_PORT_http}",
"--export-metrics",
"--conn-timeout=0",
"-t"
]
}
resources {
cpu = 10000
memory = 2000
}
}
task "myapp" {
driver = "docker"
env {
NVIDIA_DRIVER_CAPABILITIES = "all"
}
config {
image = "gpu/myapp:demo"
command = "mygpuapp"
ports = ["grpc"]
devices = [
{
host_path = "/dev/nvidia0"
},
{
host_path = "/dev/nvidiactl"
},
{
host_path = "/dev/nvidia-modeset"
},
{
host_path = "/dev/nvidia-uvm"
},
{
host_path = "/dev/nvidia-uvm-tools"
}
]
args = [
"--headless",
"--ipc_listen_addr",
"0.0.0.0:${NOMAD_PORT_grpc}",
]
}
resources {
cpu = 10000
memory = 12000
}
}
}
}
Honestly speaking, I do not know whether my docker devices config is correct or not. Any help would be greatly appreciated!
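One variant that might be worth trying (a sketch, not verified here): if the nvidia-container-runtime is installed and registered in /etc/docker/daemon.json, the docker driver's runtime option can select it per task instead of mounting /dev/nvidia* manually:

task "myapp" {
  driver = "docker"

  env {
    # Read by the nvidia runtime to expose GPUs and inject the driver libraries.
    NVIDIA_VISIBLE_DEVICES     = "all"
    NVIDIA_DRIVER_CAPABILITIES = "all"
  }

  config {
    image   = "gpu/myapp:demo"
    command = "mygpuapp"
    ports   = ["grpc"]

    # Use the nvidia container runtime for this container.
    runtime = "nvidia"
  }

  resources {
    cpu    = 10000
    memory = 12000
  }
}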
Is there any progress on this issue?
Hi @alexgornov, there is no update, unfortunately. If/when there is, a member of the Nomad team will comment on the issue.
You can use a privileged container, so it will be able to access all GPUs on the host, if that is helpful.
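For reference, a minimal sketch of that approach; it assumes privileged containers are allowed in the client's docker plugin config, and the image name is a placeholder:

# Client config: allow privileged containers for the docker driver.
plugin "docker" {
  config {
    allow_privileged = true
  }
}

# Task: run privileged so the container can see the host's GPU device nodes.
task "gpu-task" {
  driver = "docker"

  config {
    image      = "example/gpu-app:latest"
    privileged = true
  }
}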
Would really love to have this feature. Currently I need to hack it in and assign GPUs manually, which is not so nice. This missing feature is the only reason why I am considering switching to k8s.
Could we perhaps get a workaround in the meantime that would allow us to simply overprovision? Essentially I can imagine telling Nomad: "I need a GPU, but don't worry about fairly scheduling tasks across it." I could then use node affinity to manually distribute my workload. In reality, we just have a single machine with an A100 that we want to run all CUDA-compatible services on.
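For illustration, the affinity part of that idea might look like this; the node name and image are placeholders, and Nomad would do no GPU accounting at all here:

job "cuda-service" {
  datacenters = ["dc1"]

  group "svc" {
    # Prefer the single GPU machine (use a constraint instead to require it).
    affinity {
      attribute = "${node.unique.name}"
      value     = "gpu-box-01"
      weight    = 100
    }

    task "svc" {
      driver = "docker"

      config {
        image = "example/cuda-service:latest"
      }
    }
  }
}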
What worked for me, just as a workaround, is to not specify the GPU using Nomad's resources stanza and instead configure Docker to run all containers with the GPU when available. The jobs that run on that machine can be controlled using node_pool,
so I make sure that only jobs needing the GPU land there. Basically, my /etc/docker/daemon.json
looks like this:
{
  "live-restore": true,
  "runtimes": {
    "nvidia": {
      "args": [],
      "path": "nvidia-container-runtime"
    }
  },
  "default-runtime": "nvidia"
}
Of course, you need to install the nvidia-container-runtime beforehand.
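For reference, the job side of that setup might look something like this; the pool and image names are placeholders, and node_pool requires a recent Nomad version:

job "gpu-inference" {
  # Placeholder node pool containing only the clients with the daemon.json above.
  node_pool   = "gpu"
  datacenters = ["dc1"]

  group "inference" {
    task "inference" {
      driver = "docker"

      config {
        # No device "nvidia/gpu" stanza; the default nvidia runtime exposes the GPU.
        image = "example/inference:latest"
      }
    }
  }
}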
Hi, is there any progress on this issue?
Currently it seems like when a GPU is allocated to a job, that GPU is reserved exclusively by that job for the duration of the job. This is a real problem, since on large GPUs (e.g. a 32GB Tesla) you often want to run several smaller processes side-by-side, each using a subset of the GPU memory and compute. I found a previous reference to this issue here: https://groups.google.com/forum/#!topic/nomad-tool/x5fYGt7bWdk, but it looks like nothing came of it.
Being able to schedule based on fine-grained GPU resources (even if those limits are not enforced, and are just used for scheduling / indicatively) would be a very valuable feature.