Open geekodour opened 4 months ago
Even with `nvidia-container-toolkit` in place, I'm getting:
Failed to start container 26aa869a48b55da90438c56c5b41a831a270107da805c123786e831ee8b2615f: API error (500): failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/dcfl52x9s397zkky85kass0liyky1i57-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown
So I experimented with a couple of different configurations. What seems to work for now:

- `virtualisation.docker.enableNvidia = true;` (this is being deprecated here: https://github.com/NixOS/nixpkgs/issues/322400)
- `nomad-docker` overlay from `nix:23.11`, as described here: https://discourse.nixos.org/t/nvidia-container-runtime-exit-status-125-unknown/48306/3

So nomad doesn't seem to work with the latest nvidia-docker configuration set in nix. What they're moving towards is CDI, deprecating the use of `runtime:nvidia`, which is something I think `nomad` makes use of.
I think there are a few actions out of this:

- Keep `virtualisation.docker.enableNvidia` till we fix this, because that's the only straw that seems to be holding things together for now when it comes to making it work with nomad-device-nvidia
- Update `nomad` to use CDI instead of `runtime:nvidia`
I'll be happy to work on changes on the nomad side of things if that's the way we want to go forward.
cc: @tgross
I am trying to run nomad+docker+nvidia+nixos. The drivers are installed, the https://github.com/hashicorp/nomad-device-nvidia plugin is set up correctly, and the GPU is getting fingerprinted correctly. `nvidia-container-toolkit` is also installed, and I am able to access the GPU from a container directly using `docker run`, but unable to access it from nomad.

What happened
I am running the following as a debug job:
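As an illustration, a minimal debug job of this kind might look like the sketch below. The job, group, and task names are hypothetical and this is not necessarily the exact file behind this report. It runs `nvidia-smi` in the same CUDA image as the `docker run` command further down and requests one GPU through a `device` block.

```hcl
# Illustrative sketch only: "gpu-debug", "debug" and "nvidia-smi" are
# assumed names, not taken from the original report.
job "gpu-debug" {
  type = "batch"

  group "debug" {
    task "nvidia-smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:12.0.0-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        # As far as I can tell, the device name is given as the block
        # label ("nvidia/gpu") rather than as a separate attribute.
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
```

With a job like this, GPU access depends on how the Docker driver and nomad-device-nvidia hand the device to the container, not on a `--device` flag passed by hand.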
Running: `nomad job run debug.hcl`

I think the issue is more around `docker <-> nvidia-container-toolkit`, but since

`docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi`

works as expected while running the same in nomad gives back an error, I am creating an issue here.

Also, it seems like the `name` attribute mentioned here: https://developer.hashicorp.com/nomad/docs/v1.6.x/job-specification/device#device-parameters does not work with HCL2? I tried setting it with no luck. I will try looking into the sources and post updates if I find something interesting.

A related issue: