hashicorp / nomad-device-nvidia

Nomad device driver for Nvidia GPU

error: requested docker runtime "nvidia" was not found #48

Open geekodour opened 2 months ago

geekodour commented 2 months ago

I am trying to run Nomad + Docker + NVIDIA on NixOS. The drivers are installed, the https://github.com/hashicorp/nomad-device-nvidia plugin is set up correctly, and the GPU is being fingerprinted correctly. nvidia-container-toolkit is also installed, and I am able to access the GPU from a container directly using docker run, but not from Nomad.
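For reference, the Nomad client-side plugin configuration looks roughly like this (a minimal sketch following the nomad-device-nvidia README; fingerprint_period is just the documented default):

  plugin "nomad-device-nvidia" {
    config {
      enabled            = true
      fingerprint_period = "1m"
    }
  }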

what happened

I am running the following as a debug job:

#file: debug.hcl
job "gpu-smi" {
  group "gpu-smi" {
    task "gpu-smi" {
      driver = "docker"

      config {
        # docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi  # docker 24
        # docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi  # docker 25
        image   = "nvidia/cuda:12.0.0-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}

Running nomad job run debug.hcl fails with:

Driver Failure: Failed to create container configuration for image "nvidia/cuda:12.0.0-base-ubuntu20.04" 
("sha256:612aabcfe23834dde204beebc9f24dd8b8180479bfd45bdeada5ee9613997955"): requested docker runtime
"nvidia" was not found

I think the issue is more around the docker <-> nvidia-container-toolkit interaction, but since docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi works as expected while the same workload run through Nomad errors out, I am creating an issue here. Also, the name attribute mentioned here: https://developer.hashicorp.com/nomad/docs/v1.6.x/job-specification/device#device-parameters does not seem to work with HCL2; I tried setting it with no luck. I will try looking into the sources and post updates if I find something interesting.
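(If I am reading those docs right, the name is passed as the device block label rather than as a separate attribute; a sketch, where the model string is purely illustrative:)

  resources {
    # the name is the block label, in vendor/type/model form
    device "nvidia/gpu/Tesla T4" {
      count = 1
    }
  }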

A related issue:

geekodour commented 2 months ago

Even with nvidia-container-toolkit in place, I am getting:

Failed to start container 26aa869a48b55da90438c56c5b41a831a270107da805c123786e831ee8b2615f: API error (500): failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/dcfl52x9s397zkky85kass0liyky1i57-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown
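Two quick sanity checks for anyone hitting the same wall (standard Docker/NVIDIA tooling; the store path is the one from the error above):

  # list the runtimes the Docker daemon actually knows about
  docker info --format '{{json .Runtimes}}'

  # invoke the wrapped runtime by hand to see why it exits 125
  /nix/store/dcfl52x9s397zkky85kass0liyky1i57-nvidia-docker/bin/nvidia-container-runtime --version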
geekodour commented 2 months ago

So I experimented with a couple of different configurations; what seems to work for now:

  1. virtualisation.docker.enableNvidia = true; (this is being deprecated here: https://github.com/NixOS/nixpkgs/issues/322400; see the sketch after this list)
  2. Using the nomad-docker overlay from nixpkgs 23.11 as described here: https://discourse.nixos.org/t/nvidia-container-runtime-exit-status-125-unknown/48306/3
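Concretely, the NixOS side of the working setup looks roughly like this (a sketch; option names are the nixos-23.11 ones, and enableNvidia is the deprecated option from item 1):

  # configuration.nix (sketch)
  virtualisation.docker.enable = true;
  virtualisation.docker.enableNvidia = true; # deprecated, see NixOS/nixpkgs#322400
  hardware.opengl.enable = true;             # userspace driver libraries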

So Nomad doesn't seem to work with the latest nvidia-docker configuration that nixpkgs sets up. Upstream is moving towards CDI and deprecating the runtime: nvidia mechanism, which I think is what Nomad makes use of.

I think there are a few action items coming out of this:

  1. Make sure we don't completely remove virtualisation.docker.enableNvidia until we fix this, because for now that's the only straw holding things together when it comes to making nomad-device-nvidia work.
  2. See if we can adapt Nomad to use CDI instead of runtime: nvidia (a sketch of the CDI flow follows this list).
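For context, the CDI flow that Docker 25+ and the current nvidia-container-toolkit expect looks roughly like this (nvidia-ctk ships with the toolkit; /etc/cdi is the conventional spec location):

  # generate a CDI spec describing the installed GPUs
  sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

  # devices are then requested by CDI name, with no "nvidia" runtime involved
  docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi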

I'll be happy to work on the changes on the Nomad side of things if that's the direction we want to go.

cc: @tgross