hashicorp / nomad-device-nvidia

Nomad device driver for Nvidia GPU

error: requested docker runtime "nvidia" was not found #48

Open geekodour opened 2 months ago

geekodour commented 2 months ago

I am trying to run Nomad + Docker + NVIDIA on NixOS. The drivers are installed, the https://github.com/hashicorp/nomad-device-nvidia plugin is set up correctly, and the GPU is being fingerprinted correctly. nvidia-container-toolkit is also installed, and I am able to access the GPU from a container directly using docker run, but not from Nomad.
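For reference, the Nomad client-side plugin configuration looks roughly like this (a minimal sketch following the nomad-device-nvidia README; fingerprint_period is just the documented default):

  plugin "nomad-device-nvidia" {
    config {
      enabled            = true
      fingerprint_period = "1m"
    }
  }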

what happened

I am running the following as a debug job:

#file: debug.hcl
job "gpu-smi" {
  group "gpu-smi" {
    task "gpu-smi" {
      driver = "docker"

      config {
        # docker run --rm --gpus all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi  # docker 24
        # docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi  # docker 25
        image   = "nvidia/cuda:12.0.0-base-ubuntu20.04"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}

Running nomad job run debug.hcl fails with:

Driver Failure: Failed to create container configuration for image "nvidia/cuda:12.0.0-base-ubuntu20.04" 
("sha256:612aabcfe23834dde204beebc9f24dd8b8180479bfd45bdeada5ee9613997955"): requested docker runtime
"nvidia" was not found

I think the issue is more around the docker <-> nvidia-container-toolkit interaction, but since docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi works as expected while the same workload run through Nomad errors out, I am creating an issue here. Also, the name attribute mentioned here: https://developer.hashicorp.com/nomad/docs/v1.6.x/job-specification/device#device-parameters does not seem to work with HCL2; I tried setting it with no luck. I will try looking into the sources and post updates if I find something interesting.
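(If I am reading those docs right, the name is passed as the device block label rather than as a separate attribute; a sketch, where the model string is purely illustrative:)

  resources {
    # the name is the block label, in vendor/type/model form
    device "nvidia/gpu/Tesla T4" {
      count = 1
    }
  }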

A related issue:

geekodour commented 2 months ago

Even with nvidia-container-toolkit in place, I am getting:

Failed to start container 26aa869a48b55da90438c56c5b41a831a270107da805c123786e831ee8b2615f: API error (500): failed to create task for container: failed to create shim task: OCI runtime create failed: /nix/store/dcfl52x9s397zkky85kass0liyky1i57-nvidia-docker/bin/nvidia-container-runtime did not terminate successfully: exit status 125: unknown
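Two quick sanity checks for anyone hitting the same wall (standard Docker/NVIDIA tooling; the store path is the one from the error above):

  # list the runtimes the Docker daemon actually knows about
  docker info --format '{{json .Runtimes}}'

  # invoke the wrapped runtime by hand to see why it exits 125
  /nix/store/dcfl52x9s397zkky85kass0liyky1i57-nvidia-docker/bin/nvidia-container-runtime --version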
geekodour commented 2 months ago

So I experimented with a couple of different configurations; what seems to work for now:

  1. virtualisation.docker.enableNvidia = true; (this is being deprecated here: https://github.com/NixOS/nixpkgs/issues/322400; see the sketch after this list)
  2. Using the nomad-docker overlay from nixpkgs 23.11 as described here: https://discourse.nixos.org/t/nvidia-container-runtime-exit-status-125-unknown/48306/3
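Concretely, the NixOS side of the working setup looks roughly like this (a sketch; option names are the nixos-23.11 ones, and enableNvidia is the deprecated option from item 1):

  # configuration.nix (sketch)
  virtualisation.docker.enable = true;
  virtualisation.docker.enableNvidia = true; # deprecated, see NixOS/nixpkgs#322400
  hardware.opengl.enable = true;             # userspace driver libraries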

So Nomad doesn't seem to work with the latest nvidia-docker configuration that nixpkgs sets up. Upstream is moving towards CDI and deprecating the runtime: nvidia mechanism, which I think is what Nomad makes use of.

I think there are a few action items coming out of this:

  1. Make sure we don't completely remove virtualisation.docker.enableNvidia until we fix this, because for now that's the only straw holding things together when it comes to making nomad-device-nvidia work.
  2. See if we can adapt Nomad to use CDI instead of runtime: nvidia (a sketch of the CDI flow follows this list).
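For context, the CDI flow that Docker 25+ and the current nvidia-container-toolkit expect looks roughly like this (nvidia-ctk ships with the toolkit; /etc/cdi is the conventional spec location):

  # generate a CDI spec describing the installed GPUs
  sudo nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml

  # devices are then requested by CDI name, with no "nvidia" runtime involved
  docker run --rm --device=nvidia.com/gpu=all nvidia/cuda:12.0.0-base-ubuntu20.04 nvidia-smi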

I'll be happy to work on the changes on the Nomad side of things if that's the direction we want to go.

cc: @tgross