hashicorp / nomad-device-nvidia

Nomad device driver for Nvidia GPU
Mozilla Public License 2.0

undefined symbol: nvmlGpuInstanceGetComputeInstanceProfileInfoV when trying to build in nix #47

Closed: geekodour closed this issue 1 month ago

geekodour commented 2 months ago

I was trying to build it for my local Nix packages with the following derivation:

# see https://github.com/NixOS/nixpkgs/pull/304108
# see https://github.com/hashicorp/nomad-device-nvidia
{ lib, buildGoModule, fetchFromGitHub }:

buildGoModule rec {
  pname = "nomad-device-nvidia";
  version = "66fe3a14e471f4844dffa13ada3c6fdadcd98ab7"; # Jul 10, 2024

  src = fetchFromGitHub {
    owner = "hashicorp";
    repo = pname;
    rev = "${version}";
    sha256 = "sha256-2zdTslzWnaWg3I4bijYIU+nBDsab25iVO8x7v5ymamM=";
  };

  vendorHash = "sha256-h2qp/wHlvqiNLl6dw7UD+/G0iPfdEj8KsACCRMSUYaI=";

  subPackages = [ "." ];

  meta = with lib; {
    homepage = "https://github.com/hashicorp/nomad-device-nvidia";
    description = "Nomad device plugin for Nvidia GPUs";
    mainProgram = "nomad-device-nvidia";
    platforms = platforms.linux;
    license = licenses.mpl20;
    maintainers = with maintainers; [ geekodour ];
  };
}

But then I get the error:

error: builder for '/nix/store/lk3fq47dx04hj413znvaf8xi0b3p3n5h-nomad-device-nvidia-66fe3a14e471f4844dffa13ada3c6fdadcd98ab7.drv' failed with exit code 1;
       last 10 log lines:
       > source root is source
       > Running phase: patchPhase
       > Running phase: updateAutotoolsGnuConfigScriptsPhase
       > Running phase: configurePhase
       > Running phase: buildPhase
       > Building subPackage ./.
       > Running phase: checkPhase
       > /build/go-build727756788/b001/nomad-device-nvidia.test: symbol lookup error: /build/go-build727756788/b001/nomad-device-nvidia.test: undefined symbol: nvmlGpuInstanceGetComputeInstanceProfileInfoV
       > FAIL   github.com/hashicorp/nomad-device-nvidia        0.001s
       > FAIL
       For full logs, run 'nix log /nix/store/lk3fq47dx04hj413znvaf8xi0b3p3n5h-nomad-device-nvidia-66fe3a14e471f4844dffa13ada3c6fdadcd98ab7.drv'.

I am still trying to get this to work and will post updates. Let me know if you have any suggestions for what might fix this.

I think https://github.com/hashicorp/nomad-device-nvidia/issues/34 might be related.

geekodour commented 2 months ago

Current workaround:

  doCheck = false;

EDIT: This did not solve the issue. It simply skipped the tests, and the binary that gets built now fails with the following:

./cmd: symbol lookup error: ./cmd: undefined symbol: nvmlGpuInstanceGetComputeInstanceProfileInfoV

I spent some time trying to figure out what was wrong, since I was using a fairly straightforward buildGoModule. After a while I gave up on solving this directly and thought that vendoring the dependencies might simplify the problem, so I created a fork (https://github.com/geekodour/nomad-device-nvidia) with the dependencies vendored (go mod tidy, go mod vendor).

I verified that vendor/github.com/NVIDIA/go-nvml/pkg/nvml/ exists in my fork. This is the package imported here: https://github.com/hashicorp/nomad-device-nvidia/blob/66fe3a14e471f4844dffa13ada3c6fdadcd98ab7/nvml/driver_linux.go#L11

Now when I run nix build, I get:

λ nix build --impure --no-link --print-out-paths .#nomad-device-nvidia
path '/home/geekodour/x/newnixsetup/pkgs' does not contain a 'flake.nix', searching up
warning: Git tree '/home/geekodour/x' is dirty
error: builder for '/nix/store/yi0xqzsydd9xs4a1j0l7s4vi85wdv77k-nomad-device-nvidia-8598e31a0a38a9ed5e14451cf86ab8a8211ab98b.drv' failed with exit code 1;
       last 9 log lines:
       > Running phase: unpackPhase
       > unpacking source archive /nix/store/32hmlbaxdcspv9qq1rk4v1i60h4lws9y-source
       > source root is source
       > Running phase: patchPhase
       > Running phase: updateAutotoolsGnuConfigScriptsPhase
       > Running phase: configurePhase
       > Running phase: buildPhase
       > Building subPackage ./cmd
       > nvml/driver_linux.go:11:2: cannot find module providing package github.com/NVIDIA/go-nvml/pkg/nvml: import lookup disabled by -mod=vendor
       For full logs, run 'nix log /nix/store/yi0xqzsydd9xs4a1j0l7s4vi85wdv77k-nomad-device-nvidia-8598e31a0a38a9ed5e14451cf86ab8a8211ab98b.drv'.

This does not make any sense to me. Digging through more issues, I found a related one where the problems were caused by the use of uppercase letters: https://github.com/NixOS/nixpkgs/issues/273998#issuecomment-1936601932

At this point I am clueless, so I am re-opening the issue, even though it is not directly related to nomad-device-nvidia (running the Makefile commands directly works absolutely fine). It is more of a Nix issue at this point, or me messing something up.

Full error (when using the vendored modules):

warning: The interpretation of store paths arguments ending in `.drv` recently changed. If this command is now failing try again with '/nix/store/qlm42n6c6wl514fg0bdfdl1f022axlrg-nomad-device-nvidia-8598e31a0a38a9ed5e14451cf86ab8a8211ab98b.drv^*'
Sourcing auto-add-driver-runpath-hook
Using autoAddDriverRunpath
Sourcing fix-elf-files.sh
@nix { "action": "setPhase", "phase": "unpackPhase" }
Running phase: unpackPhase
unpacking source archive /nix/store/32hmlbaxdcspv9qq1rk4v1i60h4lws9y-source
source root is source
@nix { "action": "setPhase", "phase": "patchPhase" }
Running phase: patchPhase
@nix { "action": "setPhase", "phase": "updateAutotoolsGnuConfigScriptsPhase" }
Running phase: updateAutotoolsGnuConfigScriptsPhase
@nix { "action": "setPhase", "phase": "configurePhase" }
Running phase: configurePhase
@nix { "action": "setPhase", "phase": "buildPhase" }
Running phase: buildPhase
Building subPackage ./cmd
nvml/driver_linux.go:11:2: cannot find module providing package github.com/NVIDIA/go-nvml/pkg/nvml: import lookup disabled by -mod=vendor
        (Go version in go.mod is at least 1.14 and vendor directory exists.)

geekodour commented 2 months ago

Reproducible example:

# see https://github.com/NixOS/nixpkgs/pull/304108
# see https://github.com/hashicorp/nomad-device-nvidia
# see https://github.com/geekodour/nomad-device-nvidia
{ lib, pkgs, buildGoModule, fetchFromGitHub }:

buildGoModule rec {
  pname = "nomad-device-nvidia";
  version = "8598e31a0a38a9ed5e14451cf86ab8a8211ab98b"; # Jul 27, 2024

  #nativeBuildInputs = [ pkgs.autoAddDriverRunpath ];

  CGO_ENABLED = 1;
  # GOOS = "linux";
  # GOARCH = "amd64";

  # doCheck = true;
  # doInstallCheck = false;
  # runVend = true;
  proxyVendor = true;
  # deleteVendor = true;

  src = fetchFromGitHub {
    owner = "geekodour";
    repo = pname;
    rev = "${version}";
    sha256 = "sha256-urASq/T4XcDVUp03bCKqvojCjLrGb+l47JbZWsHbSGg=";
    # sha256 = lib.fakeHash;
  };

  vendorHash = null;

  subPackages = [ "cmd" ];
  # subPackages = [ "." ];

  meta = with lib; {
    homepage = "https://github.com/hashicorp/nomad-device-nvidia";
    description = "Nomad device plugin for Nvidia GPUs";
    mainProgram = "nomad-device-nvidia";
    platforms = platforms.linux;
    license = licenses.mpl20;
    maintainers = with maintainers; [ geekodour ];
  };
}

geekodour commented 2 months ago

I adopted a very rough workaround for now: I keep the compiled binary in a directory under my home directory and use it from an overlay package:

{ lib, pkgs, stdenv, fetchFromGitHub }:

stdenv.mkDerivation rec {
  name = "nomad-device-nvidia";
  src = /home/geekodour/infra/nomad-plugins/nomad-device-nvidia;

  nativeBuildInputs = [pkgs.autoAddDriverRunpath];
  buildPhase = "";
  dontUnpack = true;
  doCheck = false;
  installPhase = ''
    #cp -r $src $out
    mkdir -p $out/bin
    cp $src/nomad-device-nvidia $out/bin
  '';

  meta = with lib; {
    homepage = "https://github.com/hashicorp/nomad-device-nvidia";
    description = "Nomad device plugin for Nvidia GPUs";
    mainProgram = "nomad-device-nvidia";
    platforms = platforms.linux;
    license = licenses.mpl20;
    maintainers = with maintainers; [ geekodour ];
  };
}
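
A possible alternative to copying a prebuilt binary, offered only as an untested sketch. The `symbol lookup error` reported earlier in this thread is consistent with eager symbol binding: go-nvml appears to leave the NVML symbols unresolved at link time and look them up in `libnvidia-ml.so.1` at run time, so a binary linked with `-z now` can fail at startup on any symbol that is not available. Nixpkgs enables the `bindnow` hardening flag by default. Assuming that is the cause here (nothing in this thread confirms it), the original expression could be adjusted roughly as follows. The `hardeningDisable` and `ldflags` lines are the assumptions; everything else is taken from the expressions above.

```nix
# Untested sketch, not a confirmed fix for this issue.
{ lib, buildGoModule, fetchFromGitHub, autoAddDriverRunpath }:

buildGoModule rec {
  pname = "nomad-device-nvidia";
  version = "66fe3a14e471f4844dffa13ada3c6fdadcd98ab7"; # Jul 10, 2024

  src = fetchFromGitHub {
    owner = "hashicorp";
    repo = pname;
    rev = version;
    sha256 = "sha256-2zdTslzWnaWg3I4bijYIU+nBDsab25iVO8x7v5ymamM=";
  };

  vendorHash = "sha256-h2qp/wHlvqiNLl6dw7UD+/G0iPfdEj8KsACCRMSUYaI=";

  # Builds ./cmd, so the resulting binary is named "cmd" as in the error above.
  subPackages = [ "cmd" ];

  # Same hook as in the overlay workaround above.
  nativeBuildInputs = [ autoAddDriverRunpath ];

  # Assumption: the default "bindnow" hardening flag links with `-z now`,
  # which forces every NVML symbol to resolve when the binary is loaded.
  hardeningDisable = [ "bindnow" ];

  # Assumption: explicitly request lazy binding from the external linker.
  ldflags = [ "-extldflags=-Wl,-z,lazy" ];

  # Carried over from the earlier workaround: skip the test phase.
  doCheck = false;

  meta = with lib; {
    homepage = "https://github.com/hashicorp/nomad-device-nvidia";
    description = "Nomad device plugin for Nvidia GPUs";
    platforms = platforms.linux;
    license = licenses.mpl20;
  };
}
```

Inspecting the built binary with `readelf -d` and looking for a `BIND_NOW` entry would show whether eager binding is still in effect.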

shoenig commented 1 month ago

Hi @geekodour, glad you got it working. While we can appreciate NixOS, we don't have the expertise to help support it; the driver builds and packages well on the [mainstream] distros we support customers with.