nixified-ai / flake

A Nix flake for many AI projects
GNU Affero General Public License v3.0

Found no NVIDIA driver on your system #92

Closed TobTobXX closed 2 months ago

TobTobXX commented 2 months ago

I'm trying to run this on a Linux server with an RTX 3060 12GB.

The server runs on NixOS and has the NVIDIA driver configured:

# ...
    nixpkgs.config = {
        allowUnfree = true;
        cudaSupport = true;
    };
    services.xserver.videoDrivers = [ "nvidia" ];
    hardware.nvidia = {
        nvidiaSettings = false;

        # Optionally, you may need to select the appropriate driver version for your specific GPU.
        package = config.boot.kernelPackages.nvidiaPackages.beta;
    };
# ...

And it seems to work:

[root@server:~]# nvidia-smi
Mon Apr  8 20:36:31 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.02              Driver Version: 545.29.02    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3060        Off | 00000000:06:00.0 Off |                  N/A |
| 34%   41C    P0              34W / 170W |      1MiB / 12288MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

However, when I run InvokeAI, it always chooses the CPU. And if I explicitly configure cuda or cuda:1 (what's the difference?), I get this error:

...
  File "/nix/store/f3iw0nk6bcx51mzzz6bqw6r0hvvfxyb7-python3.11-torch-2.0.1/lib/python3.11/site-packages/torch/cuda/__init__.py", line 247, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

What should I do?
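On the cuda vs cuda:1 aside: in a torch device string, the optional suffix selects a GPU index, and bare cuda refers to the current device (index 0 unless changed with torch.cuda.set_device). A stdlib sketch of that convention (not torch's actual parser):

```python
# "cuda" -> current device (index 0 by default); "cuda:1" -> second GPU.
# Illustrative parser only; torch does this internally in torch.device().
def parse_device(spec):
    kind, _, idx = spec.partition(":")
    return kind, int(idx) if idx else 0

print(parse_device("cuda"))    # ('cuda', 0)
print(parse_device("cuda:1"))  # ('cuda', 1)
```

So on a single-GPU machine, cuda:1 would additionally fail because there is no second device.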

MatthewCroughan commented 2 months ago

@TobTobXX Use the NixOS module in the Flake and report back. Consider donating via GitHub sponsors if you want documentation, it's one of the goals.

TobTobXX commented 2 months ago

Nah, still doesn't work:

{ pkgs, ... }:

{
    imports = [
        (builtins.getFlake "github:nixified-ai/flake").nixosModules.invokeai-nvidia
    ];
    nixpkgs.config = {
        allowUnfree = true;
        cudaSupport = true;
    };
    nix.settings.trusted-substituters = ["https://ai.cachix.org"];
    nix.settings.trusted-public-keys = ["ai.cachix.org-1:N9dzRK+alWwoKXQlnn0H6aUx0lU/mspIoz8hMvGvbbc="];
    services.invokeai = {
        enable = true;
        settings = {
            host = "[::]";
            port = 9090;
        };
    };
}

I'll try to investigate further, but if you have any pointers, I'd be glad.

While I would like to contribute, I'm not in a situation to do so financially. However, I could very well work on expanding the documentation for you.

MatthewCroughan commented 2 months ago

@TobTobXX If you're doing a lot of nixos-rebuild switches, make sure to reboot the system when messing with kernel modules. I'm not 100% sure, but it could also be that your driver is too new to interact with this codebase. A VM with GPU passthrough could resolve the impurity and incompatibility; that's something I'd also like to provide as part of the flake. I see you're using CUDA 12, but I built this codebase with CUDA 11.

TobTobXX commented 2 months ago

Ok, so I did some more tests and I think the problem is most likely the mismatch between the driver's CUDA version and torch's CUDA version.

Torch appears to be compiled with CUDA 11.8, as you hinted:

[root@server:~]# nix develop github:nixified-ai/flake#invokeai-nvidia

[root@server:~]# python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: NixOS 23.11 (Tapir) (x86_64)
GCC version: (GCC) 12.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.38

Python version: 3.11.6 (main, Oct  2 2023, 13:45:54) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.1.84-x86_64-with-glibc2.38
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 470.223.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

...

However, my driver reports CUDA version 12.3, as seen above. I tried downgrading the driver to version 470 (and rebooting, of course), but that gives CUDA version 11.4, which yields the same error. (Which driver version do you use?)

Is there a way to upgrade torch instead?

TobTobXX commented 2 months ago

Apparently you really can't run pytorch with mismatching CUDA versions, even if the driver's one is higher: https://stackoverflow.com/a/76726156

MatthewCroughan commented 2 months ago

That's really great to have found out, thank you for the research.

Perhaps we can set this up in the nixosModule.

Providing a GPU passthrough VM script or module is also possible, but then you have to run a VM.
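Such a guard in the nixosModule could look roughly like this (hypothetical sketch; the option path is real, but the version bound and message are illustrative, not part of the flake):

```nix
# Hypothetical sketch: fail evaluation early when the configured NVIDIA
# driver is unlikely to match the CUDA version torch was built against.
{ config, lib, ... }:
{
  assertions = [{
    assertion = lib.versionAtLeast config.hardware.nvidia.package.version "520";
    message = "torch in this flake targets CUDA 11.8; use a newer NVIDIA driver.";
  }];
}
```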

TobTobXX commented 2 months ago

(I'm new to nix, so correct me on anything I get wrong)

Option A (changing the CUDA driver):

Option B (changing torch):

Aside from the build time, Option B appears to be the better option?

InvokeAI runs with pytorch==2.0.1 (see log above). Is that specified anywhere? I tried searching this repo and the InvokeAI repo, but didn't find any version information. The latest version would be 2.2.2.

pytorch 2.0.1 only has compatibility with CUDA 11.7 and 11.8 (ref)
pytorch 2.2.2 has compatibility with CUDA 11.8 and 12.1 (ref)
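The support matrix above can be turned into a tiny sanity-check helper (sketch only; the pairs are copied from the two release notes cited, extend as needed):

```shell
# Return 0 if the given torch version officially supports the given CUDA version.
torch_supports_cuda() {
  case "$1/$2" in
    2.0.1/11.7|2.0.1/11.8) return 0 ;;
    2.2.2/11.8|2.2.2/12.1) return 0 ;;
    *) return 1 ;;
  esac
}

torch_supports_cuda 2.0.1 12.3 && echo compatible || echo incompatible  # incompatible
```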

MatthewCroughan commented 2 months ago

@TobTobXX A third option is to fix the backwards compatibility in PyTorch if you have the C++/Python skills to do so.

https://docs.nvidia.com/deploy/cuda-compatibility/index.html

Yes, the torch version is specified in Nixpkgs.

user: matthew 🌐 swordfish in ~ took 37s 
❯ nix repl -L
Welcome to Nix 2.20.5. Type :? for help.

nix-repl> :lf github:nixified-ai/flake
Added 24 variables.

nix-repl> inputs.nixpkgs.legacyPackages.x86_64-linux.python3Packages.torch.version
"2.0.1"

TobTobXX commented 2 months ago

Oof, you mean backporting pytorch? No, I don't think I'm able to do that.

However, there is yet another option... waiting.

NixOS 24.05 isn't too far off, and on that channel pytorch should be 2.2.1 (the version currently in unstable).

The 23.11 channel is weird anyway, because there the NVIDIA driver and pytorch are essentially incompatible. I think I'll raise a question about that with the CUDA maintainers.

TobTobXX commented 2 months ago

By the way, which driver do you use? Why doesn't this occur for your GPU?

SomeoneSerge commented 2 months ago

> driver to match the one our torch is using

Pytorch doesn't (directly) link to the driver; instead it uses the impure runpath (addDriverRunpath in the nixpkgs manual).

> "Found no NVIDIA driver on your system"

That must mean libcuda.so simply wasn't found (in /run/opengl-driver/lib or through LD_LIBRARY_PATH). There's also a slight chance that the message is wrong and libcuda.so was found, but didn't match the kernel module of the currently running system.

Start by verifying whether /run/opengl-driver/lib/libcuda.so exists (e.g. I don't see hardware.opengl.enable in your snippet, so maybe it doesn't). Test whether simpler things like nvidia-smi and nix run -f '<nixpkgs>' --config '{ allowUnfree = true; }' cudaPackages.saxpy work. If errors persist, run the offending commands with the LD_DEBUG=libs environment variable set and publish the logs.
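The libcuda lookup described above can be mimicked with a few lines of stdlib Python (illustrative only; torch's real probe happens in its C++ layer):

```python
import ctypes

# Try to load the NVIDIA driver library the dynamic linker would hand to torch.
# On NixOS it should resolve via /run/opengl-driver/lib (or LD_LIBRARY_PATH).
try:
    ctypes.CDLL("libcuda.so.1")
    print("libcuda found: torch should be able to initialize CUDA")
except OSError as exc:
    # This failure mode is what surfaces as "Found no NVIDIA driver on your system".
    print(f"libcuda not found: {exc}")
```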

TobTobXX commented 2 months ago

Thanks a lot for dropping in!

> (e.g. I don't see hardware.opengl.enable in your snippet, so maybe it doesn't)

... sigh... Do you know those times when you'd like to smack your past self rather hard?

I'm terribly sorry for wasting all of your time. Thank you a lot. The generation time just went down from 550s to 13s. You rock!
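For future readers: the missing piece was hardware.opengl.enable, which is what populates /run/opengl-driver/lib with the driver userspace libraries (including libcuda.so) on NixOS 23.11. In later NixOS releases this option family was renamed to hardware.graphics:

```nix
# NixOS 23.11: enabling the graphics stack installs the driver userspace
# libraries (including libcuda.so) into /run/opengl-driver/lib.
hardware.opengl.enable = true;
```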