Closed · TobTobXX closed this issue 2 months ago
@TobTobXX Use the NixOS module in the Flake and report back. Consider donating via GitHub sponsors if you want documentation, it's one of the goals.
Nah, still doesn't work:
```nix
{ pkgs, ... }:
{
  imports = [
    (builtins.getFlake "github:nixified-ai/flake").nixosModules.invokeai-nvidia
  ];
  nixpkgs.config = {
    allowUnfree = true;
    cudaSupport = true;
  };
  nix.settings.trusted-substituters = [ "https://ai.cachix.org" ];
  nix.settings.trusted-public-keys = [ "ai.cachix.org-1:N9dzRK+alWwoKXQlnn0H6aUx0lU/mspIoz8hMvGvbbc=" ];
  services.invokeai = {
    enable = true;
    settings = {
      host = "[::]";
      port = 9090;
    };
  };
}
```
I'll try to investigate further, but if you have any pointers, I'd be glad.
While I would like to contribute, I'm not in a situation to do so financially. However, I could very well work on expanding the documentation for you.
@TobTobXX If you're doing a lot of `nixos-rebuild switch`es, make sure to reboot the system when messing with kernel modules. I'm not 100% sure, but it could also be that your driver is too new for this codebase. This is where a VM with GPU passthrough could resolve the impurity and incompatibility, something I'd also like to provide as part of the flake. I see you're using CUDA 12, but I built this codebase with CUDA 11.
Ok, so I did some more tests and I think the problem is most likely the mismatch between the driver's CUDA version and torch's CUDA version.
Torch appears to be compiled with CUDA 11.8, as you hinted:
```
[root@server:~]# nix develop github:nixified-ai/flake#invokeai-nvidia
[root@server:~]# python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 2.0.1
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: NixOS 23.11 (Tapir) (x86_64)
GCC version: (GCC) 12.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.38

Python version: 3.11.6 (main, Oct  2 2023, 13:45:54) [GCC 12.3.0] (64-bit runtime)
Python platform: Linux-6.1.84-x86_64-with-glibc2.38
Is CUDA available: False
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060
Nvidia driver version: 470.223.02
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
...
```
However, my driver ships CUDA version 12.3, as seen above. I tried downgrading the driver to version 470 (and rebooting, of course), but that gives me CUDA version 11.4, which yields the same error. (Which driver version do you use?)
Is there a way to upgrade torch instead?
Apparently you really can't run pytorch with mismatching CUDA versions, even if the driver's one is higher: https://stackoverflow.com/a/76726156
That's really great to have found out, thank you for the research.
Perhaps we can set this up in the nixosModule. Providing a GPU-passthrough VM script or module is also possible, but then you have to run a VM. (I'm new to Nix, so correct me on anything I get wrong.)
Option A (changing the CUDA driver):
Option B (changing torch):
Aside from the build time, Option B appears to be the better option?
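For illustration, Option A could look roughly like the following NixOS fragment, a hedged sketch only: `legacy_470` is the driver branch already tried above (it ships CUDA 11.4, not 11.8), so whether any branch's userspace actually lines up with this torch build is an open question.

```nix
# Sketch of Option A: pin the NVIDIA driver to a specific branch so its
# bundled CUDA userspace matches the CUDA version torch was built with.
# legacy_470 is only the branch tried earlier in this thread.
{ config, ... }:
{
  hardware.nvidia.package = config.boot.kernelPackages.nvidiaPackages.legacy_470;
}
```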
InvokeAI runs with pytorch==2.0.1 (see log above). Is that specified anywhere? I tried searching this repo and the InvokeAI repo, but didn't find any version information. The latest version would be 2.2.2.
pytorch 2.0.1 is only compatible with CUDA 11.7 and 11.8 (ref); pytorch 2.2.2 is compatible with CUDA 11.8 and 12.1 (ref).
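The constraint above can be expressed as a tiny lookup. This is just an illustration built from the two data points quoted in this thread, not a complete compatibility matrix:

```python
# Which torch releases ship builds for a given CUDA toolkit version?
# The table holds only the two releases discussed above (an assumption
# that these are the versions of interest, not an exhaustive list).
TORCH_CUDA_SUPPORT = {
    "2.0.1": {"11.7", "11.8"},
    "2.2.2": {"11.8", "12.1"},
}

def torch_releases_for(cuda_version: str) -> list[str]:
    """Return the torch releases built against the given CUDA version."""
    return sorted(release for release, cudas in TORCH_CUDA_SUPPORT.items()
                  if cuda_version in cudas)

print(torch_releases_for("11.8"))  # → ['2.0.1', '2.2.2']
```

Note that CUDA 11.8 is the only toolkit version both releases support, which is why the current build targets it.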
@TobTobXX A third option is to fix the backwards compatibility in PyTorch if you have the C++/Python skills to do so.
https://docs.nvidia.com/deploy/cuda-compatibility/index.html
Yes, the torch version is specified in Nixpkgs.
```
user: matthew 🌐 swordfish in ~ took 37s
❯ nix repl -L
Welcome to Nix 2.20.5. Type :? for help.

nix-repl> :lf github:nixified-ai/flake
Added 24 variables.

nix-repl> inputs.nixpkgs.legacyPackages.x86_64-linux.python3Packages.torch.version
"2.0.1"
```
Ooff, you mean backporting PyTorch? No, I don't think I'm able to do that.
However, there is yet another option... waiting.
NixOS 24.05 isn't too far off, and on that channel pytorch should be 2.2.1 (which is currently in unstable).
The 23.11 channel is weird anyway, because the NVIDIA driver and pytorch are essentially incompatible. I think I'll drop a question about that to the CUDA maintainers.
By the way, which driver do you use? Why doesn't this occur for your GPU?
> driver to match the one our torch is using

PyTorch doesn't (directly) link to the driver; instead it uses the impure runpath (`addDriverRunpath` in the nixpkgs manual).

"Found no NVIDIA driver on your system" must mean `libcuda.so` simply wasn't found (in `/run/opengl-driver/lib` or through `LD_LIBRARY_PATH`). There's also a slight chance that the message is wrong and `libcuda.so` was found but didn't match the kernel module of the currently running system.

Start by verifying that `/run/opengl-driver/lib/libcuda.so` exists (e.g. I don't see `hardware.opengl.enable` in your snippet, so maybe it doesn't). Test whether simpler things like `nvidia-smi` and `nix run -f '<nixpkgs>' --arg config '{ allowUnfree = true; }' cudaPackages.saxpy` work. If errors persist, run the offending commands with the `LD_DEBUG=libs` environment variable set and publish the logs.
Thanks a lot for dropping in!
> (e.g. I don't see `hardware.opengl.enable` in your snippet, so maybe it doesn't)
... sigh... You know those times when you'd like to smack your past self rather hard?
I'm terribly sorry for wasting your time, and thank you a lot. The generation time just went down from 550s to 13s. You rock!
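For anyone landing here later: the missing piece was that the NVIDIA userspace libraries (including `libcuda.so`) were never installed to `/run/opengl-driver/lib`. A minimal sketch of the relevant options, using the NixOS 23.11 option names (later releases rename `hardware.opengl` to `hardware.graphics`):

```nix
# Sketch of the fix discussed above: expose the NVIDIA userspace driver
# (libcuda.so and friends) under /run/opengl-driver/lib, where PyTorch's
# impure runpath expects to find it.
{
  hardware.opengl.enable = true;
  services.xserver.videoDrivers = [ "nvidia" ];
}
```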
I'm trying to run this on a Linux server with an RTX 3060 12 GB.
The server runs on NixOS and has the NVIDIA driver configured:
And it seems to work:
However, when I run InvokeAI, it always chooses the CPU. And if I explicitly configure `cuda` or `cuda:1` (what's the difference?), I get this error:

What should I do?