invoke-ai / InvokeAI

Invoke is a leading creative engine for Stable Diffusion models, empowering professionals, artists, and enthusiasts to generate and create visual media using the latest AI-driven technologies. The solution offers an industry-leading WebUI and serves as the foundation for multiple commercial products.
https://invoke-ai.github.io/InvokeAI/
Apache License 2.0

[bug]: fails to use CUDA device #7011

Closed · aakropotkin closed 1 month ago

aakropotkin commented 1 month ago

Is there an existing issue for this problem?

Operating system

Linux

GPU vendor

Nvidia (CUDA)

GPU model

RTX 4080 Super

GPU VRAM

No response

Version number

5.0.2

Browser

chromium

Python dependencies

No response

What happened

This occurs with the automatic installer, with a manual installation, and in the nix development environment.

I use a NixOS system (v3.x.x works great for me).

The first few lines are the important ones.

invokeai-web
/home/camus/repos/invokeai/venv/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
[2024-10-02 09:04:20,341]::[InvokeAI]::INFO --> Patchmatch initialized
[2024-10-02 09:04:20,991]::[InvokeAI]::INFO --> Using torch device: CPU
[2024-10-02 09:04:21,134]::[InvokeAI]::INFO --> cuDNN version: 8901
[2024-10-02 09:04:21,144]::[uvicorn.error]::INFO --> Started server process [2654760]
[2024-10-02 09:04:21,144]::[uvicorn.error]::INFO --> Waiting for application startup.
[2024-10-02 09:04:21,145]::[InvokeAI]::INFO --> InvokeAI version 5.0.2
[2024-10-02 09:04:21,145]::[InvokeAI]::INFO --> Root directory = /home/camus/invokeai-5
[2024-10-02 09:04:21,145]::[InvokeAI]::INFO --> Initializing database at /home/camus/invokeai-5/databases/invokeai.db
[2024-10-02 09:04:21,338]::[uvicorn.error]::INFO --> Application startup complete.
[2024-10-02 09:04:21,338]::[uvicorn.error]::INFO --> Uvicorn running on http://127.0.0.1:9090 (Press CTRL+C to quit)
[2024-10-02 09:04:26,131]::[uvicorn.access]::INFO --> 127.0.0.1:47428 - "GET /ws/socket.io/?EIO=4&transport=polling&t=P9DY-AE HTTP/1.1" 200
[2024-10-02 09:04:26,135]::[uvicorn.access]::INFO --> 127.0.0.1:47428 - "POST /ws/socket.io/?EIO=4&transport=polling&t=P9DY-AK&sid=wGYv9AElERWCzRASAAAA HTTP/1.1" 200
[2024-10-02 09:04:26,136]::[uvicorn.error]::INFO --> ('127.0.0.1', 47442) - "WebSocket /ws/socket.io/?EIO=4&transport=websocket&sid=wGYv9AElERWCzRASAAAA" [accepted]
[2024-10-02 09:04:26,136]::[uvicorn.error]::INFO --> connection open
[2024-10-02 09:04:26,137]::[uvicorn.access]::INFO --> 127.0.0.1:47438 - "GET /ws/socket.io/?EIO=4&transport=polling&t=P9DY-AL&sid=wGYv9AElERWCzRASAAAA HTTP/1.1" 200
[2024-10-02 09:04:26,139]::[uvicorn.access]::INFO --> 127.0.0.1:47428 - "GET /ws/socket.io/?EIO=4&transport=polling&t=P9DY-AP&sid=wGYv9AElERWCzRASAAAA HTTP/1.1" 200
[2024-10-02 09:04:26,171]::[uvicorn.access]::INFO --> 127.0.0.1:47428 - "GET /api/v1/app/invocation_cache/status HTTP/1.1" 200
[2024-10-02 09:04:26,172]::[uvicorn.access]::INFO --> 127.0.0.1:47438 - "GET /api/v1/app/version HTTP/1.1" 200
[2024-10-02 09:04:26,175]::[uvicorn.access]::INFO --> 127.0.0.1:47456 - "GET /api/v1/queue/default/status HTTP/1.1" 200
[2024-10-02 09:04:26,176]::[uvicorn.access]::INFO --> 127.0.0.1:47468 - "GET /api/v1/app/config HTTP/1.1" 200
[2024-10-02 09:04:26,176]::[uvicorn.access]::INFO --> 127.0.0.1:47476 - "GET /api/v1/queue/default/list HTTP/1.1" 200
[2024-10-02 09:04:26,176]::[uvicorn.access]::INFO --> 127.0.0.1:47478 - "GET /api/v1/boards/?all=true HTTP/1.1" 200
[2024-10-02 09:04:26,179]::[uvicorn.access]::INFO --> 127.0.0.1:47438 - "GET /api/v1/models/?model_type=controlnet HTTP/1.1" 404
[2024-10-02 09:04:26,180]::[uvicorn.access]::INFO --> 127.0.0.1:47456 - "GET /api/v1/models/?model_type=t2i_adapter HTTP/1.1" 404
[2024-10-02 09:04:26,181]::[uvicorn.access]::INFO --> 127.0.0.1:47468 - "GET /api/v1/models/?model_type=ip_adapter HTTP/1.1" 404
[2024-10-02 09:04:26,181]::[uvicorn.access]::INFO --> 127.0.0.1:47428 - "GET /api/v1/models/?model_type=lora HTTP/1.1" 404
^C[2024-10-02 09:04:26,850]::[uvicorn.error]::INFO --> Shutting down
[2024-10-02 09:04:26,851]::[uvicorn.error]::INFO --> connection closed
[2024-10-02 09:04:26,951]::[uvicorn.error]::INFO --> Waiting for application shutdown.
[2024-10-02 09:04:27,148]::[ModelInstallService]::INFO --> Installer thread 140730064438976 exiting
[2024-10-02 09:04:27,149]::[uvicorn.error]::INFO --> Application shutdown complete.
[2024-10-02 09:04:27,149]::[uvicorn.error]::INFO --> Finished server process [2654760]

I'm able to generate images on my CPU, but it's really slow.
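
For anyone triaging: a minimal sketch of a probe that separates a torch/driver problem from an Invoke problem, assuming the same venv as in the log above and using only standard torch APIs (the script name is hypothetical):

# cuda_probe.py (hypothetical name) - run with the venv's python.
# Nothing InvokeAI-specific; it only exercises the same torch CUDA
# initialization that fails in the log above.
import torch

print("torch version :", torch.__version__)
print("built for CUDA:", torch.version.cuda)           # CUDA version torch was built against
print("cuda available:", torch.cuda.is_available())    # triggers the Error 803 warning here
print("device count  :", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0      :", torch.cuda.get_device_name(0))

If this prints the warning and "cuda available: False" outside of Invoke too, the problem is in the torch/driver layer of the environment rather than in InvokeAI itself.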

What you expected to happen

CUDA and my graphics card should be used.

How to reproduce the problem

No response

Additional context

No response

Discord username

No response

psychedelicious commented 1 month ago

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    1517067      G   /usr/lib/xorg/Xorg                          137MiB |
|    0   N/A  N/A    1517153      G   /usr/bin/gnome-shell                         13MiB |
+-----------------------------------------------------------------------------------------+

aakropotkin commented 1 month ago

This is nvidia-smi outside of the development environment:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4080 ...    Off |   00000000:01:00.0  On |                  N/A |
| 58%   77C    P0            304W /  320W |   10650MiB /  16376MiB |     99%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A      1746      G   ...vyv5iw6fa-xorg-server-21.1.13/bin/X       2080MiB |
|    0   N/A  N/A      2542      G   ...0dgg416xd-kwin-5.27.11/bin/kwin_x11        150MiB |
|    0   N/A  N/A      2562      G   ...workspace-5.27.11.1/bin/plasmashell        112MiB |
|    0   N/A  N/A    132218      C   ...qk84j-python3-3.11.6/bin/python3.11       7842MiB |
|    0   N/A  N/A    438551      G   ...local/share/Steam/ubuntu12_32/steam          4MiB |
|    0   N/A  N/A    441248      G   ./steamwebhelper                                5MiB |
|    0   N/A  N/A    944535      G   ...irefox-130.0.1/bin/.firefox-wrapped        272MiB |
+-----------------------------------------------------------------------------------------+

I did notice that, inside the nix development environment, I get this crash when trying to run nvidia-smi:

Failed to initialize NVML: Driver/library version mismatch
NVML library version: 535.86
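
That NVML error means the user-space NVIDIA libraries resolved inside the nix shell (535.86) are older than the running kernel module (560.35.03, per the nvidia-smi output above). A minimal sketch that makes the mismatch visible from Python, assuming Linux with the NVIDIA kernel driver loaded and using only ctypes plus standard NVML entry points (the script name is hypothetical):

# mismatch_check.py (hypothetical name): compare the kernel module's
# driver version with whichever libnvidia-ml the environment resolves.
import ctypes

# Version of the loaded NVIDIA kernel module, straight from the kernel.
with open("/proc/driver/nvidia/version") as f:
    print("kernel module:", f.readline().strip())

# Version of the user-space NVML library on this shell's search path;
# on Nix this can come from the flake inputs rather than the host driver.
nvml = ctypes.CDLL("libnvidia-ml.so.1")
if nvml.nvmlInit_v2() == 0:
    buf = ctypes.create_string_buffer(80)
    nvml.nvmlSystemGetDriverVersion(buf, len(buf))
    print("NVML library :", buf.value.decode())
    nvml.nvmlShutdown()
else:
    # nvmlInit_v2 returns a version-mismatch error code in exactly this case
    print("NVML init failed: driver/library version mismatch")

If the two versions differ, torch's Error 803 and the nvidia-smi crash share the same root cause: the shell loads stale user-space driver libraries against a newer kernel module.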

psychedelicious commented 1 month ago

I imagine that mismatch of driver versions is a factor, but I don't know enough about Nix to be of assistance, sorry!

aakropotkin commented 1 month ago

Updating flake.lock seems to have helped: I can launch the back-end now and it detects my device. Similarly, nvidia-smi inside the shell now shows the correct info.

What's missing now is an update to flake.nix to allow the front-end to be built; it looks like the flake is old and expects a yarn build to take place, so I'll see if I can get that updated and open a PR.

psychedelicious commented 1 month ago

Thank you! That's right, we currently use pnpm v8 (v9 is the latest version, but we haven't upgraded yet).

aakropotkin commented 1 month ago

https://github.com/invoke-ai/InvokeAI/pull/7032