ewebgh33 opened 11 months ago
Look at the Torch documentation, it says 12.1 :)
Get the right torch package here -> https://pytorch.org/get-started/locally/
Thanks, I didn't see it in the documentation; I just looked at the main GitHub page and there is no link to the docs there. Maybe add a line under "Running locally" on the main repo page: "You will need CUDA 12.1 and etc etc, then git clone etc." :)
Will 12.3 work as well? I'm doing a system update and it seems like I should get on the latest; other LLM apps need 12.2, and I hope 12.3 will at least get me through the next few months!
No, I don't think so. Maybe you can customise torch through some configs, but out of the box (as far as I know) that's not possible.
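For what it's worth, what has to match is the CUDA version the installed Torch wheel was built against, not whatever toolkit happens to be on the system. A quick way to check that (it prints the wheel's CUDA version, or None for a CPU-only build):

python -c "import torch; print(torch.version.cuda)"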
Alright, so I uninstalled CUDA and installed 12.1. Checked the Windows environment variables, all seems OK. Deleted the conda env to start fresh, set up a new one, pip installed the requirements, ran server.py.
Error:
(exui) C:\AI\Text\exui>python server.py
No CUDA runtime is found, using CUDA_HOME='C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.1'
Traceback (most recent call last):
File "C:\AI\Text\exui\server.py", line 11, in <module>
from backend.models import update_model, load_models, get_model_info, list_models, remove_model, load_model, unload_model, get_loaded_model
File "C:\AI\Text\exui\backend\models.py", line 5, in <module>
from exllamav2 import(
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\__init__.py", line 3, in <module>
from exllamav2.model import ExLlamaV2
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\model.py", line 17, in <module>
from exllamav2.cache import ExLlamaV2CacheBase
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\cache.py", line 2, in <module>
from exllamav2.ext import exllamav2_ext as ext_c
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\exllamav2\ext.py", line 131, in <module>
exllamav2_ext = load \
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py", line 1308, in load
return _jit_compile(
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py", line 1710, in _jit_compile
_write_ninja_file_and_build_library(
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py", line 1810, in _write_ninja_file_and_build_library
_write_ninja_file_to_build_library(
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py", line 2199, in _write_ninja_file_to_build_library
cuda_flags = common_cflags + COMMON_NVCC_FLAGS + _get_cuda_arch_flags()
File "C:\Users\ComputeyName\AppData\Local\Programs\Python\Python310\lib\site-packages\torch\utils\cpp_extension.py", line 1980, in _get_cuda_arch_flags
arch_list[-1] += '+PTX'
IndexError: list index out of range
Why would it not find it when it's exactly where it says it's looking? Because nvcc --version shows:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Wed_Feb__8_05:53:42_Coordinated_Universal_Time_2023
Cuda compilation tools, release 12.1, V12.1.66
Build cuda_12.1.r12.1/compiler.32415258_0
It's there, right? The driver also supports it; I did check that too.
The failure seems to happen in PyTorch, at this point:
arch_list = []
# the assumption is that the extension should run on any of the currently visible cards,
# which could be of different types - therefore all archs for visible cards should be included
for i in range(torch.cuda.device_count()):
    capability = torch.cuda.get_device_capability(i)
    supported_sm = [int(arch.split('_')[1])
                    for arch in torch.cuda.get_arch_list() if 'sm_' in arch]
    max_supported_sm = max((sm // 10, sm % 10) for sm in supported_sm)
    # Capability of the device may be higher than what's supported by the user's
    # NVCC, causing compilation error. User's NVCC is expected to match the one
    # used to build pytorch, so we use the maximum supported capability of pytorch
    # to clamp the capability.
    capability = min(max_supported_sm, capability)
    arch = f'{capability[0]}.{capability[1]}'
    if arch not in arch_list:
        arch_list.append(arch)
arch_list = sorted(arch_list)
arch_list[-1] += '+PTX'
It fails on the last line, indexing the last element of arch_list, which means that list is empty. The only way I can see that happening is if torch.cuda.device_count() is zero, i.e. Torch has not recognized any CUDA devices.
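A quick way to confirm that from inside the same environment, using only standard Torch calls (nothing exui-specific):

import torch
print(torch.__version__)           # should end in +cu121 for the CUDA 12.1 wheel
print(torch.cuda.is_available())   # False means Torch can't see the CUDA runtime
print(torch.cuda.device_count())   # 0 is exactly what leaves arch_list empty above
print(torch.cuda.get_arch_list())  # empty list on a CPU-only build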
Are you sure you have the CUDA-enabled version of Torch installed? pip freeze should show torch==...+cu121, I believe. The "No CUDA runtime is found" error is also emitted by Torch, so it does look like you have the CUDA version, but it really can't find the CUDA runtime, which would be provided by the NVIDIA driver. I don't know if maybe that's not installed, or not available somehow? Do you get anything from running nvidia-smi?
for "pip show torch":
Name: torch
Version: 2.1.2
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: c:\users\hesperos\appdata\local\programs\python\python310\lib\site-packages
Requires: filelock, fsspec, jinja2, networkx, sympy, typing-extensions
Required-by: exllamav2
pip freeze shows torch==2.1.2, and in Anaconda it shows torch==2.1.0+cu121, but then once the environment is active Anaconda also shows torch==2.1.2.
nvidia-smi shows
NVIDIA-SMI 546.12 Driver Version: 546.12 CUDA Version: 12.3
And the 2x 4090s.
Of course, the 12.3 there just means that the driver is compatible up to that version of CUDA, as you know.
So that's the weird thing: torch is there, CUDA is there, etc. Or is it? I need a torch install inside the env, but it wasn't in the requirements... it's a requirement though? Hm. Now, after installing torch, I get "No module named 'flask'", which I suppose means I'm supposed to have an environment variable for it. But if I run flask --version I get:
Python 3.10.8
Flask 3.0.0
Werkzeug 3.0.1
So I don't know whether I have it or not.
You definitely have the non-CUDA version of Torch. Why 2.1.0+cu121 shows up in Anaconda I don't know. In any case, I would do:
pip uninstall torch
pip install torch --index-url https://download.pytorch.org/whl/cu121
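After the reinstall, pip freeze should show the +cu121 suffix, and this one-liner should print True (just a sanity check, run inside the env):

python -c "import torch; print(torch.cuda.is_available())"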
There's definitely something up with your conda envs though. flask is a requirement and shouldn't need any environment variables. You should be able to install it with:
pip install flask waitress
(waitress being the next requirement it would likely not find if it isn't finding flask.)
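One thing worth checking while you're at it: flask --version may be answering from a different Python install entirely (e.g. the system one on PATH), not from the conda env. Running the import through the env's own interpreter removes that ambiguity; this just prints which interpreter and which flask install actually get picked up:

python -c "import sys, flask; print(sys.executable); print(flask.__file__)"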
Had the same issue today; installed from Anaconda:
conda create -n exui python==3.10
conda activate exui
pip install -r requirements.txt
And received the same error. To fix that:
pip uninstall torch  # then generate the right version of the pytorch install command on their site
pip install torch --index-url https://download.pytorch.org/whl/cu121
Now everything is working. I guess it would be nice to add a note to the installation guide for conda users.
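For reference, a sequence like this should avoid the problem in the first place; it's just the same steps reordered (assuming CUDA 12.1), so the CUDA-enabled torch wheel is already in place before requirements.txt gets a chance to pull in the CPU-only build:

conda create -n exui python==3.10
conda activate exui
pip install torch --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt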
I had this issue today, because I'm using 12.1 instead of 12.2. The readme currently refers to 12.1, but the install expects 12.2. We could be more generic about it to help other users troubleshoot, maybe: "Run pip freeze, make sure you have the version listed installed in (/.../), otherwise pip uninstall torch (...) pip install torch (...), or install the correct version of Torch from (...)."
True, something like a "currently supported versions" list would be helpful.
Hi, not sure what is happening here, but when I try to run python server.py, it says No CUDA runtime is found. Is a specific version needed? Which one?
i.e., I have 12.0 installed, with PATH set in the Win11 environment variables. I use 12.0 for 3D rendering and design.
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.0
It says not found here though, so should I assume 12.0 is not the version we need? I also installed 12.3 inside the conda environment "exui", and it's not seeing that either. I thought the point of an environment was that it looks in there first, and then outside the env if the components it needs are not present. If I have to set the Windows PATH for every environment I run, I'll be forever switching the PATH all day.
Or is 12.3 too new? Do I need 12.2 instead?
This GUI/app looks really good, but I think the install instructions could be a bit more detailed and take into account an environment manager or two.