turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

importing exllamav2.generator stops here #505

Open lovebeatz opened 2 weeks ago

lovebeatz commented 2 weeks ago

https://github.com/turboderp/exllamav2/blob/5996922a0f0937aa503efa773780f1648915d73e/exllamav2/ext.py#L281

The error is: NameError: name 'exllamav2_ext' is not defined

Install method: the provided cp311 wheel (cp310 was also tried in a different conda environment); building from source was attempted as well.

I didn't try any older version; I'm looking for a reliable way to serve LLMs.

Also, 'rich' currently needs to be installed separately; it should be included in the wheel.

turboderp commented 2 weeks ago

Do you get any warnings before that?

Also, what version of Torch are you using, what CUDA version, etc.?

lovebeatz commented 2 weeks ago

Here are the steps I followed: created a new conda environment with Python 3.11, then installed exllamav2 from the wheel.

Am I missing a separate installation of torch or CUDA? (torch.cuda.empty_cache() works and clears the L4's GPU memory; I'm running on Linux.) Installation goes fine; it's only on import that the 'rich' error shows up, and after installing rich, the exllamav2_ext issue is next in line. The clone-repo install method also fails at pip install . with an error, whether or not torch with CUDA is installed first.

turboderp commented 2 weeks ago

It's strange, because the installation should fail if it isn't able to install the exllamav2_ext module, and running without that module installed should prompt it to build it at runtime, and if that fails, then you should get an error from torch.utils.cpp_extension.load. It seems that in both cases, that latter function is failing silently for some reason.

CUDA toolkit is required for building from source, as is a CUDA-enabled version of PyTorch. But I think there's something up with your env perhaps, since rich should be a dependency for the wheel and is in requirements.txt. What's your exact PyTorch version? (pip show torch)

You could try setting verbose = True at the top of ext.py. That should at least give you output from when it tries to compile the extension.
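
For reference, the flag in question sits near the top of ext.py; a sketch (the exact line and surrounding comment may differ between versions):

verbose = True  # default is False; True prints the full compiler output
                # when the extension is built at runtime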

lovebeatz commented 2 weeks ago

I'll SSH into the server tomorrow. I don't think I can set verbose if I install from the wheel; the only change I can make is installing torch with CUDA before installing from the wheel.

lovebeatz commented 2 weeks ago

Here's something I tested: I installed PyTorch 2.3.1 with CUDA 12.1 via conda before running pip install on the exllamav2 wheel.

rich still needs to be installed separately (the error shows up when you run the code, not during installation), and there is no change in behavior regarding exllamav2_ext. So it's clear torch doesn't need to be installed separately; this time the wheel didn't take long to install, whereas without torch pre-installed it takes care of torch itself but takes longer.

Code I ran:

from exllamav2.generator import (
    ExLlamaV2Sampler,
)

Error I got:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 1
----> 1 from exllamav2.generator import (
      2     ExLlamaV2Sampler,
      3 )

File ~/miniconda3/envs/agentic/lib/python3.11/site-packages/exllamav2/__init__.py:3
      1 from exllamav2.version import __version__
----> 3 from exllamav2.model import ExLlamaV2
      4 from exllamav2.cache import ExLlamaV2CacheBase
      5 from exllamav2.cache import ExLlamaV2Cache

File ~/miniconda3/envs/agentic/lib/python3.11/site-packages/exllamav2/model.py:31
     28     print("")
     30 import math
---> 31 from exllamav2.config import ExLlamaV2Config
     32 from exllamav2.cache import ExLlamaV2CacheBase
     33 from exllamav2.linear import ExLlamaV2Linear

File ~/miniconda3/envs/agentic/lib/python3.11/site-packages/exllamav2/config.py:5
      3 import torch
      4 import math
----> 5 from exllamav2.fasttensors import STFile
      6 from exllamav2.architecture import ExLlamaV2ArchParams
      7 import os, glob, json

File ~/miniconda3/envs/agentic/lib/python3.11/site-packages/exllamav2/fasttensors.py:6
      4 import numpy as np
      5 import json
----> 6 from exllamav2.ext import exllamav2_ext as ext_c
      7 import os
      9 def convert_dtype(dt: str):

File ~/miniconda3/envs/agentic/lib/python3.11/site-packages/exllamav2/ext.py:281
    278     timer.cancel()
    279     end_build_feedback()
--> 281 ext_c = exllamav2_ext
    284 # Dummy tensor to pass to C++ extension in place of None/NULL
    286 none_tensor = torch.empty((1, 1), device = "meta")

NameError: name 'exllamav2_ext' is not defined

turboderp commented 2 weeks ago

But this would imply that all the code before line 281 in ext.py runs. It includes:

build_jit = False
try:
    import exllamav2_ext
except ModuleNotFoundError:
    build_jit = True
except ImportError as e:
    if "undefined symbol" in str(e):
        print("\"undefined symbol\" error here usually means you are attempting to load a prebuilt extension wheel "
              "that was compiled against a different version of PyTorch than the one you are you using. Please verify "
              "that the versions match.")
        raise e

Which should either define the exllamav2_ext symbol, or set build_jit = True, or raise if there's any other exception besides ModuleNotFoundError. Then, if build_jit is true, this runs further down:

        exllamav2_ext = load \
        (
            name = extension_name,
            sources = sources,
            extra_include_paths = [sources_dir],
            verbose = verbose,
            extra_ldflags = extra_ldflags,
            extra_cuda_cflags = extra_cuda_cflags,
            extra_cflags = extra_cflags
        )

Which once again either raises an exception or defines the exllamav2_ext symbol. And yet somehow that executes without defining the symbol so you get an error right after:

ext_c = exllamav2_ext

Maybe I've just stared at it for too long, but I'm not seeing any error in the logic there.
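
One thing you could try is importing the prebuilt module directly and printing the raw exception, bypassing the fallback logic above. A minimal diagnostic sketch (nothing assumed beyond the extension's module name):

import traceback

try:
    import exllamav2_ext
    print("prebuilt extension OK:", exllamav2_ext.__file__)
except BaseException:
    # Shows the real failure, including anything that isn't a plain
    # ModuleNotFoundError (e.g. an ABI mismatch surfacing as ImportError).
    traceback.print_exc()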

lovebeatz commented 2 weeks ago

As of now, the only way out for me is to look for a previous version where this doesn't happen. I'm not feeling very motivated to do that, though, because I believe you'll fix it.

turboderp commented 2 weeks ago

Problem is I can't reproduce the error. The only time I've ever seen something like it is when someone has a wrong version of Torch installed for their hardware, but that doesn't seem to be the case here.

I can suggest you maybe try clearing out the extension cache directory at ~/.cache/torch_extensions, since maybe there are some corrupted build files there. Other than that, if you could provide some more details about your hardware and what library versions you have installed:

nvidia-smi
pip show torch exllamav2
nvcc --version
gcc --version
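
On the cache point, something like this clears it; a sketch assuming the default location (torch honors TORCH_EXTENSIONS_DIR if you have relocated it):

import os, shutil

cache_dir = os.environ.get("TORCH_EXTENSIONS_DIR",
                           os.path.expanduser("~/.cache/torch_extensions"))
if os.path.isdir(cache_dir):
    shutil.rmtree(cache_dir)  # forces a clean rebuild on the next import
    print("removed", cache_dir)
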
lovebeatz commented 2 weeks ago

I tried it on servers with an L4 and a 3090. I create test environments and delete whatever doesn't work. As you asked, I re-followed the process with manual pre-installation of PyTorch via conda: 2.3.1 with CUDA 12.1.

lovebeatz commented 2 weeks ago

This is after a direct wheel install with Python 3.9; next I'll try PyTorch 2.0.1 with CUDA 11.8.

Sat Jun 15 03:03:26 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:00:05.0 Off |                  N/A |
|  0%   27C    P8             18W / 350W  |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

Name: torch
Version: 2.3.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/ubuntu/miniconda3/envs/test/lib/python3.9/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by: exllamav2

Name: exllamav2
Version: 0.1.5+cu117.torch2.0.1
Summary:
Home-page: https://github.com/turboderp/exllamav2
Author: turboderp
Author-email:
License: MIT
Location: /home/ubuntu/miniconda3/envs/test/lib/python3.9/site-packages
Requires: fastparquet, ninja, numpy, pandas, pygments, regex, safetensors, sentencepiece, torch, websockets
Required-by:

/bin/bash: line 1: nvcc: command not found

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

turboderp commented 2 weeks ago

For that PyTorch version at least, you'll want this wheel:

pip install -U https://github.com/turboderp/exllamav2/releases/download/v0.1.5/exllamav2-0.1.5+cu121.torch2.3.1-cp311-cp311-linux_x86_64.whl

Assuming you're still on Python 3.11
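
To read off which wheel tags match your environment, a small sketch (the cpXYZ, cuXYZ and torchX.Y.Z tags in the wheel filename should line up with these values):

import sys, torch

print("python:", sys.version_info[:2])    # (3, 11) -> cp311
print("torch:", torch.__version__)        # e.g. "2.3.1+cu121" -> torch2.3.1
print("torch CUDA:", torch.version.cuda)  # e.g. "12.1" -> cu121
print("CUDA available:", torch.cuda.is_available())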

lovebeatz commented 2 weeks ago

Also, if I clone the repo and follow the steps with Python 3.11, pip install . fails:

(exllama) ubuntu@gc-vigilant-exllama:~/libraries/exllamav2$ pip install .
Processing /home/ubuntu/libraries/exllamav2
  Preparing metadata (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py egg_info did not run successfully.
  │ exit code: 1
  ╰─> [14 lines of output]
      Traceback (most recent call last):
        File "<string>", line 2, in <module>
        File "<pip-setuptools-caller>", line 34, in <module>
        File "/home/ubuntu/libraries/exllamav2/setup.py", line 31, in <module>
          cpp_extension.CUDAExtension(
        File "/home/ubuntu/miniconda3/envs/exllama/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1077, in CUDAExtension
          library_dirs += library_paths(cuda=True)
                          ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/ubuntu/miniconda3/envs/exllama/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 1204, in library_paths
          if (not os.path.exists(_join_cuda_home(lib_dir)) and
                  ^^^^^^^^^^^^^^^^^^^^^^^^
        File "/home/ubuntu/miniconda3/envs/exllama/lib/python3.11/site-packages/torch/utils/cpp_extension.py", line 2419, in _join_cuda_home
          raise OSError('CUDA_HOME environment variable is not set. '
      OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root.
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> See above for output.

note: This is an issue with the package mentioned above, not pip.

Hardware details:

Sat Jun 15 03:23:31 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:00:05.0 Off |                  N/A |
|  0%   27C    P8             18W / 350W  |      1MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

WARNING: Package(s) not found: exllamav2

Name: torch
Version: 2.3.1
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /home/ubuntu/miniconda3/envs/exllama/lib/python3.11/site-packages
Requires: filelock, fsspec, jinja2, networkx, nvidia-cublas-cu12, nvidia-cuda-cupti-cu12, nvidia-cuda-nvrtc-cu12, nvidia-cuda-runtime-cu12, nvidia-cudnn-cu12, nvidia-cufft-cu12, nvidia-curand-cu12, nvidia-cusolver-cu12, nvidia-cusparse-cu12, nvidia-nccl-cu12, nvidia-nvtx-cu12, sympy, triton, typing-extensions
Required-by:

/bin/bash: line 1: nvcc: command not found

gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Copyright (C) 2021 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

turboderp commented 2 weeks ago

That error is from not having the CUDA toolkit installed, which is needed for building from source. But try with a cu121 wheel to match your PyTorch version.
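
You can also check where torch's extension builder resolves the toolkit; a small sketch (CUDA_HOME here is the path torch resolved, None if no toolkit was found):

import os
from torch.utils.cpp_extension import CUDA_HOME

print("CUDA_HOME env var:", os.environ.get("CUDA_HOME"))
print("toolkit resolved by torch:", CUDA_HOME)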

lovebeatz commented 2 weeks ago

> For that PyTorch version at least, you'll want this wheel:
>
> pip install -U https://github.com/turboderp/exllamav2/releases/download/v0.1.5/exllamav2-0.1.5+cu121.torch2.3.1-cp311-cp311-linux_x86_64.whl
>
> Assuming you're still on Python 3.11

Installed this wheel, https://github.com/turboderp/exllamav2/releases/download/v0.1.5/exllamav2-0.1.5+cu118.torch2.3.1-cp312-cp312-linux_x86_64.whl, while on Python 3.12, and ran:

from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

{ "name": "AttributeError", "message": "module 'torch' has no attribute 'version'", "stack": "--------------------------------------------------------------------------- AttributeError Traceback (most recent call last) Cell In[1], line 1 ----> 1 from exllamav2.generator import ExLlamaV2StreamingGenerator, ExLlamaV2Sampler

File ~/miniconda3/envs/exllama/lib/python3.12/site-packages/exllamav2/init.py:3 1 from exllamav2.version import version ----> 3 from exllamav2.model import ExLlamaV2 4 from exllamav2.cache import ExLlamaV2CacheBase 5 from exllamav2.cache import ExLlamaV2Cache

File ~/miniconda3/envs/exllama/lib/python3.12/site-packages/exllamav2/model.py:25 15 # # Set cudaMallocAsync allocator by default as it appears slightly more memory efficient, unless Torch is already 16 # # imported in which case changing the allocator would cause it to crash 17 # if not \"PYTORCH_CUDA_ALLOC_CONF\" in os.environ: (...) 20 # except NameError: 21 # os.environ[\"PYTORCH_CUDA_ALLOC_CONF\"] = \"backend:cudaMallocAsync\" 23 import torch ---> 25 if not (torch.version.cuda or torch.version.hip): 26 print(\"\") 27 print(f\" ## Warning: The installed version of PyTorch is {torch.version} and does not support CUDA or ROCm.\")

AttributeError: module 'torch' has no attribute 'version'" }

lovebeatz commented 2 weeks ago

Finally, the wheel install worked: I manually installed PyTorch 2.3.1 with CUDA 12.1 on Python 3.11, then installed https://github.com/turboderp/exllamav2/releases/download/v0.1.5/exllamav2-0.1.5+cu121.torch2.3.1-cp311-cp311-linux_x86_64.whl

tokenizers and rich require a separate pip install.

Also, what's your take on tabbyAPI? Is it regularly updated? What would be the best way to serve EXL2 models: via the exllamav2 LangChain integration, or via tabbyAPI through the OpenAI LangChain client?

lovebeatz commented 2 weeks ago

Also, is there anything you can tell me about using the ChatML prompt template via exllamav2? I want to serve hermes-2-pro/hermes-2-theta.

lovebeatz commented 2 weeks ago

So if anyone wants to go directly with a wheel install, without manually installing torch first, here's the workaround: whichever wheel you pick, a certain torch/CUDA version gets installed by default, so pick the wheel that matches that default torch install and no error shows up. The one that worked recently, with Python 3.11 and no manual torch install: https://github.com/turboderp/exllamav2/releases/download/v0.1.5/exllamav2-0.1.5+cu121.torch2.3.1-cp311-cp311-linux_x86_64.whl

lovebeatz commented 1 week ago

Also, how do I use flash attention?

turboderp commented 1 week ago

Just install it, and it will be used by default:

pip install flash-attn
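
To verify it's actually picked up, a hedged check (has_flash_attn in exllamav2.attn is an implementation detail and may move between versions):

import flash_attn
print("flash-attn:", flash_attn.__version__)

from exllamav2 import attn
print("used by exllamav2:", attn.has_flash_attn)  # set at import time if the probe succeeded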

lovebeatz commented 1 week ago

What's your take on tabbyAPI? Is it regularly updated? What would be the best way to serve EXL2 models: via the exllamav2 LangChain integration, or via tabbyAPI through the OpenAI LangChain client? And is there anything you can tell me about using the ChatML prompt template via exllamav2? I want to serve hermes-2-pro/hermes-2-theta.

turboderp commented 1 week ago

Tabby is still alive and well, getting frequent updates. I don't really have an opinion on LangChain as I've never used it (or found much use for it) but Tabby provides an OAI-compatible endpoint so you can use it with whatever frontend or framework supports that.
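
Since the endpoint is OAI-compatible, the standard OpenAI Python client works against it. A minimal sketch; the URL, port, key and model id below are placeholders for your own Tabby config, not guaranteed defaults:

from openai import OpenAI

# Point these at your running tabbyAPI instance (placeholders).
client = OpenAI(base_url="http://localhost:5000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="my-exl2-model",  # placeholder model id
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)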