It seems like it's related to this issue (https://github.com/PromtEngineer/localGPT/issues/251). Can you try to downgrade auto-gptq to version 0.2.2 (pip install auto-gptq==0.2.2) and check if it works then?
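A quick way to confirm the downgrade actually took effect (a minimal sketch; run it in the same Python environment that launches llama.py):

import importlib.metadata

# After the downgrade this should print 0.2.2.
print(importlib.metadata.version("auto-gptq"))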
Unfortunately, it still says the CUDA extension is not installed. It then serves a local URL, but gives me an error message when I try to chat with it:
D:\llama2_local>python llama.py --model_name="TheBloke/Llama-2-7b-Chat-GPTQ"
CUDA extension not installed.
The safetensors archive passed at C:\Users\Kevin/.cache\huggingface\hub\models--TheBloke--Llama-2-7b-Chat-GPTQ\snapshots\67960731b976925842e84dcaf1bbd693e58c449e\gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Running on local URL: http://127.0.0.1:7860
Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.
Traceback (most recent call last):
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\routes.py", line 439, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1389, in process_api
result = await self.call_function(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1108, in call_function
prediction = await utils.async_iteration(iterator)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 347, in async_iteration
return await iterator.__anext__()
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 340, in __anext__
return await anyio.to_thread.run_sync(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
result = context.run(func, *args)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 323, in run_sync_iterator_async
return next(iterator)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 692, in gen_wrapper
yield from f(*args, **kwargs)
File "D:\llama2_local\llama.py", line 85, in bot
inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\auto_gptq\modeling\_base.py", line 411, in device
device = [d for d in self.hf_device_map.values() if d not in {'cpu', 'disk'}][0]
IndexError: list index out of range
I have the same error, but I suppose that is because I am not using an NVIDIA GPU. Is there a way to run it on AMD or Intel?
Could you check if editing line 37 in the llama.py file helps? Instead of passing use_triton=False, pass use_triton=True.
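For reference, a hypothetical sketch of what the load call around line 37 might look like with that change (not the repo's exact code; note that use_triton=True only has an effect if Triton is installed, which is Linux-only):

from auto_gptq import AutoGPTQForCausalLM

# Hypothetical reconstruction of the llama.py load call, arguments may differ in the repo.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device="cuda:0",
    use_safetensors=True,
    use_triton=True,  # was use_triton=False
)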
Nope, it still says the CUDA extension is not installed and gives an error when I try to chat with it.
CUDA extension not installed.
The safetensors archive passed at C:\Users\Kevin/.cache\huggingface\hub\models--TheBloke--Llama-2-7b-Chat-GPTQ\snapshots\67960731b976925842e84dcaf1bbd693e58c449e\gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████| 727/727 [00:00<?, ?B/s]
Downloading tokenizer.model: 100%|██████████████████████████████████████████████████| 500k/500k [00:00<00:00, 11.1MB/s]
Downloading (…)/main/tokenizer.json: 100%|████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.17MB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████| 411/411 [00:00<00:00, 395kB/s]
Running on local URL: http://127.0.0.1:7860
Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.
Traceback (most recent call last):
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\routes.py", line 439, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1389, in process_api
result = await self.call_function(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1108, in call_function
prediction = await utils.async_iteration(iterator)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 347, in async_iteration
return await iterator.__anext__()
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 340, in __anext__
return await anyio.to_thread.run_sync(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
result = context.run(func, *args)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 323, in run_sync_iterator_async
return next(iterator)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 692, in gen_wrapper
yield from f(*args, **kwargs)
File "D:\llama2_local\llama.py", line 85, in bot
inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\auto_gptq\modeling\_base.py", line 411, in device
device = [d for d in self.hf_device_map.values() if d not in {'cpu', 'disk'}][0]
IndexError: list index out of range
I managed to find the solution to this problem!
You must first uninstall auto-gptq to install it from source.
To do this, after uninstalling auto-gptq, run:
git clone https://github.com/PanQiWei/AutoGPTQ.git
cd AutoGPTQ
pip install -e .
The source should now install autogptq_cuda automatically! Full details here: https://github.com/PanQiWei/AutoGPTQ#install-from-source.
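To verify the source build actually produced the CUDA kernels, a small check (sketch) is to import the extension module directly:

import importlib

try:
    importlib.import_module("autogptq_cuda")  # the compiled extension the source build installs
    print("autogptq_cuda is installed")
except ImportError:
    print("CUDA kernels still missing -- the build probably did not find a matching CUDA toolkit")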
But now I have this problem :') :
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 33554432 bytes.
It looks like 32 MB is being allocated in CPU RAM, but with python llama.py --model_name="TheBloke/Llama-2-7b-Chat-GPTQ"
isn't it supposed to run on the GPU? And even if some tensors are stored in RAM, I have more than enough to accommodate 32 MB... If it depends on the graphics card, what kind of graphics card do I need? Is there a way to run as much of the model as possible on the GPU and the rest in RAM?
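One thing I'm considering trying for the GPU/RAM split (just a sketch, assuming the installed auto-gptq accepts accelerate-style device_map/max_memory arguments on from_quantized; the limits below are illustrative):

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    use_safetensors=True,
    device_map="auto",                       # let accelerate place layers across GPU and CPU
    max_memory={0: "5GiB", "cpu": "16GiB"},  # illustrative caps: fill GPU 0 first, spill the rest to RAM
)

Layers offloaded to CPU run much slower, so this trades speed for fitting the model at all.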
@kevinkaw & @SamiWh22 - I worked through some of these same issues and got it all working on both Linux & Windows machines locally (my Ubuntu Nvidia driver is still not stable).
>>> import torch
>>> torch.cuda.is_available() # this should return True.
With 32 GB RAM and an RTX A2000 8GB, it's still very slow with the basic (non-quantized) versions. I'll need to put it on a server/HF Space so I don't have to set this up every time and so it gets the right performance. But outside of that it works well.
It finally works!!! Thanks @unmeshk75. Here's what happened
I ran it with the 7B-chat model, but unfortunately it is disappointingly slow. My GPU is a GTX 1060 6GB and my CPU is an i5-7400... the CPU version actually generates words faster.
I have been trying to run GPTQ models, but I'm getting the error below. I tried the following and nothing worked:
conda install -c nvidia cuda