It seems like it's related to this issue (https://github.com/PromtEngineer/localGPT/issues/251). Can you try to downgrade auto-gptq to version 0.2.2 (pip install auto-gptq==0.2.2) and check if it works then?
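A quick way to confirm the downgrade actually took effect (a minimal sketch; run it in the same Python environment that launches llama.py):

import importlib.metadata

# After the downgrade this should print 0.2.2.
print(importlib.metadata.version("auto-gptq"))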
Unfortunately, it still says the CUDA extension is not installed. It then serves a local URL, but gives me an error message when I try to chat with it:
D:\llama2_local>python llama.py --model_name="TheBloke/Llama-2-7b-Chat-GPTQ"
CUDA extension not installed.
The safetensors archive passed at C:\Users\Kevin/.cache\huggingface\hub\models--TheBloke--Llama-2-7b-Chat-GPTQ\snapshots\67960731b976925842e84dcaf1bbd693e58c449e\gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Running on local URL: http://127.0.0.1:7860
Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.
Traceback (most recent call last):
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\routes.py", line 439, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1389, in process_api
result = await self.call_function(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1108, in call_function
prediction = await utils.async_iteration(iterator)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 347, in async_iteration
return await iterator.__anext__()
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 340, in __anext__
return await anyio.to_thread.run_sync(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
result = context.run(func, *args)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 323, in run_sync_iterator_async
return next(iterator)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 692, in gen_wrapper
yield from f(*args, **kwargs)
File "D:\llama2_local\llama.py", line 85, in bot
inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\auto_gptq\modeling\_base.py", line 411, in device
device = [d for d in self.hf_device_map.values() if d not in {'cpu', 'disk'}][0]
IndexError: list index out of range
I have the same error, but I suppose that is because I am not using an NVIDIA GPU. Is there a way to run it on AMD or Intel?
Could you check if editing line 37 in the llama.py file helps? Instead of passing use_triton=False, pass use_triton=True.
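For reference, a hypothetical sketch of what the load call around line 37 might look like with that change (not the repo's exact code; note that use_triton=True only has an effect if Triton is installed, which is Linux-only):

from auto_gptq import AutoGPTQForCausalLM

# Hypothetical reconstruction of the llama.py load call, arguments may differ in the repo.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device="cuda:0",
    use_safetensors=True,
    use_triton=True,  # was use_triton=False
)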
Nope, it still says the CUDA extension is not installed and gives an error when I try to chat with it.
CUDA extension not installed.
The safetensors archive passed at C:\Users\Kevin/.cache\huggingface\hub\models--TheBloke--Llama-2-7b-Chat-GPTQ\snapshots\67960731b976925842e84dcaf1bbd693e58c449e\gptq_model-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████| 727/727 [00:00<?, ?B/s]
Downloading tokenizer.model: 100%|██████████████████████████████████████████████████| 500k/500k [00:00<00:00, 11.1MB/s]
Downloading (…)/main/tokenizer.json: 100%|████████████████████████████████████████| 1.84M/1.84M [00:00<00:00, 3.17MB/s]
Downloading (…)cial_tokens_map.json: 100%|█████████████████████████████████████████████| 411/411 [00:00<00:00, 395kB/s]
Running on local URL: http://127.0.0.1:7860
Could not create share link. Please check your internet connection or our status page: https://status.gradio.app.
Traceback (most recent call last):
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\routes.py", line 439, in run_predict
output = await app.get_blocks().process_api(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1389, in process_api
result = await self.call_function(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\blocks.py", line 1108, in call_function
prediction = await utils.async_iteration(iterator)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 347, in async_iteration
return await iterator.__anext__()
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 340, in __anext__
return await anyio.to_thread.run_sync(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\to_thread.py", line 33, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 877, in run_sync_in_worker_thread
return await future
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\anyio\_backends\_asyncio.py", line 807, in run
result = context.run(func, *args)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 323, in run_sync_iterator_async
return next(iterator)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\gradio\utils.py", line 692, in gen_wrapper
yield from f(*args, **kwargs)
File "D:\llama2_local\llama.py", line 85, in bot
inputs = tokenizer(instruction, return_tensors="pt").to(model.device)
File "C:\Users\Kevin\AppData\Local\Programs\Python\Python310\lib\site-packages\auto_gptq\modeling\_base.py", line 411, in device
device = [d for d in self.hf_device_map.values() if d not in {'cpu', 'disk'}][0]
IndexError: list index out of range
I managed to find the solution to this problem!
You must first uninstall auto-gptq to install it from source.
To do this, after uninstalling auto-gptq, run:
git clone https://github.com/PanQiWei/AutoGPTQ.git
cd AutoGPTQ
pip install -e .
The source should now install autogptq_cuda automatically! Full details here: https://github.com/PanQiWei/AutoGPTQ#install-from-source.
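To verify the source build actually produced the CUDA kernels, a small check (sketch) is to import the extension module directly:

import importlib

try:
    importlib.import_module("autogptq_cuda")  # the compiled extension the source build installs
    print("autogptq_cuda is installed")
except ImportError:
    print("CUDA kernels still missing -- the build probably did not find a matching CUDA toolkit")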
But now I have this problem :') :
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 33554432 bytes.
It looks like 32 MB is being allocated in CPU RAM, but with python llama.py --model_name="TheBloke/Llama-2-7b-Chat-GPTQ"
isn't it supposed to run on the GPU? And even if some tensors are stored in RAM, I have more than enough to accommodate 32 MB... If it depends on the graphics card, what kind of graphics card do I need? Is there a way to run as much of the model as possible on the GPU and the rest in RAM?
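One thing I'm considering trying for the GPU/RAM split (just a sketch, assuming the installed auto-gptq accepts accelerate-style device_map/max_memory arguments on from_quantized; the limits below are illustrative):

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    use_safetensors=True,
    device_map="auto",                       # let accelerate place layers across GPU and CPU
    max_memory={0: "5GiB", "cpu": "16GiB"},  # illustrative caps: fill GPU 0 first, spill the rest to RAM
)

Layers offloaded to CPU run much slower, so this trades speed for fitting the model at all.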
@kevinkaw & @SamiWh22 - I worked through some of these same issues and got it all working on both Linux & Windows machines locally (my Ubuntu Nvidia driver is still not stable).
>>> import torch
>>> torch.cuda.is_available() # this should return True.
With 32 GB RAM and an RTX A2000 8GB, it's still very slow with the basic (non-quantized) versions. I'll need to put it on a server/HF Space so I don't have to set this up every time and so it gets the right performance. But outside of that it works well.
It finally works!!! Thanks @unmeshk75. Here's what happened
I ran it with the 7B-chat model, but unfortunately it is disappointingly slow. My GPU is a GTX 1060 6GB and my CPU is an i5-7400... the CPU version actually generates words faster.
I have been trying to run GPTQ models, but I'm getting the error below. I tried the following and nothing worked:
conda install -c nvidia cuda