Closed. Titaniumtown closed this issue 1 year ago.
How did you load LLaMA-13B into a 16GB GPU without 8-bit?
using --auto-devices
13b/20b models are loaded in 8-bit mode by default (when no flags are specified) because they are too large to fit in consumer GPUs.
--auto-devices disables this default behavior without the need for any manual changes to the code.
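For reference, a minimal sketch of the kind of logic described above, assuming the usual transformers loading path (the function name and the models/ path are illustrative; the real modules/models.py has more branches):

from transformers import AutoModelForCausalLM

def load_model(model_name, auto_devices=False):
    # 13B/20B checkpoints default to 8-bit unless --auto-devices is passed
    is_large = any(size in model_name.lower() for size in ("13b", "20b"))
    load_in_8bit = is_large and not auto_devices
    return AutoModelForCausalLM.from_pretrained(
        f"models/{model_name}",
        device_map="auto" if (auto_devices or load_in_8bit) else None,
        load_in_8bit=load_in_8bit,
    )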
Fixed it, got 8-bit working, had to update bitsandbytes-rocm to use rocm 5.4.0 https://github.com/Titaniumtown/bitsandbytes-rocm/tree/patch-1 sent in a pull request. https://github.com/broncotc/bitsandbytes-rocm/pull/4
Edit: seems that the 6900xt itself has issues with int8, which this fork (https://github.com/0cc4m/bitsandbytes-rocm/tree/rocm) seems to try to address, but it has its own issues. Doing some investigation.
Edit 2: relates to this issue (https://github.com/TimDettmers/bitsandbytes/issues/165)
Edit 3: turns out it's something wrong with the generation settings? It only seems to fail when using the "NovelAI Sphinx Moth" preset among others.
Nice @Titaniumtown, thanks for the update.
@oobabooga do you understand anything about what could be causing the generation issues? It seems to only be the case with specific combinations of generation settings.
What error appears when you use sphinx moth? This is a preset with high temperature and small top_k and top_p for creative but coherent outputs.
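For anyone following along, a preset like that roughly maps onto sampling kwargs like these (illustrative values only, not the exact Sphinx Moth numbers; assumes model and input_ids are already loaded/tokenized as the webui does):

output_ids = model.generate(
    input_ids,
    do_sample=True,
    temperature=1.9,   # high temperature flattens the distribution (more creative)
    top_k=30,          # small top_k keeps only the most likely tokens
    top_p=0.2,         # small top_p tightens the nucleus further (more coherent)
    max_new_tokens=200,
)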
0%| | 0/26 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/gradio/routes.py", line 374, in run_predict
output = await app.get_blocks().process_api(
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/gradio/blocks.py", line 1017, in process_api
result = await self.call_function(
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/gradio/blocks.py", line 849, in call_function
prediction = await anyio.to_thread.run_sync(
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/gradio/utils.py", line 453, in async_iteration
return next(iterator)
File "/var/home/riley/text-generation-webui/modules/text_generation.py", line 188, in generate_reply
output = eval(f"shared.model.generate({', '.join(generate_params)}){cuda}")[0]
File "<string>", line 1, in <module>
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/transformers/generation/utils.py", line 1452, in generate
return self.sample(
File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/transformers/generation/utils.py", line 2504, in sample
next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
RuntimeError: probability tensor contains either `inf`, `nan` or element < 0
I get this error when I try to use 8-bit mode in my GTX 1650. It's an upstream issue in the bitsandbytes library, as you found.
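For anyone wondering what the traceback means: if the int8 matmul produces NaN or inf logits, softmax turns them into an invalid probability tensor and torch.multinomial raises exactly this error. A minimal reproduction:

import torch

logits = torch.tensor([[float("nan"), 1.0, 2.0]])  # stand-in for a broken int8 matmul output
probs = torch.softmax(logits, dim=-1)              # the NaN propagates into every probability
torch.multinomial(probs, num_samples=1)            # RuntimeError: probability tensor contains `inf`, `nan` or element < 0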
Ah, so there's nothing I can do about it. Sad. Thanks!
Change the 8-bit threshold. It will probably help on AMD as well. I can't test because my old card doesn't work with ROCm due to AGP 2.0; it only works in Windows.
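For anyone who wants to try that outside the webui, newer transformers versions expose the threshold through BitsAndBytesConfig (hedged sketch; the model path is illustrative and the webui may plumb this setting differently):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Lowering llm_int8_threshold (default 6.0) routes more activations through the
# fp16 "outlier" path instead of int8, which can dodge broken int8 kernels on some cards.
quant_config = BitsAndBytesConfig(load_in_8bit=True, llm_int8_threshold=4.0)
model = AutoModelForCausalLM.from_pretrained(
    "models/llama-13b-hf",   # illustrative path
    device_map="auto",
    quantization_config=quant_config,
)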
@Ph0rk0z I just use 4bit models now. Works like a dream and has much better performance.
@Titaniumtown can you share how to use 4bit model for AMD GPU? I was looking at https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model, but Step 1: Installation for GPTQ-for-LLaMa requires CUDA?
It does not require CUDA; ROCm works just fine. I just ran the script like Nvidia users do and it worked perfectly.
Thank you! I will give it a try
@Titaniumtown I tried to set things up and run just like the guide explains. I mean, as you said, I just ran the script like an Nvidia user would. But I get errors about missing headers when running "python setup_cuda.py install": https://github.com/oobabooga/text-generation-webui/issues/487
Could you help me? Am I missing something important? I'm new to all this stuff, btw. I'm sure I'm not understanding, or missing, something.
@VivaPeron do you have cuda installed?
Getting these errors when trying to compile GPTQ-for-LLaMa:
/home/viliger/text-generation-webui/repositories/GPTQ-for-LLaMa/quant_hip_kernel.hip:653:10: error: use of overloaded operator '=' is ambiguous (with operand types 'half2' (aka '__half2') and 'void')
res2 = {};
/home/viliger/text-generation-webui/repositories/GPTQ-for-LLaMa/quant_hip_kernel.hip:665:12: error: no matching function for call to '__half2float'
res += __half2float(res2.x) + __half2float(res2.y);
The 8-bit model runs fine once I got bitsandbytes-rocm installed. Full compilation log attached: output.txt
@viliger2 @VivaPeron this seems to be caused by GPTQ-for-LLaMa commits after 841feed that use fp16 types. HIP doesn't seem to handle some of the implicit casts, as far as I can tell. Rolling back to that commit results in successful compilation.
@VivaPeron do you have cuda installed?
Yes.
@arctic-marmoset Thanks a lot! I will try this today when I get home from work, and let you guys know.
Btw, these are my PC specs: Xeon E5-2620v2, 16GB ECC DDR3 RAM, AMD RX6600 8GB.
@arctic-marmoset Thanks!
@arctic-marmoset wow, thanks a lot!! Will try your fork today! At last I can put my 6600 to do useful work lol
I also have a fork of the repo with my changes here.
I tried your repo and got this error:
No ROCm runtime is found, using ROCM_HOME='/opt/rocm-5.4.3'
No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
running install
/home/christopher/miniconda3/envs/gptq/lib/python3.9/site-packages/setuptools/command/install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
/home/christopher/miniconda3/envs/gptq/lib/python3.9/site-packages/setuptools/command/easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
warnings.warn(
running bdist_egg
running egg_info
writing quant_cuda.egg-info/PKG-INFO
writing dependency_links to quant_cuda.egg-info/dependency_links.txt
writing top-level names to quant_cuda.egg-info/top_level.txt
/home/christopher/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/utils/cpp_extension.py:476: UserWarning: Attempted to use ninja as the BuildExtension backend but we could not find ninja.. Falling back to using the slow distutils backend.
warnings.warn(msg.format('we could not find ninja.'))
reading manifest file 'quant_cuda.egg-info/SOURCES.txt'
writing manifest file 'quant_cuda.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_ext
building 'quant_cuda' extension
gcc -pthread -B /home/christopher/miniconda3/envs/gptq/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /home/christopher/miniconda3/envs/gptq/include -I/home/christopher/miniconda3/envs/gptq/include -fPIC -O2 -isystem /home/christopher/miniconda3/envs/gptq/include -fPIC -I/home/christopher/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include -I/home/christopher/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/torch/csrc/api/include -I/home/christopher/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/TH -I/home/christopher/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/include/THC -I/usr/local/cuda/include -I/home/christopher/miniconda3/envs/gptq/include/python3.9 -c quant_cuda.cpp -o build/temp.linux-x86_64-cpython-39/quant_cuda.o -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE=\"_gcc\" -DPYBIND11_STDLIB=\"_libstdcpp\" -DPYBIND11_BUILD_ABI=\"_cxxabi1011\" -DTORCH_EXTENSION_NAME=quant_cuda -D_GLIBCXX_USE_CXX11_ABI=0 -std=c++17
/home/christopher/miniconda3/envs/gptq/lib/python3.9/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
warnings.warn("Can't initialize NVML")
Traceback (most recent call last):
File "/home/christopher/GPTQ-for-LLaMA-fork-amd/GPTQ-for-LLaMa-hip/setup_cuda.py", line 12, in
@VivaPeron it seems that you have issues with your ROCm installation. Check whether you have it installed at all, or whether your version is different from 5.4.3.
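A quick way to confirm that from Python: on a working ROCm build of PyTorch, torch.version.hip is a version string; on a CPU- or CUDA-only build it is None.

import torch

print("HIP runtime:", torch.version.hip)          # ROCm version string, or None
print("GPU visible:", torch.cuda.is_available())  # ROCm devices show up through the cuda API
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))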
(textgen) root@gribaai:~/text-generation-webui# python server.py --model llama-13b-4bit-128g --wbits 4 --groupsize 128
CUDA SETUP: Loading binary /root/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libsbitsandbytes_cpu.so...
Loading llama-13b-4bit-128g...
CUDA extension not installed.
Traceback (most recent call last):
File "/root/text-generation-webui/server.py", line 276, in <module>
shared.model, shared.tokenizer = load_model(shared.model_name)
File "/root/text-generation-webui/modules/models.py", line 102, in load_model
model = load_quantized(model_name)
File "/root/text-generation-webui/modules/GPTQ_loader.py", line 114, in load_quantized
model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, kernel_switch_threshold=threshold)
File "/root/text-generation-webui/modules/GPTQ_loader.py", line 36, in _load_quant
make_quant(model, layers, wbits, groupsize, faster=faster_kernel, kernel_switch_threshold=kernel_switch_threshold)
TypeError: make_quant() got an unexpected keyword argument 'faster'
(textgen) root@gribaai:~/text-generation-webui#
How do I install the CUDA extension with an AMD GPU?
If I run "python setup_cuda.py install" inside the "GPTQ-for-LLaMa" folder, it returns this error:
............
File "/root/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1780, in _get_cuda_arch_flags
arch_list[-1] += '+PTX'
IndexError: list index out of range
@belqit Please see @viliger2's comment above. You'll need to install ROCm 5.4.3.
I have a 6900xt and tried to load the LLaMA-13B model, and ended up getting this error:
Going into modules/models.py and setting "load_in_8bit" to False fixed it, but this should work by default.
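For context, that manual change amounts to something like the following at the loading call (hedged sketch; the real modules/models.py has more logic around it, and the path is illustrative):

from transformers import AutoModelForCausalLM

path_to_model = "models/llama-13b-hf"  # illustrative
model = AutoModelForCausalLM.from_pretrained(
    path_to_model,
    device_map="auto",
    load_in_8bit=False,  # flipped from the 13B/20B default mentioned earlier in the thread
)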