oobabooga / text-generation-webui

A Gradio web UI for Large Language Models.
GNU Affero General Public License v3.0

28 percent faster CUDA GPTQ with both act-order and groupsize supported #1884

Closed iwalton3 closed 11 months ago

iwalton3 commented 1 year ago

I have a CUDA version of GPTQ that works with both act-order and groupsize enabled. It is roughly 28 percent faster than the Triton version. This should fix a lot of the compatibility problems people have been having with the old CUDA branch, specifically the cases where it generates gibberish text.

You can find it here: https://github.com/iwalton3/GPTQ-for-LLaMa

It is based on: https://github.com/sterlind/GPTQ-for-LLaMa

You can set it up in the webui by running the following from the text-generation-webui directory:

mkdir -p repositories
cd repositories
git clone https://github.com/iwalton3/GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install .
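
If the install worked, the gptq_llama package should import cleanly from the same Python environment the webui uses. Here is a quick sanity check (a convenience snippet for this thread, not part of the repo):

# Verify that the gptq_llama package installed by "pip install ." is importable.
import importlib

for mod in ("gptq_llama", "gptq_llama.llama_inference_offload"):
    try:
        importlib.import_module(mod)
        print("OK:", mod)
    except ImportError as err:
        print("MISSING:", mod, "-", err)
        print("Re-run 'pip install .' inside repositories/GPTQ-for-LLaMa")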
Zhincore commented 1 year ago

I get the following error when loading a model:

Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.bias", "model.layers.0.self_attn.o_proj.bias", "model.layers.0.self_attn.q_proj.bias", "model.layers.0.self_attn.v_proj.bias", "model.layers.0.mlp.down_proj.bias", "model.layers.0.mlp.gate_proj.bias", "model.layers.0.mlp.up_proj.bias", ... (the same seven self_attn and mlp bias keys are reported as missing for every layer from model.layers.1 through model.layers.39) ..., "model.layers.39.mlp.up_proj.bias".
iwalton3 commented 1 year ago

I pushed an update that speeds up startup with large inputs and may also correct that issue.

Tom-Neverwinter commented 1 year ago

No errors; the initial commit seems to have shaved off 4 seconds.

INFO:Loading wizardLM-7B-HF...
WARNING:Auto-assiging --gpu-memory 10 for your GPU to try to prevent out-of-memory errors. You can manually set other values.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:30<00:00, 15.17s/it]
INFO:Replaced attention with sdp_attention
INFO:Loaded the model in 30.78 seconds.
Output generated in 12.12 seconds (1.82 tokens/s, 22 tokens, context 68, seed 1448524449)

testing on these specs: https://pastebin.com/wRVCpcep

I am running with "call pip install einops" added to my webui.

Tom-Neverwinter commented 1 year ago

Follow-up for the new commit:

INFO:Loading wizardLM-7B-HF...
WARNING:Auto-assiging --gpu-memory 10 for your GPU to try to prevent out-of-memory errors. You can manually set other values.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:30<00:00, 15.06s/it]
INFO:Replaced attention with sdp_attention
INFO:Loaded the model in 30.52 seconds.
Output generated in 7.77 seconds (1.67 tokens/s, 13 tokens, context 48, seed 1465798046)

Adding more generations just in case:

Output generated in 7.77 seconds (1.67 tokens/s, 13 tokens, context 48, seed 1465798046)
Output generated in 7.31 seconds (1.91 tokens/s, 14 tokens, context 81, seed 241244980)
Output generated in 9.08 seconds (1.87 tokens/s, 17 tokens, context 116, seed 1346762798)

testing on these specs: https://pastebin.com/wRVCpcep

Initial loading is slower? (Opening the oobabooga webui was much slower?) I am running with "call pip install einops" added to my webui.

Based on my math this is a 25% improvement: 16 seconds - 12 seconds = 4 seconds, so (4 seconds / 16 seconds) * 100 = 25%.

iwalton3 commented 1 year ago

It might be that the initial load is slower; this update adds the kernel switch optimization back. It means that if you start with a huge prompt, it doesn't sit there for 20 seconds before generating anything.
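
For context, the idea behind the kernel switch is roughly this (a minimal sketch, not the actual QuantLinear code; the function names and threshold value are illustrative only): single-token decode steps go through the fused 4-bit kernel, while long prompt ingestion dequantizes the weights once and falls back to a regular fp16 matmul, which is much faster for large batches of tokens.

import torch

# Illustrative dispatch only -- not the real GPTQ-for-LLaMa implementation.
KERNEL_SWITCH_THRESHOLD = 128  # number of tokens; value chosen for illustration

def quant_matmul(x, dequantize_weights, fused_4bit_matmul):
    """Pick a matmul path based on how many tokens are being processed."""
    tokens = x.reshape(-1, x.shape[-1]).shape[0]
    if tokens < KERNEL_SWITCH_THRESHOLD:
        # Decode steps: the custom 4-bit CUDA kernel avoids materializing fp16 weights.
        return fused_4bit_matmul(x)
    # Prompt ingestion: dequantize once, then use the heavily optimized dense matmul.
    w = dequantize_weights()  # fp16 tensor of shape (in_features, out_features)
    return torch.matmul(x.to(torch.float16), w)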

Tom-Neverwinter commented 1 year ago

only remaining issue I see:

Traceback (most recent call last):
File "C:\Users\Tom_N\Desktop\oobabooga-windows\oobabooga-windows\text-generation-webui\server.py", line 59, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name)
File "C:\Users\Tom_N\Desktop\oobabooga-windows\oobabooga-windows\text-generation-webui\modules\models.py", line 157, in load_model
from modules.GPTQ_loader import load_quantized
File "C:\Users\Tom_N\Desktop\oobabooga-windows\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 15, in <module>
import llama_inference_offload
File "C:\Users\Tom_N\Desktop\oobabooga-windows\oobabooga-windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 1, in <module>
from gptq_llama.llama_inference_offload import *
ModuleNotFoundError: No module named 'gptq_llama'
iwalton3 commented 1 year ago

I updated it again to make sure the faster kernel gets used. Note that you need to run the pip install . command before you can use this branch.

Zhincore commented 1 year ago

Now it loads! Pretty quick too, without Triton doing its thing. But I get Offload_LlamaModel.forward() got an unexpected keyword argument 'position_ids' when trying to generate.

iwalton3 commented 1 year ago

Update: I patched that issue and can confirm that CPU offloading works. I also tested and confirmed that my version works with the --monkey-patch flag to load LoRAs alongside a GPTQ model! (I have not tested training LoRAs with this. Note that it does NOT seem to support CPU offloading while using a LoRA.)
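
For anyone who hits the same position_ids error on an older checkout: it comes from a wrapper module whose forward() signature predates a keyword argument that newer transformers versions pass through. A minimal illustration of the failure mode and the general shape of the fix (not the actual patch in this repo):

import torch
import torch.nn as nn

class BrokenOffload(nn.Module):
    # Signature written against an older transformers API: calling it with
    # position_ids=... raises "forward() got an unexpected keyword argument".
    def forward(self, input_ids, attention_mask=None):
        return input_ids

class PatchedOffload(nn.Module):
    # Accept the newer keyword explicitly (or soak up extras via **kwargs).
    def forward(self, input_ids, attention_mask=None, position_ids=None, **kwargs):
        return input_ids

x = torch.ones(1, 4, dtype=torch.long)
PatchedOffload()(x, position_ids=torch.arange(4).unsqueeze(0))  # works
# BrokenOffload()(x, position_ids=...) would raise a TypeError here.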

iwalton3 commented 1 year ago

Update: I verified that training LoRAs also works with this branch. I did need to use half the batch size that I would use to train the same model in 8-bit mode, though, so I don't think it is super efficient.

bkutasi commented 1 year ago

Great work! I love to see speed bumps. I will try to test it out today and report back. How is cross-platform support? I presume Windows and Linux are both supported? What is your base speed with 4-bit 7B-13B models with 128 groupsize?

bkutasi commented 1 year ago

I have to report that I hit a compilation error when trying to get this to run. (I initially wrote that https://github.com/sterlind/GPTQ-for-LLaMa works like a breeze, but it only partially completes and doesn't work after all.) I'm on Windows 10; it's probably a problem with the C++ build tools. Is the Windows installation confirmed to work?

iwalton3 commented 1 year ago

I have only tested in Linux and WSL so far. What exact error do you get? Note that I couldn't get the original GPTQ to work from the automatic installer on Windows without WSL. (It would crash after loading with no error message.)

city96 commented 1 year ago

It seems to run into the following compile error for me on my Pascal-based card (Tesla P40 with Compute 6.1) on Ubuntu 22.04.2 LTS:

GPTQ-for-LLaMa/src/gptq_llama/quant_cuda/quant_cuda_kernel.cu(977): error: no instance of overloaded function "atomicAdd" matches the argument list
    argument types are: (__half *, c10::Half)
    detected during instantiation of "void VecQuant4MatMulKernelFaster(const half2 *, const int *, scalar_t *, const scalar_t *, const int *, const int *, int, int, int, int, int) [with scalar_t=c10::Half]"

Here is the full log for pip install .

iwalton3 commented 1 year ago

What happens if you build it with this code added? https://forums.developer.nvidia.com/t/atomicadd-not-overloaded-for-c10-half/204474/5

city96 commented 1 year ago

As far as I can tell, that code is already present in /src/gptq_llama/quant_cuda/quant_cuda_kernel.cu. Removing the #ifdef so it always runs doesn't change anything either. I'm not too well versed in C++, but could the problem be that the override is for (c10::Half *, c10::Half), whereas the error occurs when trying to atomicAdd the types (__half *, c10::Half)?

Edit: That's probably it, replacing the input of that function with atomicAdd(__half* address, c10::Half val) makes it compile, but the model outputs gibberish, meaning there should probably be a typecast/conversion from __half to c10::Half somewhere.

iwalton3 commented 1 year ago

Try the version I just pushed @city96

city96 commented 1 year ago

It seems to be working, thank you! @iwalton3

DocShotgun commented 1 year ago

This fork seems to work well for me on 4-bit quantized llama models such as the 4-bit pyg7b and wizardlb7b, and it is significantly faster than the other up-to-date CUDA branch.

However, I can't seem to get it to work with this 4bit quantized gptj model: https://huggingface.co/OccamRazor/pygmalion-6b-gptq-4bit/

I loaded the model using the settings wbits 4, groupsize 128, model_type gptj, and while it doesn't throw any errors on load, it cannot perform inference at all. When performing inference through notebook mode in the webui, it just generates 0 tokens without a visible error. When using it with SillyTavern through the API, it throws a tensor size mismatch error.

When I revert back to the standard GPTQ that comes with the webui one-click installer (https://github.com/oobabooga/GPTQ-for-LLaMa/), the model loads and works as expected with the same settings, which is why I think it is related to the changes made here. Unless it is intended behavior that this GPTQ library does not support gptj.

iwalton3 commented 1 year ago

@DocShotgun Try commenting out these lines and running it: https://github.com/iwalton3/GPTQ-for-LLaMa/blob/59b71733af0f14ba9f3a7a81eb4e08a2cc772264/src/gptq_llama/quant.py#L373-L374

I want to know if it is caused by the faster kernel or something else.

DocShotgun commented 1 year ago

@iwalton3 Commenting out those two lines and rebuilding unfortunately did not fix it.

For reference this is the error that gets spit out during inference:

Traceback (most recent call last):
  File "C:\Users\X\Documents\Oobabooga\text-generation-webui\modules\text_generation.py", line 242, in generate_reply_HF
    output = shared.model.generate(**generate_params)[0]
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1565, in generate
    return self.sample(
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2612, in sample
    outputs = self(
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\transformers\models\gptj\modeling_gptj.py", line 853, in forward
    transformer_outputs = self.transformer(
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\transformers\models\gptj\modeling_gptj.py", line 688, in forward
    outputs = block(
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\transformers\models\gptj\modeling_gptj.py", line 309, in forward
    attn_outputs = self.attn(
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\transformers\models\gptj\modeling_gptj.py", line 213, in forward
    value = self._split_heads(value, self.num_attention_heads, self.head_dim, False)
  File "C:\Users\X\Documents\Oobabooga\installer_files\env\lib\site-packages\transformers\models\gptj\modeling_gptj.py", line 130, in _split_heads
    raise ValueError(f"Input tensor rank should be one of [4, 5], but is: {len(tensor.shape)}")
ValueError: Input tensor rank should be one of [4, 5], but is: 3
Output generated in 0.01 seconds (0.00 tokens/s, 0 tokens, context 816, seed 363872350)

For some reason the error only shows up when I hook into the API with SillyTavern. In oobabooga itself, inference just fails silently and generates 0 tokens.

zslittlehelper commented 1 year ago

@iwalton3 I'm using WSL2 and followed the guide to install iwalton3's branch of GPTQ, but the moment I use monkey-patch, it seems to ignore pre_layer. I've been trying to make a LoRA on a 30B model, and without offloading at least a few layers, I constantly run out of memory. Any suggestions on whether any recent changes may have broken the ability to use both the monkey patch and CPU offloading at the same time?

Zhincore commented 1 year ago

the moment I use monkey-patch, it seems to ignore pre_layer

Did you not have that issue on other branches? I thought that's just the monkey-patch bypassing these features

zslittlehelper commented 1 year ago

Did you not have that issue on other branches? I thought that's just the monkey-patch bypassing these features

I did, but from the sound of the discussion above, the iwalton3 branch did (at least at some point) work with monkey-patch AND pre_layer.

iwalton3 commented 1 year ago

The two never worked together for me. If you enable the monkey patch with my branch, pre_layer is ignored.

zslittlehelper commented 1 year ago

Ah, apologies for misunderstanding. At present, do we know of any way to train LoRAs with GPTQ while offloading parts to the CPU?

GGML training doesn't work at all as it's not supported, and loading the 30B model in 8-bit in RAM is incredibly slow for both inference and training.

Zhincore commented 1 year ago

I think we have to wait until the monkey patch is no longer required, i.e. once LoRA support lands in the upstream libraries.

cebtenzzre commented 1 year ago

This fails to compile for my GTX 970 (compute_52/sm_52). Briefly (duplicate errors omitted):

quant_cuda_kernel.cu(63): error: no suitable conversion function from "__half_raw" to "int" exists
      half tmpres = __hadd(hsum, val);
                           ^

quant_cuda_kernel.cu(706): error: identifier "__hfma2" is undefined
      res2 = __hfma2(__hfma2(deq2[(tmp >> 0) & 0xf][off], scale, zero), blockvec[k + 0], res2);
                     ^

quant_cuda_kernel.cu(989): error: no suitable conversion function from "__half" to "int" exists
      res = __hadd(res, __hadd(res2.x, res2.y));;
                               ^

quant_cuda_kernel.cu(975): error: identifier "__hmul" is undefined
          __hmul(-scale_f, __int2half_rn(((as_unsigned(zeros[g * zero_width + z_w]) >> z_mod) & 0xF) + 1)),
          ^

And here is a more complete portion of the log. Edit: I've been using qwopqwop200's GPTQ instead of the oobabooga one for similar reasons, see #566 and #593.

github-actions[bot] commented 11 months ago

This issue has been closed due to inactivity for 30 days. If you believe it is still relevant, please leave a comment below.