qwopqwop200 / GPTQ-for-LLaMa

4 bits quantization of LLaMA using GPTQ
Apache License 2.0

File "<string>", line 21, in matmul_248_kernel #204

Open moophlo opened 1 year ago

moophlo commented 1 year ago

I don't know whether the error I'm getting is related to GPTQ-for-LLaMa, but it's worth a try! This is the scenario:

- Linux Mint (based on Ubuntu focal)
- 2x AMD RX 5700 XT
- the oobabooga_linux one-click installer, even though I know it isn't supported; I'll attach my start_linux.sh and webui.py as they're modified to fit the purpose
- rocm-5.4.2, with torch, torchvision, torchaudio and pytorch-triton-rocm installed
- bitsandbytes installed from https://github.com/agrocylo/bitsandbytes-rocm.git
- and of course the triton branch of this repo!
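In case it helps with triage, this is the kind of quick sanity check that can confirm the ROCm build of PyTorch actually sees both cards before blaming the Triton kernels (a minimal sketch using standard PyTorch attributes; torch.version.hip is None on a CUDA-only build):

import torch

# Reported HIP version tells you whether this is really the ROCm wheel
print("HIP version reported by PyTorch:", torch.version.hip)
print("GPU available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
# Both RX 5700 XT cards should show up here
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))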

No matter which GPTQ model I try to run, I always get the same error:

Gradio HTTP request redirected to localhost :)
/media/muflo/Volume10TB/AI/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
The following models are available:

1. eachadea_ggml-vicuna-7b-1.1
2. eachadea_vicuna-7b-1.1
3. EleutherAI_pythia-6.9b-deduped
4. facebook_opt-1.3b
5. TheBloke_vicuna-7B-GPTQ-4bit-128g
6. TheBloke_vicuna-AlekseyKorshuk-7B-GPTQ-4bit-128g
7. TheYuriLover_llama-13b-pretrained-sft-do2-4bit-128g

Which one do you want to load? 1-7

7

Loading TheYuriLover_llama-13b-pretrained-sft-do2-4bit-128g...
Found the following quantized model: models/TheYuriLover_llama-13b-pretrained-sft-do2-4bit-128g/llama-13b-pretrained-sft-do2-4bit-128g.safetensors
Loading model ...
The safetensors archive passed at models/TheYuriLover_llama-13b-pretrained-sft-do2-4bit-128g/llama-13b-pretrained-sft-do2-4bit-128g.safetensors does not contain metadata. Make sure to save your model with the `save_pretrained` method. Defaulting to 'pt' metadata.
Found 3 unique KN Linear values.
Warming up autotune cache ...
  0%|                                                                                                                                                         | 0/12 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "<string>", line 21, in matmul_248_kernel
KeyError: ('2-.-0-.-0--315d85ef1dfc7d1a34d7a2447be14452-d3d1a24a1000d576219bf70a59497878-acfe913a6e9f1719e65278db70e6193a-b7dcc787308646f52b470de5b5172fda-e1f133f98d04093da2078dfc51c36b72-5b0ee5cc97b40d70a6cbd97988cca4ae-0db1785b8dc43452c61ef6d926ec11bb-72c9e4c5c1a79179d65d3e9ae662c15b', (torch.float16, torch.int32, torch.float16, torch.float16, torch.int32, torch.int32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32'), (16, 256, 32, 8), (True, True, True, True, True, True, (False, True), (True, False), (True, False), (False, False), (False, False), (True, False), (False, True), (True, False), (False, True), (True, False), (False, True), (True, False), (True, False)), 4, 4)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/server.py", line 912, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/modules/models.py", line 127, in load_model
    model = load_quantized(model_name)
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/modules/GPTQ_loader.py", line 173, in load_quantized
    model = load_quant(str(path_to_model), str(pt_path), shared.args.wbits, shared.args.groupsize, shared.args.pre_layer)
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/llama_inference_offload.py", line 221, in load_quant
    quant.autotune_warmup_linear(model)
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/quant_linear.py", line 419, in autotune_warmup_linear
    matmul248(a, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/quant_linear.py", line 267, in matmul248
    matmul_248_kernel[grid](input, qweight, output, scales, qzeros, g_idx, input.shape[0], qweight.shape[1], input.shape[1], bits, maxq, input.stride(0), input.stride(1), qweight.stride(0),
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in run
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 90, in <dictcomp>
    timings = {config: self._bench(*args, config=config, **kwargs) for config in pruned_configs}
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 72, in _bench
    return triton.testing.do_bench(kernel_call, percentiles=(0.5, 0.2, 0.8), rep=40)
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/triton/testing.py", line 146, in do_bench
    fn()
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/text-generation-webui/repositories/GPTQ-for-LLaMa/quant/custom_autotune.py", line 67, in kernel_call
    self.fn.run(*args, num_warps=config.num_warps, num_stages=config.num_stages, **current)
  File "<string>", line 43, in matmul_248_kernel
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/triton/compiler.py", line 1944, in __getattribute__
    self._init_handles()
  File "/media/muflo/Volume10TB/AI/oobabooga_linux/installer_files/env/lib/python3.10/site-packages/triton/compiler.py", line 1930, in _init_handles
    mod, func, n_regs, n_spills = hip_utils.load_binary(self.metadata["name"], self.asm["hsaco_path"], self.shared, device)
SystemError: <built-in function load_binary> returned NULL without setting an exception

Done!

Am I doing something wrong, or am I just trying to do something that is impossible? P.S.: One thing I noticed is that there seems to be a missing closing bracket in the KeyError, immediately after "(16, 256, 32, 8)".

webui.py.txt start_linux.sh.txt