turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

TypeError during exllamav2 model quantization #341

Closed 1PercentSync closed 7 months ago

1PercentSync commented 7 months ago

### Environment

### Issue Description

During the quantization process of a model (https://huggingface.co/CausalLM/7B) using exllamav2, I encountered a TypeError in the make_q_matrix function.

### Steps to Reproduce

1. Converted the model to Hugging Face format with the following code:

   ```python
   from transformers import AutoTokenizer, AutoModelForCausalLM

   access_token = "tokenxxx"
   tokenizer = AutoTokenizer.from_pretrained("CausalLM/7B", token=access_token)
   model = AutoModelForCausalLM.from_pretrained("CausalLM/7B", token=access_token)

   save_directory = "D:/Github/7B/hfc"
   tokenizer.save_pretrained(save_directory)
   model.save_pretrained(save_directory)
   ```


2. Ran the quantization command:

   ```
   python convert.py -i D:\Github\7B\hfc -o D:\Github\7B\exl -cf D:\Github\7B\exlo -b 4.0
   ```


### Expected Behavior
The model should be quantized successfully without any errors.

### Actual Behavior
The quantization process failed with the following error message:

```
 -- Quantizing...
 -- Layer: model.layers.0 (Attention)
 -- Linear: model.layers.0.self_attn.q_proj -> 0.1:5b_32g/0.9:4b_32g s4, 4.23 bpw
Traceback (most recent call last):
  File "D:\Portable Program Files\Exllama\convert.py", line 253, in <module>
    quant(job, save_job, model)
  File "D:\Portable Program Files\Exllama\venv\Lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\Portable Program Files\Exllama\conversion\quantize.py", line 329, in quant
    quant_attn(job, module, hidden_states, target_states, quantizers, cache, attn_params, strat)
  File "D:\Portable Program Files\Exllama\conversion\quantize.py", line 124, in quant_attn
    quant_linear(job, module.q_proj, quantizers["q_proj"], strat["q_proj"])
  File "D:\Portable Program Files\Exllama\conversion\quantize.py", line 80, in quant_linear
    recons_linear.load(recons_dict)
  File "D:\Portable Program Files\Exllama\exllamav2\linear.py", line 55, in load
    self.q_handle = ext.make_q_matrix(w, self.temp_dq)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\Portable Program Files\Exllama\exllamav2\ext.py", line 210, in make_q_matrix
    return ext_c.make_q_matrix(w["q_weight"],
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: make_q_matrix(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: torch.Tensor, arg6: torch.Tensor, arg7: torch.Tensor, arg8: torch.Tensor, arg9: torch.Tensor, arg10: torch.Tensor) -> int

Invoked with: tensor([[ -317331462, 1634642985, -1583586337, ..., -351809069, 380457976, 450481114], [-1958358436, -1563045924, -1464445553, ..., -1705319801, -372846963, -389099899], [ 1064008813, -1026139110, 846028031, ..., 1122602223, 1039037280, 1045457954], ..., [ 2053671334, -962025128, 2089217003, ..., 769383992, -923224007, -373887588], [ 2022178344, -1735944105, 1720110888, ..., 2054450329, -1717007240, -1717007513], [ 1989703833, -1688688506, 1704434615, ..., -1735870346, -1989625178, -1988560219]], device='cuda:0', dtype=torch.int32), tensor([1503, 237, 3489, ..., 3461, 962, 3195], device='cuda:0', dtype=torch.int16), tensor([1117, 4037, 653, ..., 2500, 2814, 2507], device='cuda:0', dtype=torch.int16), tensor([[-1970767208, 1736927334, -1198823514, ..., -2053809785, -1771604359, -1721206939], [ -927163670, -1749386361, -1985373992, ..., 2021025926, 2037881209, -2006353529], [-1716745544, -2021103241, -2004191305, ..., 1986418824, 2019985256, 2002286950], ..., [-1464296743, -1985377913, -1986422889, ..., 2004318377, -1987536775, -1735812454], [-1194628408, 2040101479, -2021099640, ..., 1986422648, 2005370744, 1736931175], [-1465210133, -1951749752, -1968662343, ..., 1753844104, -2021161081, -1988523641]], device='cuda:0', dtype=torch.int32), tensor([1.4985e-04, 1.0347e-04, 1.0413e-04, 9.6381e-05, 9.4593e-05, 7.9751e-05, 9.6023e-05, 8.0705e-05, 7.8022e-05, 1.0461e-04, 8.8155e-05, 9.9063e-05, 9.4473e-05, 1.6248e-04, 1.9956e-04, 2.3210e-04, 1.9240e-04, 1.8895e-04, 1.5152e-04, 1.7917e-04, 1.3471e-04, 1.8966e-04, 1.5247e-04, 2.1207e-04, 1.8322e-04, 1.4448e-04, 1.4055e-04, 1.8167e-04, 1.6332e-04, 1.8728e-04, 1.8990e-04, 1.7083e-04, 1.3828e-04, 1.4675e-04, 1.7118e-04, 1.4520e-04, 1.4150e-04, 1.5342e-04, 1.7571e-04, 1.7285e-04, 1.9681e-04, 1.8334e-04, 1.7738e-04, 1.6725e-04, 1.3447e-04, 1.5736e-04, 1.8930e-04, 1.3983e-04, 1.3816e-04, 1.5295e-04, 1.5652e-04, 1.9932e-04, 1.8632e-04, 2.0337e-04, 1.8811e-04, 1.3268e-04, 1.5378e-04, 1.4925e-04, 1.4031e-04, 1.3614e-04, 1.6689e-04, 1.3101e-04, 1.3638e-04, 1.6689e-04, 1.4853e-04, 1.3840e-04, 1.5950e-04, 1.6057e-04, 1.7571e-04, 1.3435e-04, 1.7631e-04, 1.9944e-04, 1.5175e-04, 1.6367e-04, 1.4675e-04, 1.3506e-04, 1.6403e-04, 1.2612e-04, 1.5211e-04, 1.5163e-04, 1.7297e-04, 1.3137e-04, 1.3852e-04, 1.8013e-04, 1.8096e-04, 1.7011e-04, 1.2207e-04, 1.4293e-04, 1.3673e-04, 1.4770e-04, 1.5461e-04, 1.5700e-04, 1.5223e-04, 1.4746e-04, 1.5974e-04, 1.5044e-04, 1.3125e-04, 1.5330e-04, 1.6236e-04, 1.3566e-04, 1.6785e-04, 1.1849e-04, 1.4615e-04, 1.5748e-04, 1.4389e-04, 1.3697e-04, 1.7250e-04, 1.6499e-04, 1.5664e-04, 1.3304e-04, 1.2720e-04, 1.7822e-04, 1.6141e-04, 1.7834e-04, 1.3244e-04, 1.8013e-04, 1.0818e-04, 1.2743e-04, 1.1837e-04, 1.5020e-04, 1.2362e-04, 1.3781e-04, 1.5330e-04, 1.6141e-04, 1.2648e-04, 1.2624e-04, 1.6856e-04, 1.1939e-04], device='cuda:0', dtype=torch.float16), tensor([ 5, 0, 5, 5, 5, 10, 5, 15, 5, 20, 5, 25, 5, 30, 5, 35, 5, 40, 5, 45, 5, 50, 5, 55, 5, 60, 4, 65, 4, 69, 4, 73, 4, 77, 4, 81, 4, 85, 4, 89, 4, 93, 4, 97, 4, 101, 4, 105, 4, 109, 4, 113, 4, 117, 4, 121, 4, 125, 4, 129, 4, 133, 4, 137, 4, 141, 4, 145, 4, 149, 4, 153, 4, 157, 4, 161, 4, 165, 4, 169, 4, 173, 4, 177, 4, 181, 4, 185, 4, 189, 4, 193, 4, 197, 4, 201, 4, 205, 4, 209, 4, 213, 4, 217, 4, 221, 4, 225, 4, 229, 4, 233, 4, 237, 4, 241, 4, 245, 4, 249, 4, 253, 4, 257, 4, 261, 4, 265, 4, 269, 4, 273, 4, 277, 4, 281, 4, 285, 4, 289, 4, 293, 4, 297, 4, 301, 4, 305, 4, 309, 4, 313, 4, 317, 4, 321, 4, 325, 4, 329, 4, 333, 4, 337, 4, 341, 4, 345, 4, 349, 4, 353, 4, 357, 4, 361, 4, 365, 
4, 369, 4, 373, 4, 377, 4, 381, 4, 385, 4, 389, 4, 393, 4, 397, 4, 401, 4, 405, 4, 409, 4, 413, 4, 417, 4, 421, 4, 425, 4, 429, 4, 433, 4, 437, 4, 441, 4, 445, 4, 449, 4, 453, 4, 457, 4, 461, 4, 465, 4, 469, 4, 473, 4, 477, 4, 481, 4, 485, 4, 489, 4, 493, 4, 497, 4, 501, 4, 505, 4, 509, 4, 513, 4, 517, 4, 521], device='cuda:0', dtype=torch.int16), tensor([ 0, 32, 0, ..., 2, 127, 1], device='cuda:0', dtype=torch.int16), tensor(..., device='meta', size=(1, 1)), tensor(..., device='meta', size=(1, 1)), tensor(..., device='meta', size=(1, 1)), tensor(..., device='meta', size=(1, 1)), tensor([0., 0., 0., ..., 0., 0., 0.], device='cuda:0', dtype=torch.float16)



```

Any help would be appreciated.
turboderp commented 7 months ago

This usually happens when there's a version mismatch between ExLlama's C++ extension and the version of ExLlama you're actually using.

The latest release version, which you appear to have installed, is 0.0.13.post2, and Qwen support was added after that (I'm assuming the model you're trying to convert is a Qwen model). You'll have to build from source or wait for the 0.0.14 release, which should be soon. To build from source:

```
pip uninstall exllamav2
pip install .
```
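
A quick way to confirm which build is actually in use, both before and after reinstalling, is to check the installed package version and the path the import resolves to. This is a minimal diagnostic sketch, not part of the conversion workflow, and it assumes exllamav2 is importable in the active environment:

```python
# Minimal diagnostic sketch (assumes exllamav2 is installed in the active environment):
# print the installed package version and the location the import actually resolves to.
import importlib.metadata
import exllamav2

print(importlib.metadata.version("exllamav2"))  # e.g. 0.0.13.post2 for the latest prebuilt release
print(exllamav2.__file__)                       # path of the package Python is actually importing
```

If the reported version predates support for the architecture being converted, the prebuilt C++ extension won't accept the arguments the newer Python code passes to ext_c.make_q_matrix, which is consistent with the TypeError above.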
1PercentSync commented 7 months ago

> This usually happens when there's a version mismatch between ExLlama's C++ extension and the version of ExLlama you're actually using.
>
> The latest release version, which you appear to have installed, is 0.0.13.post2, and Qwen support was added after that (I'm assuming the model you're trying to convert is a Qwen model). You'll have to build from source or wait for the 0.0.14 release, which should be soon. To build from source:
>
> ```
> pip uninstall exllamav2
> pip install .
> ```

I ran into an error when trying to build from source locally, but I forked the project and used a GitHub Actions workflow to build it instead, and now it's working. Thank you for your response.