turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Error quantizing models on recent commit #213

Closed · brucethemoose closed this issue 7 months ago

brucethemoose commented 7 months ago

I am getting an error quantizing a model with a command that worked about 9 days ago, using the same measurement JSON created at that time.

Going to roll back through the commits and try to find the one that broke it.

Command used:

python convert.py --in_dir /home/alpha/Storage/Models/Raw/CapyTessBorosYi-34B-200K-DARE-Ties -o /home/alpha/FastModels/scratch -m /home/alpha/FastModels/capytessborosmes.json --cal_dataset /home/alpha/Documents/medium.parquet -l 2048 -r 200 -ml 2048 -mr 40 -gr 200 -ss 4096 -b 3.1 -hb 6 -cf /home/alpha/FastModels/CapyTessBorosYi-34B-200K-DARE-Ties-exl2-31bpw-fiction -nr
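
For readability, a flag-by-flag gloss of that command (my best reading of convert.py's options at the time; treat the descriptions as approximate, not authoritative):

--in_dir: the unquantized source model
-o: working directory for the quantization job
-m: reuse the measurement JSON from the earlier run (skips the measurement pass)
--cal_dataset: calibration data in parquet format
-l 2048 / -r 200: length and number of calibration rows
-ml 2048 / -mr 40: length and number of measurement rows
-gr 200: rows held on the GPU during quantization
-ss 4096: output shard size (MB)
-b 3.1: target average bits per weight
-hb 6: bits for the output (head) layer
-cf: directory for the final compiled model
-nr: start a fresh job rather than resuming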

Log:

...
 -- Token embeddings again...
 -- Quantizing...
 -- Layer: model.layers.0 (Attention)
Traceback (most recent call last):
  File "/home/alpha/AI/exllamav2/convert.py", line 300, in <module>
    quant(job, save_job, model)
  File "/home/alpha/AI/exllamav2/conversion/quantize.py", line 586, in quant
    outputs = module.forward(x, cache, attn_mask, intermediates = True)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/exllamav2/exllamav2/attn.py", line 247, in forward
    return self.forward_torch(hidden_states, cache, attn_mask, past_len, intermediates, loras = loras, position_offsets = position_offsets)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/alpha/AI/exllamav2/exllamav2/attn.py", line 554, in forward_torch
    ext_c.rope_(query_states, constants.sin, constants.cos, past_len, num_attention_heads, head_dim, offset_tensor)
TypeError: rope_(): incompatible function arguments. The following argument types are supported:
    1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: int, arg4: int, arg5: int) -> None

Invoked with: tensor([...], device='cuda:0', dtype=torch.float16), tensor([...], device='cuda:0', dtype=torch.float16), tensor([...], device='cuda:0', dtype=torch.float16), 0, 56, 128, tensor(..., device='meta', size=(1, 1))

[Full tensor dumps elided: the call passes three float16 CUDA tensors (query_states, constants.sin, constants.cos), the ints 0, 56, 128, and a seventh meta-device offsets tensor.]
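
The TypeError itself points at the cause: the compiled extension still exposes the six-argument prototype for rope_, while the updated Python code passes a seventh argument (the position-offsets tensor). A minimal sketch of the same failure mode, using a stand-in function rather than exllamav2's actual binding:

import torch

def rope_stale(q: torch.Tensor, sin: torch.Tensor, cos: torch.Tensor,
               past_len: int, num_heads: int, head_dim: int) -> None:
    # Stand-in for the previously built C++ binding: six arguments only.
    pass

q = torch.zeros(1, 8, 56, 128, dtype=torch.float16)
sin = torch.zeros(2048, 128, dtype=torch.float16)
cos = torch.ones(2048, 128, dtype=torch.float16)
offsets = torch.zeros(1, 1, dtype=torch.long)  # the new seventh argument

try:
    # The updated attn.py call site also passes position offsets:
    rope_stale(q, sin, cos, 0, 56, 128, offsets)
except TypeError as e:
    print(e)  # rope_stale() takes 6 positional arguments but 7 were given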
brucethemoose commented 7 months ago

Found it, it's 99f6ac30373c29d3fae2bccb846e45497153008d that breaks quantizing.

89885be0feee057d0ac4b29c9b23458ae88328e3 works.

Specifically, this change here?

https://github.com/turboderp/exllamav2/commit/99f6ac30373c29d3fae2bccb846e45497153008d#diff-2429505d6f3b79085069b7c9a692a068d019d7b2e7fb23ff6fdbc89d28038005L542-R555

brucethemoose commented 7 months ago

On a separate note, the new optimization works great. I used to OOM at the very end with this command (and had to go in and edit the GPU flag for the job), but now it completes with the same command :+1:

turboderp commented 7 months ago

I imagine you've got the exllamav2 package installed and have pulled the latest changes without rebuilding the extension. pip uninstall exllamav2 should do it; quanting will then fall back to the JIT-built version of the extension, which has the updated function prototype for rope_.
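
Spelled out, the recovery sequence would look like this (the uninstall command is quoted from above; re-running the identical convert.py invocation is implied):

pip uninstall exllamav2
# Then re-run the same convert.py command unchanged; with the package gone,
# the quantizer JIT-compiles the extension from the checked-out source,
# which has the seven-argument rope_ prototype.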

brucethemoose commented 7 months ago

That sounds like exactly what happened, thanks.