turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Error when trying to quantize Viking-7B #459

Closed: minipasila closed this issue 1 week ago

minipasila commented 1 month ago

Previously I was able to quantize LumiOpen/Viking-7B successfully, but now it seems to be broken for some reason. No idea why it's misbehaving.

-- Layer: model.layers.1 (Attention)
!! Warning: Applied additional damping
!! Warning: Applied additional damping
!! Warning: Applied additional damping
!! Warning: Applied additional damping
!! Warning: Applied additional damping
!! Warning: Applied additional damping
!! Warning: Applied additional damping
!! Warning: Applied additional damping
!! Warning: Applied additional damping
!! Warning: Applied additional damping
Traceback (most recent call last):
  File "/content/exllamav2/conversion/adaptivegptq.py", line 292, in prepare
    hessian_inv = torch.linalg.cholesky(hessian)
torch._C._LinAlgError: linalg.cholesky: The factorization could not be completed because the input is not positive-definite (the leading minor of order 1 is not positive-definite).

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/exllamav2/convert.py", line 240, in <module>
    status = measure_quant(job, save_job, model)  # capturing the graceful exits
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/content/exllamav2/conversion/measure.py", line 560, in measure_quant
    m = measure_attn(module, hidden_states, target_states, quantizers, cache, attn_params)
  File "/content/exllamav2/conversion/measure.py", line 145, in measure_attn
    quantizers["o_proj"].prepare()
  File "/content/exllamav2/conversion/adaptivegptq.py", line 330, in prepare
    raise ValueError("Hessian is not invertible")
ValueError: Hessian is not invertible
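
For context on the warnings and the final error: GPTQ-style quantizers Cholesky-factorize the calibration Hessian and, when that fails, add extra diagonal damping and retry a few times before giving up. The sketch below shows only that general scheme; it is not the actual adaptivegptq.py code, and `cholesky_with_damping`, `attempts`, and `damp_frac` are illustrative names.

```python
import torch

def cholesky_with_damping(hessian: torch.Tensor, attempts: int = 10,
                          damp_frac: float = 0.01) -> torch.Tensor:
    """Try to Cholesky-factorize `hessian`, adding diagonal damping on failure."""
    damp = damp_frac * hessian.diagonal().mean()
    eye = torch.eye(hessian.shape[0], device=hessian.device, dtype=hessian.dtype)
    for _ in range(attempts):
        try:
            return torch.linalg.cholesky(hessian)
        except torch.linalg.LinAlgError:
            print("!! Warning: Applied additional damping")
            hessian = hessian + damp * eye
    raise ValueError("Hessian is not invertible")
```

If the calibration forward pass produced inf/NaN activations, the accumulated Hessian contains NaNs, and no amount of damping will make it positive-definite, which matches the failure above.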
turboderp commented 1 month ago

Do you have any additional information? I'm not having any trouble quantizing that model here. Would help to know what version you're on, if you're using any custom calibration data, etc.

minipasila commented 1 month ago

I tried doing it on RunPod and on Colab, and both gave me the same error for some reason. I wonder if it has something to do with the PyTorch version they use? The RunPod template has PyTorch 2.2.0 installed by default.

edit: I'm using the default calibration dataset. At first I tried a 4096 context length and then 2048, and both failed. I'm running it directly from the GitHub repo.

minipasila commented 1 month ago

I had been using this notebook for a while without any problems until now. https://colab.research.google.com/drive/1Cbb8nrwUxoxAbsIu1LLotsk2W52nj0Py

turboderp commented 1 month ago

I managed to reproduce it by disabling flash-attn, so it's likely an overflow that happens during attention. It's worth investigating, but in the meantime, is there any way you could run it in an environment that supports flash-attn-2? Like a 3090 on RunPod?
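
To illustrate the suspected mechanism (this is not exllamav2's attention code, just a minimal sketch): fp16 tops out around 65504, and a single overflowed score poisons the whole softmax row, which then feeds non-finite values into the calibration statistics.

```python
import torch

# fp16 saturates to inf above ~65504
big = torch.tensor([60000.0], dtype=torch.float16)
print(big * 2)  # tensor([inf], dtype=torch.float16)

# one inf among the attention scores makes the whole softmax row non-finite
scores = torch.tensor([1.0, 2.0, float("inf")])
print(torch.softmax(scores, dim=-1))  # NaNs appear in the output
```

flash-attn-2 computes the same attention with a tiled, online softmax and higher-precision accumulation, so the full fp16 score matrix is never materialized and the overflow doesn't occur.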

minipasila commented 1 month ago

An RTX 3090 wasn't available, so I tried a 4090 instead, but that ended up having the same problem.

turboderp commented 1 month ago

Did you install flash-attn?

minipasila commented 1 month ago

> Did you install flash-attn?

Nope... I was just about to edit my comment to say that I installed it, and it now works.
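
For anyone else landing here, a quick sanity check that the environment convert.py runs in can actually see flash-attn (exllamav2's own detection logic may differ; this only tests that the package imports):

```python
try:
    import flash_attn
    print("flash-attn", flash_attn.__version__, "is available")
except ImportError:
    print("flash-attn not installed; try `pip install flash-attn` "
          "(requires a CUDA build environment)")
```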