Closed: minipasila closed this issue 1 week ago
Do you have any additional information? I'm not having any trouble quantizing that model here. Would help to know what version you're on, if you're using any custom calibration data, etc.
I tried doing it on RunPod and Colab, and both gave me the same error for some reason. I wonder if it has something to do with the PyTorch version they use? The RunPod template has PyTorch 2.2.0 installed by default.
edit: I'm using the default calibration dataset; at first I tried a 4096 context length and then 2048, and both failed. I'm running it directly from the GitHub repo.
I had been using this notebook for a while without any problems until now. https://colab.research.google.com/drive/1Cbb8nrwUxoxAbsIu1LLotsk2W52nj0Py
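For context, if the repo in question is exllamav2 (an assumption based on the mentions of a default calibration dataset and flash-attn-2), the notebook's quantization step boils down to a convert.py call along these lines; the paths and bitrate below are placeholders, not the exact values used:

```
# Sketch of the quantization run, assuming exllamav2's convert.py;
# paths and the -b bitrate are placeholders.
python convert.py \
    -i /path/to/Viking-7B \
    -o /path/to/workdir \
    -cf /path/to/Viking-7B-exl2 \
    -b 5.0 \
    -l 4096   # calibration row length; 2048 failed the same way
```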
I managed to reproduce it by disabling flash-attn, so it's likely an overflow that happens during attention. I guess it's worth investigating, but in the meantime, is there any way you could run it in an environment that supports flash-attn-2? Like a 3090 on RunPod?
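To illustrate the failure mode: without flash-attn the attention scores are materialized in fp16, and a large enough activation outlier pushes a logit past fp16's maximum finite value (65504). A toy sketch with invented magnitudes, not code from this repo:

```python
# Toy illustration of fp16 overflow in naive attention; the values are
# invented for demonstration and this is not code from the repo.
import torch

head_dim = 64
scale = head_dim ** -0.5

# Activation outliers of roughly this size do occur in some models.
q = torch.full((head_dim,), 280.0, dtype=torch.float16)
k = torch.full((head_dim,), 280.0, dtype=torch.float16)

# Naive fp16 path: each q_i * k_i is 78400 > 65504 (fp16 max), so the
# materialized attention logit becomes inf before softmax ever runs.
print((q * k).sum() * scale)                   # tensor(inf, dtype=torch.float16)

# The same reduction in fp32 (roughly what a fused flash-attn kernel
# keeps internally) stays finite.
print((q.float() * k.float()).sum() * scale)   # tensor(627200.)
```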
An RTX 3090 wasn't available, so I tried a 4090 instead, but that ended up having the same problem.
Did you install flash-attn?
Nope... I was just about to edit my comment to say that I installed it and it now works.
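(For anyone else hitting this: assuming a CUDA toolchain and a matching PyTorch build are already present, flash-attn is typically installed with:)

```
pip install flash-attn --no-build-isolation
```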
Previously I was able to quantize LumiOpen/Viking-7B successfully, but now it seems to be broken for some reason. No idea why it's misbehaving.