Open IllIIllIlllIIl opened 3 weeks ago
I appreciate you going through the steps to troubleshoot, but there are cases where models simply can't be quantized with FP16 precision because quantization requires running a reference forward pass. If that overflows because the model isn't properly normalized, there isn't much I can immediately do about it.
ExLlama isn't really written to accept an arbitrary set of weights that isn't produced by training/finetuning. It makes certain assumptions about what language models are and how they work, and the sad truth is it doesn't take very long to cut a model up and stick it back together randomly, breaking any of those assumptions in the process and producing something that would take me hours or days to account for. There's simply no way for me to keep up.
Thank you for taking the time to explain. I have successfully quantized to 5.3 BPW on my old computer using an old version of exllamav2. (Unfortunately, I can’t remember which version it was, and I didn’t save the old quantized files, so I can’t look it up in the config.) So I would like to ask: if 8 BPW quantization is not feasible, would something like 7.999 be feasible, theoretically speaking?
It's not that specifically 8 bpw is a problem. It's something that goes wrong during inference on the unquantized model, resulting in inf values in the hidden state, most likely because the model isn't normalized. The error you showed happened during measurement, so the target bitrate doesn't even factor into it at that point.
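To illustrate how an FP16 hidden state can end up as inf regardless of the target bitrate, here is a minimal toy sketch. It is not ExLlama code: `to_fp16` and `first_overflow_layer` are hypothetical helpers, and each "layer" is simply a multiplication by a fixed gain, standing in for a model whose activations grow because they are not normalized.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest FP16 value; out-of-range values become +/-inf."""
    if math.isnan(x) or math.isinf(x):
        return x
    try:
        # "e" is the struct format code for IEEE 754 half precision.
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct rejects values beyond the FP16 range (max finite value 65504);
        # FP16 arithmetic on hardware rounds such values to infinity instead.
        return math.copysign(math.inf, x)


def first_overflow_layer(h0: float, gain: float, num_layers: int):
    """Return the 1-based index of the first layer whose output is inf, or None."""
    h = to_fp16(h0)
    for layer in range(1, num_layers + 1):
        # Toy "layer": scale the activation, then store the result in FP16.
        h = to_fp16(h * gain)
        if math.isinf(h):
            return layer
    return None


if __name__ == "__main__":
    print(first_overflow_layer(1.0, 1.0, 32))  # stable activations
    print(first_overflow_layer(1.0, 2.0, 32))  # activations double every layer
```

With a gain of 1.0 the value never overflows and the function returns None. With a gain of 2.0 the value reaches 65536 at layer 16, which is beyond the FP16 range, so it becomes inf there. Nothing in this depends on how many bits the quantized weights would use.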
If a merge ever works it works more or less by accident, not because the methodology is sound. The results are always going to be unpredictable. I can't say exactly what the problem is in this case, and I don't even have a place to start. It's like your car isn't starting after you doubled the number of cylinders by cutting the engine in half and sticking another engine in the middle. The mechanic is just going to shrug and say, yeah, that's not how you get a more powerful engine.
I guess, some things you can try:
- Add -fst as an argument to the converter. It'll bypass the safetensors library, which has been causing a number of issues recently.
- Try the dev branch, which is slightly ahead.
Thanks for the detailed explanation! I'm currently testing 0.2.0 with -fst. But compared with before, conversion is extremely slow for every model, with or without the -fst flag. Can this be solved?
Hi!
I am in the same situation with another model.
Would it be possible to have a flag to skip the calibration of those layers that hit an inference error (so if there's an inference error, they are quantized in a blind default mode)? That way we could get all the benefits of the exl2 format even for these Frankenstein merge models (some of them are very interesting).
Thanks for your great work!!!
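The request above can be sketched as a simple fallback. This is only an illustration of the idea, not exllamav2's API and not the code in any fork: `plan_layers`, the `calibrate` callback, and `default_bits` are all hypothetical names, and the hidden states are plain lists of floats.

```python
import math
from typing import Callable, Dict, List, Sequence


def plan_layers(
    layer_outputs: Sequence[Sequence[float]],
    calibrate: Callable[[Sequence[float]], float],
    default_bits: float,
) -> List[Dict[str, object]]:
    """Pick a bitrate per layer, falling back to a default when calibration is impossible."""
    plan = []
    for idx, hidden in enumerate(layer_outputs):
        if all(math.isfinite(v) for v in hidden):
            # Hidden state is usable: let the (hypothetical) calibration pick the bitrate.
            plan.append({"layer": idx, "bits": calibrate(hidden), "calibrated": True})
        else:
            # inf/NaN in the hidden state: skip calibration and use the blind default.
            plan.append({"layer": idx, "bits": default_bits, "calibrated": False})
    return plan


if __name__ == "__main__":
    outputs = [[0.5, -1.0], [math.inf, 2.0], [3.0, 4.0]]
    for entry in plan_layers(outputs, calibrate=lambda h: 4.0, default_bits=8.0):
        print(entry)
```

In a real forward pass a non-finite hidden state would normally carry over into the following layers, so a practical version would also need to decide what to feed the layers after the failing one.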
I have made a fork to address this issue: https://github.com/PedroPareja/exllamav2
I have opened a pull request in case my changes are helpful.
Trying to make an EXL2 quantization of Twilight-Miqu-146B using oobabooga/text-generation-webui version 1.14 always fails with an error at layer 169. I get the same error using the following two commands:
cd exllamav2-0.1.9
python convert.py -i ../models/Twilight-Miqu-146B -o CCCW -cf ../models/Twilight-Miqu-146B-EXL2-RPCAL -c PIPPA-cleaned/pippa_raw_fix.parquet -b 8 -hb 8 -nr
cd exllamav2-0.1.9
python convert.py -i ../models/Twilight-Miqu-146B -o CCCW -cf ../models/Twilight-Miqu-146B-EXL2 -b 8 -hb 8 -nr
I also tried updating exllamav2 to version 0.1.9 and still got the same error. I also tried re-downloading the original model files, but that didn’t work either.
Environment: Windows 10, AMD Ryzen Threadripper 7960X, 256 GB RAM, 2x RTX 4090 and 3x RTX A6000, Python 3.11.9
I have only a superficial understanding of the code. Please help me. Thank you.