# turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

## Measurement/inference error (3): hidden_states #600

Open IllIIllIlllIIl opened 3 weeks ago

IllIIllIlllIIl commented 3 weeks ago

Trying to make an EXL2 quantization of Twilight-Miqu-146B using oobabooga/text-generation-webui version 1.14 always fails with an error at layer 169. I get the same error with both of the following commands:

```
cd exllamav2-0.1.9
python convert.py -i ../models/Twilight-Miqu-146B -o CCCW -cf ../models/Twilight-Miqu-146B-EXL2-RPCAL -c PIPPA-cleaned/pippa_raw_fix.parquet -b 8 -hb 8 -nr
```

```
cd exllamav2-0.1.9
python convert.py -i ../models/Twilight-Miqu-146B -o CCCW -cf ../models/Twilight-Miqu-146B-EXL2 -b 8 -hb 8 -nr
```

I also tried updating exllamav2 to version 0.1.9 and still got the same error, and re-downloading the original model files didn't help either.

(screenshot of the error)

Environment: Windows 10, AMD Ryzen Threadripper 7960X, 256 GB RAM, 2x RTX 4090 & 3x RTX A6000, Python 3.11.9

I don't understand programming at all and only have a superficial grasp of what's going on here. Please help me. Thank you.

turboderp commented 2 weeks ago

I appreciate you going through the steps to troubleshoot, but there are cases where models simply can't be quantized with FP16 precision because quantization requires running a reference forward pass. If that overflows because the model isn't properly normalized, there isn't much I can immediately do about it.
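To illustrate the failure mode turboderp describes, here is a minimal stdlib sketch (not exllamav2's actual code): FP16 can only represent finite magnitudes up to about 65504, so activations in an un-normalized model can grow past that limit during the reference forward pass. Python's `struct` `"e"` (half-precision) format raises `OverflowError` at that boundary, where GPU FP16 arithmetic would instead silently produce `inf` in the hidden state.

```python
import struct

def to_fp16(x: float) -> float:
    """Round-trip a Python float through IEEE-754 half precision (binary16)."""
    return struct.unpack("e", struct.pack("e", x))[0]

# A well-behaved activation survives the FP16 round-trip exactly:
print(to_fp16(120.0))

# An activation past FP16's largest finite value (~65504) cannot be
# represented; struct raises here, while FP16 hardware would yield inf,
# which is what the measurement pass then trips over.
try:
    to_fp16(70000.0)
except OverflowError:
    print("overflowed FP16")
```

This is why the failure depends on the model's weights rather than on anything the conversion script does with them.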

ExLlama isn't really written to accept an arbitrary set of weights that isn't produced by training/finetuning. It makes certain assumptions about what language models are and how they work, and the sad truth is it doesn't take very long to cut a model up and stick it back together randomly, breaking any of those assumptions in the process and producing something that would take me hours or days to account for. There's simply no way for me to keep up.

IllIIllIlllIIl commented 2 weeks ago

Thank you for taking the time to explain. I did successfully quantize to 5.3 bpw on my old computer using an old version of exllamav2. (Unfortunately, I can't remember which version it was, and I didn't save the old quantization files, so I can't check the config.) So I'd like to ask: if 8 bpw quantization isn't feasible, would something like 7.999 theoretically be feasible?

turboderp commented 2 weeks ago

It's not that specifically 8 bpw is a problem. It's something that goes wrong during inference on the unquantized model, resulting in inf values in the hidden state, most likely because the model isn't normalized. The error you showed happened during measurement so the target bitrate doesn't even factor into it at that point.
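To make this concrete, here is a toy two-phase sketch (the names and logic are illustrative, not exllamav2's actual pipeline) of why the target bitrate cannot be the cause: the measurement phase runs a forward pass over the unquantized weights, and the `-b` value is only consulted in the later quantization phase.

```python
import math

def measure(weights, calibration):
    """Phase 1: run the unquantized weights over calibration input and
    record per-layer activation magnitudes. No bitrate involved here;
    this is where 'inference error: hidden_states' would fire."""
    stats = []
    for w in weights:
        act = w * calibration
        if math.isinf(act) or math.isnan(act):
            raise RuntimeError("Measurement/inference error: hidden_states")
        stats.append(abs(act))
    return stats

def quantize(weights, stats, target_bpw):
    """Phase 2: snap each weight to a grid whose size depends on the
    target bits per weight. This is the first place target_bpw is used."""
    levels = 2 ** round(target_bpw)
    return [round(w / s * (levels - 1)) for w, s in zip(weights, stats)]

weights = [0.5, -1.25, 2.0]
stats = measure(weights, 3.0)        # succeeds or fails regardless of bpw
quant = quantize(weights, stats, 8)  # bpw only matters from here on
print(quant)
```

So choosing 7.999 instead of 8 changes nothing: if phase 1 overflows, phase 2 is never reached.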

If a merge ever works it works more or less by accident, not because the methodology is sound. The results are always going to be unpredictable. I can't say exactly what the problem is in this case, and I don't even have a place to start. It's like your car isn't starting after you doubled the number of cylinders by cutting the engine in half and sticking another engine in the middle. The mechanic is just going to shrug and say, yeah, that's not how you get a more powerful engine.

I guess, some things you can try:

IllIIllIlllIIl commented 2 weeks ago

Thanks for the detailed explanation! I'm currently testing 0.2.0 with -fst, but compared with before, conversion is extremely slow for every model, with or without the -fst flag. Can this be solved?

(screenshot of the timing)

PedroPareja commented 1 week ago

Hi!

I am in the same situation with another model: (screenshot)

Would it be possible to have a flag to skip calibration for layers that hit an inference error (so that on an inference error they are quantized in a blind default mode)? That way we could get all the benefits of the EXL2 format even for these Frankenstein merge models (some of them are very interesting).
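The fallback being proposed could look roughly like this hypothetical sketch (the function names and the `DEFAULT_STATS` placeholder are illustrative, not the actual exllamav2 API): if a layer's measurement pass produces inf/NaN, substitute default quantization parameters for that layer instead of aborting the whole conversion.

```python
import math

DEFAULT_STATS = 1.0  # illustrative "blind" per-layer scale

def measure_layer(weight, calibration):
    """Toy per-layer measurement; raises on non-finite activations."""
    act = weight * calibration
    if math.isinf(act) or math.isnan(act):
        raise RuntimeError("inference error: hidden_states")
    return abs(act)

def measure_with_fallback(weights, calibration, skip_bad_layers=True):
    """Measure each layer, falling back to blind defaults on failure."""
    stats = []
    for i, w in enumerate(weights):
        try:
            stats.append(measure_layer(w, calibration))
        except RuntimeError:
            if not skip_bad_layers:
                raise
            print(f"layer {i}: measurement failed, using blind defaults")
            stats.append(DEFAULT_STATS)
    return stats

# A layer whose activation overflows (simulated with inf) no longer
# aborts the conversion; it just gets the default scale.
stats = measure_with_fallback([0.5, float("inf"), 2.0], 3.0)
print(stats)
```

The trade-off is that the blindly quantized layers lose the accuracy benefit of calibration, but the rest of the model still gets a proper measurement.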

Thanks for your great work!!!

PedroPareja commented 1 week ago

I have made a fork to address this issue: https://github.com/PedroPareja/exllamav2

I opened a pull request in case my changes are helpful.