turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

Dumb quantize/selective recompile/recapitation? #516

Open tau0-deltav opened 1 week ago

tau0-deltav commented 1 week ago

tl;dr: could we get a way to bypass calibration/measurement and save a 'calibration.json'? Not to produce better models so much as to patch/hack existing ones.

Does this belong in issues? I think at least as a feature request it does?

Recapitation - head replacement :)

Hi, I recently found I could do two convenient things with simple changes to QParams.py, despite not actually being able to code (any more) or understand much of the maths involved here. One basically involves forcing the quantizer to make fewer decisions and just follow an implicit recipe instead. The other is enabling higher precision for what might be a particularly sensitive structure.

  1. Force quick 6-8bpw quants (naive llama.cpp style, I guess?) by commenting out all but the last one or two entries of qparams_attn and qparams_mlp (see the sketch after this list). This took quantization down to 2 minutes and produced ~the same model that -b 8 would have for a Llama 3 8B. It also struck me that just pruning the list down a bit, rather than truncating it, would still hugely speed up quantization, at the expense of memory economy rather than risking underestimating a layer's importance. A lazy quant mode with some preset "recipes" would be nice for this, just to get an exl2 model spun up ASAP. Strictly speaking this could be done with almost no memory at all - the likes of a VPS spring to mind - though that's not my use case and I've never had a quantization issue that swap couldn't solve.
  2. For qparams_headoptions...
    ...
    8: QParams(128, [8], [1.0], 4),
    9: QParams(32, [8], [1.0], 4),

    -hb 9 works; not tested beyond that. My hunch is that if hb 6 is measurably worse than hb 8 for Llama 3 (it is: after 3 tokens, divergence is significantly more likely), then an hb "8.2" option might help even more. Should I also try scale bits other than 4 when I do test properly?
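
For concreteness, here's roughly what the point-1 hack looks like - just a sketch, assuming qparams_attn and qparams_mlp are plain lists of candidate options in QParams.py (the real entries there may differ), with the point-2 head option tacked on:

```python
# Sketch only: instead of commenting out entries, slice the candidate
# lists so the optimizer only ever sees the highest-precision options.
qparams_attn = qparams_attn[-2:]
qparams_mlp = qparams_mlp[-2:]

# Point 2: an extra head option with a smaller group size, which can
# then be selected on the command line with -hb 9.
qparams_headoptions[9] = QParams(32, [8], [1.0], 4)
```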

I tested the head-bit accuracy of hb 6 vs hb 8 by reusing the -o output directory from one quant to 'resume' it two layers prior, modifying the job_new.json (the manual mentions this, and it really is useful). But this is fiddly: it requires keeping the job around and knowing in advance that you want to experiment.

I'd like a way to drop in a new lm_head on an existing model without having to go through the measure/calibration process. Bonus points if it doesn't require the whole model, just the files containing the original tensor(s) (the 'abliterated' 70B model makes a good case study of this being useful for other layers too). I describe a more general-purpose extension of this below.

I think a lot of model hackers would have a use for a tool which lets you not only mix and match (the original model's) tensors (which I assume I could do with just safetensors* if I wanted?) but actually generate them too. I'm imagining "recipes" like this could produce an out_tensor directory ready to compile, just by listing tensor name/QParams pairs in a .json or something. It'd be nice if this could include options for just copying (or even de-quantizing - at your own peril?) floating-point tensors.
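
Purely to illustrate what I mean (none of this exists today), a recipe could be as simple as a mapping from tensor names to QParams values, with an option to copy some tensors through untouched - names and structure below are made up:

```python
# Hypothetical recipe format - nothing like this exists in convert.py.
# Map each tensor name to either a QParams spec to quantize it with,
# or "copy" to pass the original floating-point tensor through as-is.
recipe = {
    "lm_head": {"qparams": [32, [8], [1.0], 4]},  # i.e. the new "-hb 9" option
    "model.embed_tokens": "copy",
}
```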

Maybe much of this is already possible? How much would I need to change to implement this myself? Maybe VSCode would help me follow the execution of convert.py where Kate hasn't?

Now I've written this out, it looks like a lot of work for something I could turn into a pull request instead of a beg, given a few weeks of practice :sweat_smile: What would be easy/hard about hacking something like this together myself?

Thank you so much for your hard work, turboderp, and also: curse you for figuring out that -Ofast does actually work for compiling exllamav2. Now my build isn't faster than everyone else's.

*As opposed to the extra tensors that exl2 breaks simple blocks of floats down into in order to quantize them - no intention of fiddling with those.

turboderp commented 1 week ago

> Hi, I recently found I could do two convenient things with simple changes to QParams.py despite not actually being able to code (any more) or understand much of the maths involved here.

You can do this, but please don't. I would really prefer not to have to go on a whole quest to try to explain to people why they shouldn't do this.

You say "at your own peril" but that's not how these things work out in practice. I already made a big mistake exposing the calibration dataset as a parameter, and now I regularly have to spend time explaining to people that calibration is not finetuning, and whenever people complain about the quality I have to spend time investigating if they're actually using an "rpcal" model that someone pushed to HF and described as "better at RP" or whatever. Of course most people don't complain, they just get a bad first impression and lose interest long before considering that they might have come across a broken quant.

For the specific case of 8bpw models, yes, you will get the same result whether you measure first or force the highest QParams option. This could change in the future if I add any > 8bpw layer options, but it's a very niche case either way because precision really doesn't improve noticeably after 6bpw. In fact at one point asking for an 8bpw model would often give you a ~6bpw model because the optimizer couldn't find enough layers that would benefit at all from being stored in maximum precision. Now, it just essentially pads the model with useless extra precision because too many people assume it's a bug when their 8bpw version isn't larger than the 7bpw version.

That's really what it comes down to: communication. I've recently had people suggesting that FP16 output layers are "required" for some models, I think because of some bug or odd design choice in llama.cpp (?) which manifests with Phi3 specifically, and while I could very easily accommodate these people by adding a 16 bit head option, what I can't easily do is communicate what the consequences of turning it on would be, least of all to people who didn't turn the option on themselves but just downloaded an EXL2 model converted by someone else, which then ends up consuming 4 GB of extra VRAM for no good reason.

As for pruning the list of layer options, yes, that would indeed speed up measurement. It would also negatively impact quality by reducing the granularity at which the optimizer works. If you did this and converted to anything lower than 8bpw you would get a worse quantized model out of it, and once again you'd have the problem of communication. How should a model converted in this way be tagged so people know what they're getting? Should the framework emit a warning every time one of these models is loaded? How many bug reports would I have to respond to when people start seeing that warning pop up all the time?

And you don't have to measure again to change the head layer. You do need to quantize again, although you could do as you did and just replace the output tensor before compiling. Alternatively you can technically set the progress to "quantize" and reduce "q_last_module_idx" by a few to revert the job to right before it quantized the head layer, then resume from there. Kinda dodgy but if you're careful to pick the idx of the last checkpoint (since the calibration state from the previous layer is required in order to quantize correctly) it should work fine.
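
Roughly, that manual edit amounts to something like this - just a sketch; back up the job file first, and pick an idx that lands on a checkpoint:

```python
import json

# Sketch of rewinding a conversion job so it re-quantizes the last few
# modules (including the head) when resumed.
job_path = "/path/to/out_dir/job_new.json"  # the -o directory

with open(job_path, "r") as f:
    job = json.load(f)

job["progress"] = "quantize"   # drop back to the quantization stage
job["q_last_module_idx"] -= 3  # rewind a few modules; choose an idx that
                               # coincides with the last checkpoint

with open(job_path, "w") as f:
    json.dump(job, f, indent = 4)
```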

The problem with turning this into a feature would be designing an interface for it. And it comes at a cost of complexity, extra points of failure and maintenance debt, so there needs to be a really compelling case for it. Or someone willing to take on the responsibility of maintaining the feature indefinitely, because I'm already spread way too thin.