turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Running 3B and 33B sim. on 4090 #159

Closed SinanAkkoyun closed 1 year ago

SinanAkkoyun commented 1 year ago

Hello,

Is there some way to run the OpenLLaMA 3B and LLaMA 33B models simultaneously on one RTX 4090? The VRAM limit seems so close! I am considering buying a 4090; I tested this on RunPod and it sadly failed, since the 3B model by itself took 4 GB according to nvtop.

So: would it be somehow possible to distill the 33B model down slightly to make it work? That would be so important to me. I was thinking about distilling away a couple of not-so-important multilingual paths.

So my main question is whether it is possible to distill the 33B model only slightly, then quantize it and load it up.

I just want to know if it's possible at all; I will do the rest to make it work (or perhaps you have a more elegant idea).

Thank you!

turboderp commented 1 year ago

Distillation is certainly a possibility, but trying to prune away specific abilities of the model you don't need seems like a mouthful.

Most of what I've been doing recently has been experiments with quantization methods. I'm not ready to say too much about ExLlama V2 yet, but I can say I've been getting the most promising results from modifying the GPTQ algorithm to include, among other things, a distillation step. One key observation is that while regular GPTQ treats all linear layers equally, they're very much not equal.

The pipe dream is to get 65B running respectably on a single 24 GB GPU, but that may not be realistic. Still, even getting halfway there would mean 33B could be quantized into a considerably smaller average number of bits per weight, leaving more than enough space for a 3B model to run alongside it.
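For a rough sense of the numbers, here's a back-of-envelope sketch (weights only, illustrative bit widths, ignoring KV cache and CUDA overhead):

```python
# Back-of-envelope VRAM arithmetic: weights only, ignoring KV cache,
# activations and CUDA overhead. The bit widths are illustrative.

def weight_gib(n_params_billion, bits_per_weight):
    """Approximate weight storage in GiB at a given average bit width."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 1024**3

for bpw in (4.0, 3.5, 3.0):
    total = weight_gib(33, bpw) + weight_gib(3, 4.0)  # 33B main model + 4-bit 3B draft
    print(f"33B @ {bpw} bpw + 3B @ 4 bpw: ~{total:.1f} GiB of weights")

# 33B at 4 bpw alone is ~15.4 GiB; dropping the average to ~3.5 bpw frees
# roughly 1.9 GiB, about what the weights of a 4-bit 3B model need.
```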

EyeDeck commented 1 year ago

With your proposed hardware, if it doesn't already fit, it would require either a smaller quantization format (and ExLlama support for that format), a more memory-efficient attention mechanism (converting LLaMA from multi-head attention to grouped-query or multi-query attention, plus ExLlama support), or an actually useful sparsity/pruning method. So it's possible in principle, but in practice it isn't trivially doable just yet.
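To put the GQA/MQA point in perspective, here's a rough sketch of how the KV cache shrinks with fewer key/value heads (a LLaMA-33B-like shape of 60 layers, 52 heads and head_dim 128 is assumed, with an fp16 cache):

```python
# KV-cache size for multi-head vs. grouped-query vs. multi-query attention.
# Shape assumptions: 60 layers, head_dim 128, fp16 cache (LLaMA-33B-like).

def kv_cache_gib(seq_len, n_kv_heads, n_layers=60, head_dim=128, bytes_per_elem=2):
    # K and V tensors per layer, each [n_kv_heads, seq_len, head_dim]
    return 2 * n_layers * n_kv_heads * seq_len * head_dim * bytes_per_elem / 1024**3

for n_kv_heads, label in ((52, "MHA (52 kv heads)"), (8, "GQA (8 kv heads)"), (1, "MQA (1 kv head)")):
    print(f"{label:20s}: ~{kv_cache_gib(2048, n_kv_heads):.2f} GiB at 2048 ctx")
```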

If you happen to already have any Nvidia GPU that's Turing or newer (so 16, 20, 30, 40 series), you could install that alongside a 4090 and run OpenLLaMA 3B on it no problem; and I guess a Pascal (10 series) would probably run fine too even with ExLlama's partial (read: slow) support of that μarch, given 3B's small size.

Or you could just do CPU inference with llama.cpp; 3B should be pretty quick even purely on a CPU.

At least in the US, if cost is at all an issue, you can readily find two 3090s on the second-hand market for the same as or slightly less than the cost of a single 4090, and that would definitely give you enough room for what you want. Performance isn't all that much slower than a single 4090, given the very similar memory bandwidth. That would also open up the possibility of running 65B, and I think there would be enough room left over after 65B (without any RoPE context extension) to fit 3B.
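If you do end up with two GPUs, a minimal sketch of running the two models side by side might look like this (plain HF transformers is used as a stand-in, since the exact ExLlama setup isn't shown here, and the model paths are placeholders):

```python
# Minimal sketch: pin the draft and main models to different GPUs so they can be
# sampled side by side. Plain HF transformers as a stand-in; paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

draft_path = "/models/open_llama_3b"     # hypothetical local paths
main_path = "/models/llama-33b"

tokenizer = AutoTokenizer.from_pretrained(main_path)
draft = AutoModelForCausalLM.from_pretrained(draft_path, torch_dtype=torch.float16).to("cuda:1")
main = AutoModelForCausalLM.from_pretrained(main_path, torch_dtype=torch.float16).to("cuda:0")

# Each model now lives on its own device; inputs just need to be moved to the matching GPU.
ids = tokenizer("Hello", return_tensors="pt")
draft_out = draft.generate(**{k: v.to("cuda:1") for k, v in ids.items()}, max_new_tokens=8)
main_out = main.generate(**{k: v.to("cuda:0") for k, v in ids.items()}, max_new_tokens=8)
```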

SinanAkkoyun commented 1 year ago

Thank you both for your in-depth answers!

@turboderp That sounds like phenomenal news, I am really looking forward to v2, even if the distillation doesn't make its way in.

Could you perhaps briefly guide me through the manual distillation that you are doing? I would really like to experiment with that before the v2 release (and perhaps help you somehow with evaluating performance, if you have any idea of what I should try?)

Thank you a lot as always!

Ph0rk0z commented 1 year ago

You can load the models sequentially, too. Both 3B and 33B can be cached in your RAM and you can just swap them; it's only a few seconds of load time each.
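Roughly, the swap approach looks like this (a plain PyTorch sketch; model loading is omitted):

```python
# Sketch of the swap approach: both models stay resident in CPU RAM and only the
# one currently needed is moved onto the GPU. Plain PyTorch; loading code omitted.
import torch

def swap_in(model_to_use, model_to_evict):
    model_to_evict.to("cpu")          # push the idle model back to system RAM
    torch.cuda.empty_cache()          # release its VRAM allocations
    return model_to_use.to("cuda")    # takes a few seconds for a 33B-class model

# usage (models assumed already loaded on CPU):
# active = swap_in(model_33b, model_3b)
# ... run 33B ...
# active = swap_in(model_3b, model_33b)
```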

SinanAkkoyun commented 1 year ago

@Ph0rk0z Thank you, but that would be too much delay. I need instant access to both all the time for generating speculative drafts and sampling them.

turboderp commented 1 year ago

Could you perhaps briefly guide me through the manual distillation that you are doing? I would really like to experiment with that before the v2 release (and perhaps help you somehow with evaluating performance, if you have any idea of what I should try?)

Guide you through it? I'm not even through it myself, so that's a tall order. :)

But briefly, I'm looking at quantized low-rank approximations of linear layers. This can condense layers down to as little as one bit per original weight (from just a singular value decomposition), but of course the approximations are too crude to be useful. Except (!!) for the first couple of Q, K projections where it's actually as good as full-rank quantization to 4 bits per weight.
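To make that concrete, here's a minimal sketch of the low-rank idea (not the actual ExLlama code; the matrix is a random stand-in, so the error figure is only illustrative):

```python
# Minimal sketch (not the actual ExLlama v2 code): approximate a linear layer's
# weight matrix with a truncated SVD and check the storage cost of the factors.
import torch

W = torch.randn(4096, 4096)                      # random stand-in for a Q/K projection weight
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

r = 128                                          # rank of the approximation
A = U[:, :r] * S[:r]                             # [4096, r]
B = Vh[:r, :]                                    # [r, 4096]
W_approx = A @ B

rel_err = ((W - W_approx).norm() / W.norm()).item()
# Storage of the factors vs. the original, assuming 4-bit quantized factors:
bits_per_orig_weight = (A.numel() + B.numel()) * 4 / W.numel()
print(f"rank {r}: rel. error {rel_err:.3f}, ~{bits_per_orig_weight:.2f} bits per original weight")
```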

So right now I'm experimenting with ways to apply this in conjunction with regular quantization as an error correction term. I'm not sure if it ultimately turns out to be worth it, but it would only be like the seventh dead end so far and I have many more ideas to explore.
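The error-correction variant, sketched the same way (round-to-nearest stands in for GPTQ here; a random weight matrix won't show much benefit, since its quantization error is essentially unstructured noise, but it illustrates the mechanics):

```python
# Sketch of a low-rank error-correction term on top of regular quantization.
# Round-to-nearest stands in for GPTQ; real weights have more structure than this stand-in.
import torch

def rtn_quantize(W, bits=3):
    """Crude symmetric round-to-nearest quantization with a per-row scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax
    return torch.round(W / scale).clamp(-qmax, qmax) * scale

W = torch.randn(4096, 4096)
Wq = rtn_quantize(W, bits=3)

# Low-rank approximation of the quantization error, added back as a correction.
E = W - Wq
U, S, Vh = torch.linalg.svd(E, full_matrices=False)
r = 64
correction = (U[:, :r] * S[:r]) @ Vh[:r, :]

err_plain = ((W - Wq).norm() / W.norm()).item()
err_corrected = ((W - (Wq + correction)).norm() / W.norm()).item()
print(f"3-bit RTN error: {err_plain:.4f} -> with rank-{r} correction: {err_corrected:.4f}")
```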

One thing I've noticed is that the various parts of the model respond very differently to quantization. I've written a new GPTQ quantizer that runs in mixed mode, and this gives about a 5% reduction in overall size with negligible increase in perplexity, at least on the 3b model I'm mostly testing on. But it adds a lot of hyperparameters so there may be much more to gain with some tweaking. Sadly every tweak takes like 15 minutes to try out so it's a bit slow going.
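As an illustration of what a mixed-bit assignment does to the overall size (the mix below is hypothetical, not the quantizer's actual settings; the parameter fractions roughly follow a LLaMA block, where attention is about a third of the weights and the MLP about two thirds):

```python
# Hypothetical mixed-bit assignment and the resulting average bits per weight.
# Fractions roughly follow a LLaMA block: attention ~1/3 of weights, MLP ~2/3.
mix = {
    "attn.q/k": (1/6, 3.0),   # hypothetically more tolerant of a low bit width
    "attn.v/o": (1/6, 4.0),
    "mlp":      (2/3, 4.0),
}

avg_bits = sum(frac * bits for frac, bits in mix.values())
print(f"average: {avg_bits:.2f} bpw, size vs. uniform 4-bit: {avg_bits / 4.0:.1%}")
# -> about 3.83 bpw, i.e. roughly 4% smaller than a uniform 4-bit model
```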

No idea how anyone could help at this stage, though. I'm mostly just throwing ideas together and experimenting. I'll keep it in mind if I need to do some more thorough tests at some point, though.

bartowski1182 commented 1 year ago

Out of curiosity, since you mentioned mixed quants, are you planning on using algorithms similar to OWQ, as proposed by Changhun Lee et al.? The paper on it was fascinating; it may make it possible to drop the quants further if you keep the important columns at fp16, as per the paper.
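For reference, a heavily simplified sketch of the OWQ idea (keep a few sensitive columns in fp16, quantize the rest); the per-channel score below is just the mean squared activation, a placeholder rather than OWQ's actual Hessian-based criterion, and round-to-nearest stands in for the real quantizer:

```python
# Simplified stand-in for the OWQ idea: keep a handful of sensitive columns in fp16
# and quantize everything else. The sensitivity score here (mean squared input
# activation per channel) is a placeholder, not OWQ's actual Hessian-based metric.
import torch

def owq_like_quantize(W, X, bits=3, n_fp16_cols=8):
    # W: [out, in] weight, X: [tokens, in] calibration activations
    sensitivity = X.pow(2).mean(dim=0)                    # per input-channel score
    keep = sensitivity.topk(n_fp16_cols).indices          # columns left at fp16

    qmax = 2 ** (bits - 1) - 1
    scale = W.abs().amax(dim=1, keepdim=True) / qmax
    Wq = torch.round(W / scale).clamp(-qmax, qmax) * scale

    Wq[:, keep] = W[:, keep]                              # restore sensitive columns
    return Wq, keep

W = torch.randn(4096, 4096)
X = torch.randn(512, 4096) * torch.linspace(0.5, 3.0, 4096)  # channels with varying scale
Wq, keep = owq_like_quantize(W, X)
```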

SinanAkkoyun commented 1 year ago

So right now I'm experimenting with ways to apply this in conjunction with regular quantization as an error correction term. I'm not sure if it ultimately turns out to be worth it, but it would only be like the seventh dead end so far and I have many more ideas to explore.

That's great to hear, I am super excited!

One thing I've noticed is that the various parts of the model respond very differently to quantization. I've written a new GPTQ quantizer that runs in mixed mode, and this gives about a 5% reduction in overall size with negligible increase in perplexity, at least on the 3b model I'm mostly testing on. But it adds a lot of hyperparameters so there may be much more to gain with some tweaking. Sadly every tweak takes like 15 minutes to try out so it's a bit slow going.

I see, it blows my mind how you built this repo mainly on your own.

But briefly, [...]

Thank you for the great explanations, they help me a lot! I wish you the best of luck in the further development; I hope this time it's not a dead end after all :)

Ph0rk0z commented 1 year ago

They released the new flash attention; does it reduce memory usage at all?

SinanAkkoyun commented 1 year ago

@Ph0rk0z Do you mean ExLlama released support? I can't find any release notes or commits regarding a flash attention implementation.

Ph0rk0z commented 1 year ago

It would have to be implemented. I'm just asking if it would help here or not.

turboderp commented 1 year ago

I haven't had a chance to look at it yet, or to determine how much of a difference it would make. Keep in mind that Flash Attention 1.0 doesn't make much of a difference in practice either; at least it hasn't so far. With longer contexts it might start to matter.
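For context, a rough sketch of why it mostly matters at long context: a naive attention implementation materializes a [heads, seq, seq] score matrix per layer while processing a prompt, which flash attention avoids (a 33B-like shape of 52 heads and fp16 scores is assumed):

```python
# Rough arithmetic on why flash attention mostly matters at long context:
# naive attention materializes a [heads, seq, seq] score matrix per layer while
# processing a prompt, which flash attention never builds. 52 heads, fp16 assumed.

def score_matrix_gib(seq_len, n_heads=52, bytes_per_elem=2):
    return n_heads * seq_len * seq_len * bytes_per_elem / 1024**3

for ctx in (2048, 8192, 32768):
    print(f"ctx {ctx:6d}: ~{score_matrix_gib(ctx):7.2f} GiB of attention scores per layer")
```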