turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

QuIP#, New SOTA(?) 2-bit Quantization Method #176

Open brucethemoose opened 9 months ago

brucethemoose commented 9 months ago

Claimed metrics are not bad at all: https://cornell-relaxml.github.io/quip-sharp/

https://github.com/Cornell-RelaxML/quip-sharp

| Model | Method | C4    | Wiki  | ArcC  | ArcE  | BoolQ | PiQA  | WinoGrande |
|-------|--------|-------|-------|-------|-------|-------|-------|------------|
| 2-70B | fp16   | 5.533 | 3.120 | 0.480 | 0.597 | 0.766 | 0.809 | 0.768      |
| 2-70B | QuIP#  | 6.535 | 4.156 | 0.469 | 0.595 | 0.795 | 0.785 | 0.740      |

Is this a candidate for exllamav2 integration? Just glancing at it, I'm not sure the approach would be very compatible.

brucethemoose commented 9 months ago

The weights are only 18GB: https://huggingface.co/relaxml/Llama-2-70b-E8P-2Bit/

That's like exl2 at ~2bpw.
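As a rough sanity check (assuming ~70B quantized parameters and ignoring the tensors kept in higher precision), 18 GB works out to right around 2 bits per weight:

```python
# Back-of-the-envelope only: assumes ~70e9 quantized parameters and ignores
# embeddings/norms that stay in higher precision.
file_bytes = 18e9
params = 70e9
print(f"{file_bytes * 8 / params:.2f} bits per weight")  # ~2.06 bpw
```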

turboderp commented 9 months ago

It's interesting and I am looking into it. But it looks like there's at least a little more to the story.

The total VRAM usage for their 2-bit model is more than a 2.5bpw EXL2 model. I can get up to about 1100 tokens of context with the QuIP# model, 2048 with the EXL2 model. Needs more investigating to figure out what the deal is there.

Inference is also fairly slow. On my 4090, the 70B model is 9 tokens/second with no context, compared to 35 tokens/second for the 2.5bpw EXL2 model. I haven't looked closely at the CUDA code, though, so there might be room for optimization.

brucethemoose commented 9 months ago

Isn't that because it's a more "naive" modification to the Transformers model runner than exllamav2's custom code?

I have only looked briefly as well.

dalistarh commented 9 months ago

QuIP requires each weight matrix to be "rotated" via another matrix multiplication before being quantized, and this rotation needs to be inverted at runtime. This is why the method may have additional runtime overhead.
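A minimal PyTorch sketch of that idea, with a crude uniform quantizer standing in for the real one, just to show where the extra matmuls come from (not QuIP's actual code):

```python
import torch

# Toy sketch: quantize a rotated copy of W, then undo the rotation around the
# matmul at inference time.

def fake_quant(w, bits=2):
    # Crude uniform quantizer standing in for the real quantizer.
    levels = 2 ** (bits - 1)
    scale = w.abs().max() / (levels - 0.5)
    return (w / scale).round().clamp(-levels, levels - 1) * scale

m, n = 64, 64
w = torch.randn(m, n)
u, _ = torch.linalg.qr(torch.randn(m, m))   # random orthogonal "rotations"
v, _ = torch.linalg.qr(torch.randn(n, n))

w_rot_q = fake_quant(u @ w @ v.T)           # what gets stored

x = torch.randn(8, n)
# At runtime the rotation is undone with transposes (orthogonal inverse == transpose),
# which costs extra matmuls per linear layer.
y = x @ v.T @ w_rot_q.T @ u                 # approximates x @ w.T
```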

turboderp commented 9 months ago

Yes. And QuIP# adds codebook encoding on top as well, which is always going to be less efficient than some bit shifting. I'm more surprised by the VRAM overhead. I know Transformers can be a little wasteful, but this still seems like a lot.
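To make the shift-vs-codebook point concrete, a rough illustration (neither project's actual kernels, and a scalar codebook standing in for QuIP#'s lattice codebook): both paths unpack the same 2-bit fields, but the codebook path adds a dependent table lookup per weight.

```python
import torch

packed = torch.randint(0, 2**32, (1024,), dtype=torch.int64)  # 16 two-bit fields per word
shifts = 2 * torch.arange(16)

def dequant_shift(packed, scale):
    vals = (packed.unsqueeze(1) >> shifts) & 0x3      # shift + mask
    return (vals.float() - 1.5) * scale               # recentre and scale

def dequant_codebook(packed, codebook):
    idx = (packed.unsqueeze(1) >> shifts) & 0x3       # same unpack...
    return codebook[idx]                              # ...plus a gather from the table

codebook = torch.tensor([-1.0, -0.3, 0.3, 1.0])
a = dequant_shift(packed, scale=0.7)
b = dequant_codebook(packed, codebook)
```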

I haven't taken the time to really understand the QuIP paper, but apparently by multiplying the weights by random orthogonal matrices they become easier to quantize, and you can generate those matrices at runtime, reproducibly, so you only need to store a seed with the model weights.

The inverse of an orthogonal matrix is just its transpose, so that's cheap, but generating a large, random orthogonal matrix in the first place is time-consuming. So I'm guessing that's done at load time and maybe explains the VRAM overhead? Does the matrix need to be unique for each linear layer, or is there one per matrix shape (of which I guess there are four or five in a Llama model)?
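For a feel of that trade-off, a toy PyTorch sketch (QR is just a generic way to get a reproducible random orthogonal matrix here, not necessarily what QuIP itself does):

```python
import time
import torch

def random_orthogonal(n, seed):
    # Reproducible from a seed, so only the seed would need to be stored with the weights.
    g = torch.Generator().manual_seed(seed)
    q, _ = torch.linalg.qr(torch.randn(n, n, generator=g))
    return q

n = 4096  # in the ballpark of a Llama hidden size; cost grows roughly as n^3
t0 = time.time()
q = random_orthogonal(n, seed=1234)
print(f"generate: {time.time() - t0:.2f} s")       # slow enough that you'd do it at load time

t0 = time.time()
q_inv = q.T                                        # inverse of an orthogonal matrix
print(f"invert:   {time.time() - t0:.6f} s")       # a transpose view, essentially free
print(torch.dist(q @ q_inv, torch.eye(n)).item())  # ~0, sanity check
```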

tsengalb99 commented 9 months ago

Hi - author of QuIP# here - I just saw this thread after someone sent it to me. QuIP# has two main parts: 1) incoherence processing and 2) quantizing to a lattice codebook.

QuIP (original) generates random orthogonal matrices to do the incoherence processing. QuIP# uses a randomized Hadamard transform that requires storing (randomized) sign vectors for each dimension of a weight matrix and a small fixed Hadamard matrix for non-power-of-two weight matrices. The sign vectors result in an additional < 0.01 bits per weight. The fixed Hadamard matrices are generally quite small. For example, there are two such matrices in 70B: one for up/gate/down (28672 -> 28x28 Had matrix) and one for our fused q/k/v (10240 -> 20x20 Had matrix). Right now we're storing a copy per layer, which is not the most efficient thing to do, but they're so small (20x20, 28x28) that this is almost certainly not where the memory overhead is coming from. For 70B, there should only be about 189KB of overhead from storing these matrices in fp16 ((20x20 + 28x28)*2*80).
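If I'm reading that right, the transform looks roughly like this dense toy version, where `hadamard` is SciPy's power-of-two constructor and the small 8x8 factor is a stand-in for the stored 28x28/20x20 matrices (the real kernels use a fast transform rather than a dense matmul):

```python
import torch
from scipy.linalg import hadamard

# Flip signs with a stored +/-1 vector, then apply a Hadamard transform built as
# a Kronecker product of a power-of-two Hadamard and a small stored factor.
# In 70B the up/gate/down dimension 28672 factors as 1024 * 28.

def randomized_hadamard(x, signs, h_small):
    n = x.shape[-1]
    k = n // h_small.shape[0]
    h = torch.kron(torch.from_numpy(hadamard(k)).float(), h_small) / (n ** 0.5)
    return (x * signs) @ h          # orthogonal, so the inverse is the transpose

n, m = 128, 8                       # toy sizes: 128 = 16 * 8
h_small = torch.from_numpy(hadamard(m)).float()
signs = torch.where(torch.rand(n) < 0.5, -1.0, 1.0)   # what gets stored per dimension
x = torch.randn(4, n)
y = randomized_hadamard(x, signs, h_small)
```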

For speed, the main bottlenecks are kernel launching overhead and the decode+matmul part of the forward pass. We've mostly minimized kernel launching overhead by using CUDA graphs, but CUDA graphs aren't compatible with all parts of the huggingface code. If you were playing around with interactive_gen.py, which doesn't use CUDA graphs, QuIP# would appear slower than it actually is. We're actively working on making both the Hadamard and matmul kernels faster, and we actually just merged in a new matmul kernel that's quite a bit faster than before.
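For anyone unfamiliar with the CUDA graphs point, the generic PyTorch capture/replay pattern looks something like this (a sketch, not the QuIP# code): one decode step is recorded once and then replayed, so the whole step launches as a single graph instead of paying per-kernel launch overhead.

```python
import torch

assert torch.cuda.is_available()

step = torch.nn.Linear(4096, 4096).cuda().eval()        # stand-in for one decode step
static_in = torch.zeros(1, 4096, device="cuda")
static_out = torch.empty(1, 4096, device="cuda")

# Warm-up on a side stream before capture, per the PyTorch docs.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s), torch.no_grad():
    for _ in range(3):
        static_out.copy_(step(static_in))
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g), torch.no_grad():
    static_out.copy_(step(static_in))

# Replay: refill the static input buffer, then launch the captured graph at once.
static_in.copy_(torch.randn(1, 4096, device="cuda"))
g.replay()
result = static_out.clone()
```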

> On my 4090, the 70B model is 9 tokens/second with no context, compared to 35 tokens/second for the 2.5bpw EXL2 model. I haven't looked closely at the CUDA code, though, so there might be room for optimization.

There definitely is room for optimization, and this is something we're actively working on. We just merged in a new matmul kernel that's significantly faster than before.

> I'm more surprised by the VRAM overhead. I know Transformers can be a little wasteful, but this still seems like a lot.

We're not doing anything special that should cause that much additional memory usage. This is probably an artifact of HF, but we will look into it.

> Does the matrix need to be unique for each linear layer, or is there one per matrix shape (of which I guess there are four or five in a Llama model)?

The stored matrix is actually fixed for each "base" matrix size. 70B uses the 20x20 and 28x28 Hadamard matrices, which are very small to store.

Beyond this, we recently uploaded 2-bit versions of the chat models at https://huggingface.co/relaxml. We encourage you to play around with those on the latest version of QuIP#, which is faster than it was a week ago.

Feel free to open questions/issues on the QuIP# repo so we don't miss any discussions about it.