turboderp / exllama

A more memory-efficient rewrite of the HF transformers implementation of Llama for use with quantized weights.
MIT License

Speed on A100 #266

Open Ber666 opened 10 months ago

Ber666 commented 10 months ago

Hi, thanks for the cool project. I am testing Llama-2-70B-GPTQ on a single A100 40G, and the speed is around 9 t/s.


Is this the expected speed? I noticed in some other issues that the code is only optimized for consumer GPUs, but I just wanted to double-check whether this is the expected speed or I've made a mistake somewhere.

turboderp commented 10 months ago

I haven't tested 70B on A100 before, but the speed is close to what I've seen for 65B on A100, so I think this is about expected, yes.

jday96314 commented 10 months ago

To give you another data point: with 70B I get 10-13 t/s on an A100 80 GB (SXM4).

akaikite commented 9 months ago

I can't believe that the A100 gets about the same speed as the 3090. Maybe something can be improved here?

turboderp commented 9 months ago

There's definitely some room for improvement, but you're not going to see anything on the order of the difference in cost between the A100 and the 3090. When you're memory-bound, as you end up being here, what matters is that the A100 40G only has about 50-60% more global memory bandwidth than the 3090. So if the implementation is properly optimized and tuned for that architecture (ExLlama isn't, to be clear) then you're looking at 50-60% more tokens per second.
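
For a rough sense of where that bandwidth-bound ceiling comes from, here is a back-of-envelope sketch. The bandwidth figures are approximate public specs and the weight size is a crude 4-bit estimate, not measurements from this thread:

```python
# Back-of-envelope roofline for memory-bound single-stream decoding.
# Bandwidth numbers are approximate public specs, not measurements.

def tokens_per_second_ceiling(weight_bytes: float, mem_bandwidth_gbps: float) -> float:
    """Upper bound on decode speed: each generated token must stream the
    full set of quantized weights through the GPU at least once."""
    return mem_bandwidth_gbps * 1e9 / weight_bytes

# ~70B params at ~4 bits/weight -> roughly 35 GB of weight data read per token.
weights_70b_4bit = 70e9 * 0.5

for name, bw in [("RTX 3090 (~936 GB/s)", 936), ("A100 40GB (~1555 GB/s)", 1555)]:
    print(f"{name}: <= {tokens_per_second_ceiling(weights_70b_4bit, bw):.0f} t/s theoretical")
```

The ratio between the two ceilings (roughly 1.5-1.7x) is the most an ideally tuned kernel could gain on the A100; real throughput sits well below either ceiling because of kernel launch overhead, attention/cache reads, and dequantization cost.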

Now, if you're serving large batches, inference becomes compute-bound instead, and the A100 will outperform the 3090 very easily. But to serve large batches you also need a bunch more VRAM dedicated to state and cache. 40 GB won't get you very far, and even 80 GB is questionable. What use-case are you optimizing for, then? One quantized 70B model serving no more than 8 concurrent users, or something? A small business willing to invest in one A100 but not two, or three? Or if you're also trying to accommodate multi-A100 setups with tensor parallelism and whatnot, at what point does quantization stop making sense?
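
As a rough illustration of how quickly the cache eats into 40 GB, here is a sketch assuming an FP16 cache and Llama-2-70B's published shape (80 layers, 8 KV heads via GQA, head dim 128); the user and context counts are just example numbers:

```python
# Rough KV-cache budget for serving concurrent users with a 70B GQA model.
# Shape parameters follow Llama-2-70B's published config; illustrative only.

def kv_cache_bytes(n_tokens: int, n_layers: int = 80, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # K and V are each stored per layer, per KV head, per token (FP16 here).
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * n_tokens

users, ctx = 8, 4096
total = kv_cache_bytes(users * ctx)
print(f"{total / 2**30:.1f} GiB of cache for {users} users at {ctx} tokens each")
# ~10 GiB on top of ~35 GB of 4-bit weights, which is why 40 GB fills up fast.
```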

But yes, V2 is coming, and it's faster all around, including on the A100. So there's that.