turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

Q-Cache - Token Generation Speed #499

Open · Vhallo opened this issue 2 weeks ago

Vhallo commented 2 weeks ago

I saw you just added Q8 / Q6, nice! However, I also quickly tested the speed, and FP16 seems to be up to ~50% faster than the Q-cache. Q4/Q6/Q8 don't show any meaningful speed difference among themselves beyond the margin of error.

Example at 20k context size (L3-8B):
Token generation: 36.24 T/s [FP16] -> 24.06 T/s [Q4]
Token generation: 36.49 T/s [FP16] -> 25.07 T/s [Q8]
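For reference, here is a minimal sketch of how such a comparison can be run. It assumes the `ExLlamaV2Cache` / `ExLlamaV2Cache_Q4` / `ExLlamaV2Cache_Q8` class names and the `ExLlamaV2BaseGenerator` API as I understand them from the current exllamav2 Python package, and the model path is a placeholder:

```python
# Rough timing sketch: compare generation speed with the FP16 cache vs the
# quantized caches. Class names follow the exllamav2 Python API as I understand
# it (ExLlamaV2Cache_Q8 being the newly added one); the model path is a placeholder.
# Note: timing generate_simple() includes prompt ingestion, so treat the output
# as a rough comparison rather than a pure token-generation figure.
import time

from exllamav2 import (
    ExLlamaV2, ExLlamaV2Config, ExLlamaV2Tokenizer,
    ExLlamaV2Cache, ExLlamaV2Cache_Q4, ExLlamaV2Cache_Q8,
)
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

MODEL_DIR = "/path/to/L3-8B-8.0bpw-exl2"   # placeholder path
NUM_TOKENS = 256

def bench(cache_cls):
    config = ExLlamaV2Config()
    config.model_dir = MODEL_DIR
    config.prepare()
    config.max_seq_len = 20480                     # ~20k context

    model = ExLlamaV2(config)
    cache = cache_cls(model, lazy = True)          # lazy: allocated during autosplit load
    model.load_autosplit(cache)

    tokenizer = ExLlamaV2Tokenizer(config)
    generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)
    settings = ExLlamaV2Sampler.Settings()

    prompt = "Once upon a time, " * 3000           # filler prompt, roughly fills the context
    generator.warmup()
    t0 = time.time()
    generator.generate_simple(prompt, settings, NUM_TOKENS)
    print(f"{cache_cls.__name__}: {NUM_TOKENS / (time.time() - t0):.2f} T/s")

for cls in (ExLlamaV2Cache, ExLlamaV2Cache_Q4, ExLlamaV2Cache_Q8):
    bench(cls)
```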

Is this large difference in speed expected?

turboderp commented 2 weeks ago

What GPU are you getting these speeds on, and is it a quantized model?

Vhallo commented 2 weeks ago

A 4070, and it's an 8.0bpw quant.

turboderp commented 2 weeks ago

I'm not sure about the 4070 specifically, but yes, the cache quantization does add overhead. It's hard to predict exactly how much, and it will vary from model to model, with less of an impact on larger models. I'm always working on optimizations, but as far as the cache dequantization kernel goes it's pretty much memory bound on the 4090s I mostly test on.

Overall, I guess a 33% drop isn't way outside of what I'd expect from a long context on a small model, though it does seem a little high. :shrug:
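For a rough sense of the data volumes involved, here is a back-of-the-envelope sketch. It assumes the published Llama-3-8B architecture numbers (32 layers, 8 KV heads, head dim 128), and the bytes-per-element figures for Q8/Q4 ignore quantization scales, so the output is illustrative only:

```python
# Back-of-the-envelope KV cache footprint for an L3-8B-sized model at 20k context.
# Architecture constants (32 layers, 8 KV heads, head dim 128) are the published
# Llama-3-8B specs; bytes-per-element for Q8/Q4 ignore scales/zero-points.
layers, kv_heads, head_dim, ctx = 32, 8, 128, 20480

def cache_gib(bytes_per_elem):
    # keys + values -> factor 2
    return 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem / 1024**3

for name, bpe in [("FP16", 2.0), ("Q8", 1.0), ("Q4", 0.5)]:
    print(f"{name}: ~{cache_gib(bpe):.2f} GiB of cache read per generated token at 20k context")

# FP16: ~2.50 GiB, Q8: ~1.25 GiB, Q4: ~0.62 GiB -- the quantized caches move less
# data, but every element read also has to pass through the dequantization kernel,
# and that extra work is what shows up in the generation speed.
```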

Vhallo commented 2 weeks ago

I see! I was definitely expecting some performance cost, I was just surprised by how much faster FP16 is. The 4070 has about half the memory speed of a 4090, though. I'm curious how much people with a 4060 Ti suffer when using the Q-cache then, since its memory is really slow...

Well, keep up the good work and I'm looking forward to future developments, optimizations included!