turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

4 bit quantization performance? #299

Closed cvhoang closed 3 months ago

cvhoang commented 5 months ago

I've been using a modified version of exllamav2 for sequence classification. While it works great, there seems to be minimal speed gain from quantizing the model to 4 bits: the quantized model runs only about 15% faster than the full-precision model. Additionally, increasing the batch size doesn't seem to improve throughput: running 100 examples through the model with batch_size=4 and with batch_size=1 takes the same amount of time.

I'm wondering why this is the case? Thank you.

bjj commented 5 months ago

Did you actually implement batching, or use one of the batching examples? Just setting config.max_batch_size doesn't make it batch; it only makes batching possible.

I have implemented batching and it definitely scales for me.

cvhoang commented 5 months ago

@bjj I use the ExLlamaV2 class directly:

rr_model = ExLlamaV2(config)
...
outputs = rr_model.forward(batch_dict['input_ids'], input_mask=mask, position_offsets=offsets)

where batch_dict['input_ids'] has shape (batch_size, max_len). I assume this counts as batch processing?
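
For what it's worth, the forward call itself is only half of batching; the sequences also have to be padded to a common length with a matching mask. Below is a minimal sketch of that preprocessing, assuming token IDs are already available as Python lists and that left-padding with per-row position offsets is what's wanted. The helper is hypothetical, not part of exllamav2, and the exact conventions the library expects for input_mask and position_offsets should be checked against its own batching examples.

```python
import torch

def make_batch(token_id_lists, pad_id=0):
    # Hypothetical helper: left-pad a list of token-ID sequences to a common
    # length and build the mask / position-offset tensors for a batched call.
    # The exact conventions exllamav2 expects for input_mask and
    # position_offsets should be verified against the library's own examples.
    batch_size = len(token_id_lists)
    max_len = max(len(ids) for ids in token_id_lists)

    input_ids = torch.full((batch_size, max_len), pad_id, dtype=torch.long)
    mask = torch.zeros((batch_size, max_len), dtype=torch.bool)
    offsets = torch.zeros((batch_size, 1), dtype=torch.int)

    for row, ids in enumerate(token_id_lists):
        pad = max_len - len(ids)
        input_ids[row, pad:] = torch.tensor(ids, dtype=torch.long)
        mask[row, pad:] = True      # True marks real (non-padding) tokens
        offsets[row, 0] = -pad      # shift positions so real tokens start at 0
    return input_ids, mask, offsets

# Usage with the call above (input_ids has shape (batch_size, max_len)):
#   input_ids, mask, offsets = make_batch(batch_token_ids)
#   outputs = rr_model.forward(input_ids, input_mask=mask, position_offsets=offsets)
```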

turboderp commented 3 months ago

This could depend on what GPU you're running on, or if you're maybe bottlenecked by CPU performance. More details would help.

cvhoang commented 3 months ago

I'm running a 3090. The CPU is an AMD 5950X with 128GB of memory.

turboderp commented 3 months ago

Well, the effective batch size is the batch size times the sequence length. In most respects, doing 100 sequences of length 4 performs the same as doing 1 sequence of length 400. In either case, the matmuls have a shape of m=400, n=k=hidden_dim (or intermediate_dim). So it's the m value that determines whether you're compute-bound or memory-bound.

The speed advantage of quantization comes from spending less time streaming weights from memory, so when you're memory-bound (low m) you can get up to a 4x speedup going from 16 to 4 bits per weight. For higher m, the limiting factor will be the compute throughput and quantization doesn't improve performance.

If you've got 4 input sequences each of length 1000, that means m=4000 for the initial forward pass (compute-bound), and m=4 for each token you generate afterwards thanks to caching (largely memory-bound). In sequence classification you usually wouldn't reach the second part, or you only generate a few tokens per sequence.
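
A rough roofline-style sanity check of this argument, using ballpark RTX 3090 figures (~71 TFLOPS FP16 tensor throughput, ~936 GB/s memory bandwidth; both are approximate assumptions, not measurements):

```python
# Roofline-style sketch of the m argument above. The GPU figures are rough
# RTX 3090 ballpark values (assumptions), only meant to show roughly where
# the compute-bound / memory-bound crossover falls.
PEAK_FLOPS = 71e12      # ~FP16 tensor-core throughput, FLOP/s (approximate)
BANDWIDTH  = 936e9      # memory bandwidth, bytes/s (approximate)
RIDGE = PEAK_FLOPS / BANDWIDTH   # ~76 FLOP per byte streamed

def arithmetic_intensity(m, bytes_per_weight):
    # For an (m, k) x (k, n) matmul dominated by weight traffic:
    # FLOPs ~ 2*m*n*k, bytes ~ n*k*bytes_per_weight  =>  intensity ~ 2*m / bytes_per_weight
    return 2 * m / bytes_per_weight

for m in (4, 400, 4000):
    for label, bpw in (("fp16", 2.0), ("4-bit", 0.5)):
        ai = arithmetic_intensity(m, bpw)
        bound = "compute-bound" if ai > RIDGE else "memory-bound"
        print(f"m={m:5d} {label:5s}: intensity ~{ai:7.1f} FLOP/byte -> {bound}")

# With these numbers, m=4 (token-by-token decoding) stays memory-bound even at
# 4 bits per weight, which is where the ~4x reduction in weight traffic pays off,
# while m=4000 (the prefill / classification pass) is compute-bound, so 16-bit
# and 4-bit weights run at roughly the same speed.
```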

cvhoang commented 3 months ago

Thanks @turboderp. I think compute definitely dominates my workload. Your explanation makes sense.