Did you actually implement batching, or use one of the batching examples? Just setting config.max_batch_size doesn't make it batch, it just makes it possible to batch.
I have implemented batching and it definitely scales for me.
@bjj I use the ExLlamaV2 class directly:
rr_model = ExLlamaV2(config)
...
outputs = rr_model.forward(batch_dict['input_ids'], input_mask=mask, position_offsets=offsets)
where batch_dict['input_ids'] has shape (batch_size, max_len). I assume this is batch processing?
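For illustration, a minimal sketch of how such a left-padded batch could be built in plain PyTorch (the exact dtypes and semantics forward() expects for input_mask and position_offsets are assumptions on my part):

import torch

def pad_batch(token_lists, pad_id=0):
    # Left-pad variable-length token lists into one (batch_size, max_len) tensor,
    # plus a boolean mask of real tokens and a per-row position offset.
    max_len = max(len(t) for t in token_lists)
    ids = torch.full((len(token_lists), max_len), pad_id, dtype=torch.long)
    mask = torch.zeros(len(token_lists), max_len, dtype=torch.bool)
    offsets = torch.zeros(len(token_lists), 1, dtype=torch.int)
    for i, t in enumerate(token_lists):
        pad = max_len - len(t)
        ids[i, pad:] = torch.tensor(t, dtype=torch.long)
        mask[i, pad:] = True
        offsets[i, 0] = -pad  # first real token should sit at position 0
    return {'input_ids': ids}, mask, offsets

batch_dict, mask, offsets = pad_batch(tokenized_examples)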
This could depend on what GPU you're running on, or if you're maybe bottlenecked by CPU performance. More details would help.
I'm running a 3090. The CPU is an AMD 5950X with 128 GB of memory.
Well, the effective batch size is the batch size times the sequence length. In most aspects, doing 100 sequences of length 4 is the same performance as 1 sequence of length 400. In either case, the matmuls have a shape of m=400, n=k=hidden_dim (or intermediate_dim). So it's the m value that determines whether you're compute-bound or memory-bound.
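As a quick illustration (a rough sketch, assuming a CUDA device and an arbitrary hidden size of 4096), both cases flatten to the same (m=400, k=hidden) activation matrix before hitting a linear layer's weights, so they take about the same time:

import torch, time
hidden = 4096
W = torch.randn(hidden, hidden, device='cuda', dtype=torch.half)
# 100 sequences of length 4 vs. 1 sequence of length 400: both flatten to the
# same (400, hidden) activation matrix, so the linear layer sees the same matmul.
a = torch.randn(100, 4, hidden, device='cuda', dtype=torch.half)
b = torch.randn(1, 400, hidden, device='cuda', dtype=torch.half)
for x in (a, b):
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(100):
        y = x.view(-1, hidden) @ W  # m = 400, n = k = hidden
    torch.cuda.synchronize()
    print(tuple(x.shape), f"{(time.time() - t0) * 10:.3f} ms/iter")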
The speed advantage of quantization comes from spending less time streaming weights from memory, so when you're memory-bound (low m) you can get up to a 4x speedup going from 16 to 4 bits per weight. For higher m, the limiting factor will be the compute throughput and quantization doesn't improve performance.
If you've got 4 input sequences each of length 1000, that means m=4000 for the initial forward pass (compute-bound), and m=4 for each token you generate afterwards thanks to caching (largely memory-bound). In sequence classification you usually wouldn't reach the second part, or you only generate a few tokens per sequence.
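To put rough numbers on that (a back-of-the-envelope sketch; the 3090 figures and hidden size are approximate assumptions, and it only models one linear layer's weight streaming vs. its FLOPs):

BW = 936e9      # 3090 memory bandwidth, bytes/s (approximate)
FLOPS = 71e12   # 3090 fp16 tensor throughput, FLOP/s (approximate)
hidden = 4096

def regime(m, bits_per_weight):
    # Compare time to stream the layer's weights once vs. time to do the matmul.
    t_mem = (hidden * hidden * bits_per_weight / 8) / BW
    t_compute = (2 * m * hidden * hidden) / FLOPS
    return "memory-bound" if t_mem > t_compute else "compute-bound"

for m in (4, 400, 4000):
    print(m, regime(m, 16), "@ fp16,", regime(m, 4), "@ 4-bit")

With these numbers, m=4 is memory-bound at both precisions, so streaming 4x fewer weight bytes pays off; at m=400 and above the matmul time dominates either way, so quantization barely changes the run time.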
Thanks @turboderp. I think compute definitely dominates my workload. Your explanation makes sense.
I've been using a modified version of exllamav2 for sequence classification. While it's working great, there seems to be minimal speed gain from quantizing the model to 4 bits: the quantized model runs only around 15% faster than the full model. Additionally, increasing the batch size doesn't seem to improve throughput: running 100 examples through the model with batch_size=4 and batch_size=1 results in the same run time. I'm wondering why this is the case? Thank you.