joehoover opened this issue 1 year ago
The kernels are very specifically optimized for matrix-vector operations (batch size = 1). It also does well on matrix-matrix by reconstructing full-precision matrices on the fly and relying on cuBLAS. The in-between territory is problematic, but I guess the question is what sort of throughput you would expect?
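For illustration, here's a minimal sketch of that dispatch; the function names are placeholders, not the actual kernel entry points:

```python
import torch

def quant_matmul(x, reconstruct_weight, gemv_kernel):
    """Sketch of the dispatch described above (illustrative names only).

    x                  : activations, shape [..., in_features]
    reconstruct_weight : callable returning an fp16 [in_features, out_features] matrix
    gemv_kernel        : callable implementing the fused 4-bit matrix-vector path
    """
    tokens = x.numel() // x.shape[-1]   # number of rows going into the matmul
    if tokens == 1:
        # batch of one token: fused quantized matrix-vector kernel
        return gemv_kernel(x)
    # larger batches: reconstruct fp16 weights on the fly, let cuBLAS handle the GEMM
    w = reconstruct_weight()
    return torch.matmul(x, w)
```

The in-between regime (small but >1 batches) pays the full dequantization cost without a large enough GEMM to amortize it.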
The kernels from NVIDIA folks at https://github.com/tlc-pack/cutlass_fpA_intB_gemm are probably interesting in the batched scenario.
Thanks for the wonderful repo, @turboderp!
I'm benchmarking latency on an A100 and I've observed latency increasing substantially as I increase batch size, to a much larger degree than I'm used to (logs included below):
I'd love to know if I'm missing something or if this is expected!
Setup
- I'm benchmarking with The Bloke's `gptq_model-4bit-128g` build of the `llama-2-13B-chat-GPTQ` checkpoint.
- I'm using `test_benchmark_generation.py` with some minimal modifications to run these benchmarks.
- I'm instantiating `cache` with the batch size and I'm warming up with a batch of ids.
- I'm generating tokens like:
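(Minimal sketch of the loop, assuming the `model.forward(ids, cache)` interface from the benchmark script; `bs` and `gen_tokens` are placeholders for the values I sweep over.)

```python
import time
import torch

# `model` and `cache` are set up as in test_benchmark_generation.py,
# with the cache instantiated for batch size `bs`.
ids = torch.randint(0, 31999, (bs, 1), dtype=torch.long, device="cuda")

torch.cuda.synchronize()
t0 = time.time()
for _ in range(gen_tokens):
    logits = model.forward(ids, cache)                        # one token per sequence
    ids = torch.argmax(logits[:, -1, :], dim=-1, keepdim=True)  # greedy next token
torch.cuda.synchronize()

print(f"bs={bs}: {gen_tokens / (time.time() - t0):.2f} tokens/s")
```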
bs=1
bs=2
bs=4