I did some testing with 4-bit and 8-bit quantization, and it doesn't seem to improve inference time at all; if anything, it makes it worse. All I did was set `load_in_8bit` or `load_in_4bit` to `True` when loading the model.
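Roughly, that's just the standard `transformers`/`bitsandbytes` quantization flags; a minimal sketch of the pattern (the model path and loader class are placeholders, not the exact LLaVA loading code):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; substitute the actual LLaVA weights / loading code.
model_path = "path/to/llava-checkpoint"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_4bit=True,   # or load_in_8bit=True; both require bitsandbytes
)
```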
Tokens per second at different batch sizes:

| Batch size | Original | 8-bit | 4-bit |
|---:|---:|---:|---:|
| 4 | 36.741259 | 7.710737 | 7.889718 |
| 8 | 27.711023 | 6.584828 | 6.010469 |
| 16 | 18.238883 | 5.710271 | 4.769155 |
The original, unquantized model is much faster, though the gap between the variants narrows as the batch size increases. I've read that quantization doesn't necessarily improve latency because of dequantization overhead. Do these numbers match your expectations for LLaVA, @haotian-liu?
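For context, one rough way to measure a tokens-per-second figure like the ones above looks like this (illustrative only, not my exact script; the helper and its parameters are made up for the example):

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompts, max_new_tokens=128):
    """Rough throughput: newly generated tokens / wall-clock generation time."""
    # Assumes tokenizer.pad_token is set so the batch can be padded.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    torch.cuda.synchronize()  # assumes a CUDA device
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
    return new_tokens / elapsed
```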