I did some testing with 4-bit and 8-bit quantization, and it doesn't seem to improve inference time at all; if anything, it makes it worse. All I did was set `load_in_8bit` or `load_in_4bit` to `True` when loading the model.
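Roughly, that's just the standard `transformers`/`bitsandbytes` quantization flags; a minimal sketch of the pattern (the model path and loader class are placeholders, not the exact LLaVA loading code):

```python
import torch
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; substitute the actual LLaVA weights / loading code.
model_path = "path/to/llava-checkpoint"

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.float16,
    device_map="auto",
    load_in_4bit=True,   # or load_in_8bit=True; both require bitsandbytes
)
```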
Tokens per second at different batch sizes:

| Batch size | Original | 8-bit | 4-bit |
|---:|---:|---:|---:|
| 4 | 36.741259 | 7.710737 | 7.889718 |
| 8 | 27.711023 | 6.584828 | 6.010469 |
| 16 | 18.238883 | 5.710271 | 4.769155 |
The original, unquantized model is much faster, though the gap between the variants narrows as the batch size increases. I've read that quantization doesn't necessarily improve latency because of dequantization overhead. Do these numbers match your expectations for LLaVA, @haotian-liu?
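For context, one rough way to measure a tokens-per-second figure like the ones above looks like this (illustrative only, not my exact script; the helper and its parameters are made up for the example):

```python
import time
import torch

def tokens_per_second(model, tokenizer, prompts, max_new_tokens=128):
    """Rough throughput: newly generated tokens / wall-clock generation time."""
    # Assumes tokenizer.pad_token is set so the batch can be padded.
    inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)
    torch.cuda.synchronize()  # assumes a CUDA device
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    new_tokens = (out.shape[-1] - inputs["input_ids"].shape[-1]) * out.shape[0]
    return new_tokens / elapsed
```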