haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Question] Quantization does not improve latency #996

Open dzenilee opened 10 months ago

dzenilee commented 10 months ago

Question

I did some testing with 4-bit and 8-bit quantization, and it doesn't seem to improve inference time at all; in fact, it seems to make it worse. All I did was set load_in_8bit or load_in_4bit to True here:

from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path

model_path = "liuhaotian/llava-v1.5-13b"

# Load the 13B checkpoint with 8-bit weights (bitsandbytes backend).
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    load_in_8bit=True,
    model_name=get_model_name_from_path(model_path),
)
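
The 4-bit runs use the same call with the flag swapped:

# 4-bit variant: identical call, only the quantization flag changes
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    load_in_4bit=True,
    model_name=get_model_name_from_path(model_path),
)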

Tokens per second at different batch sizes:

            batch size 4    batch size 8    batch size 16
original       36.741259       27.711023        18.238883
8bit            7.710737        6.584828         5.710271
4bit            7.889718        6.010469         4.769155
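
For context, these numbers come from timing batched generate calls and dividing the generated token count by wall-clock time. A minimal sketch of that kind of measurement (not my exact script; the prompt, image, and token-counting details are illustrative) would be:

import time
import torch
from PIL import Image
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import process_images, tokenizer_image_token

def tokens_per_second(model, tokenizer, image_processor, image_path, prompt, batch_size, max_new_tokens=128):
    # Preprocess one image and repeat it to form a batch (no conversation template, kept minimal).
    image = Image.open(image_path).convert("RGB")
    image_tensor = process_images([image], image_processor, model.config).to(model.device, dtype=torch.float16)
    images = image_tensor.repeat(batch_size, 1, 1, 1)
    # Tokenize a prompt containing the <image> placeholder and repeat it across the batch.
    input_ids = tokenizer_image_token(DEFAULT_IMAGE_TOKEN + "\n" + prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    input_ids = input_ids.unsqueeze(0).repeat(batch_size, 1).to(model.device)
    torch.cuda.synchronize()  # assumes a CUDA device
    start = time.perf_counter()
    with torch.inference_mode():
        output_ids = model.generate(input_ids, images=images, do_sample=False, max_new_tokens=max_new_tokens, use_cache=True)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # output_ids contains only the generated tokens, so this is a rough count
    # (an overestimate when some sequences hit EOS early and the rest is padding).
    return output_ids.numel() / elapsed

# Example (placeholder image path and prompt):
# print(tokens_per_second(model, tokenizer, image_processor, "sample.jpg", "Describe the image.", batch_size=4))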

The original, unquantized model is much faster, though the gap between the models narrows as the batch size increases. I've seen it suggested that this kind of quantization doesn't necessarily improve latency because of the overhead of dequantizing the weights at inference time. Do these numbers match your expectations for LLaVA, @haotian-liu?
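
One sanity check that might help interpret this (a small sketch, assuming the bitsandbytes backend behind load_in_8bit/load_in_4bit and a bitsandbytes version that provides bnb.nn.Linear4bit): confirm that the linear layers were actually replaced with quantized modules, so the slowdown can be attributed to dequantization overhead rather than quantization silently not being applied.

from collections import Counter
import torch
import bitsandbytes as bnb

def summarize_linear_layers(model):
    # Tally which Linear implementations ended up in the loaded model;
    # Linear8bitLt / Linear4bit mean the bitsandbytes quantized path is active.
    return Counter(
        type(module).__name__
        for module in model.modules()
        if isinstance(module, (torch.nn.Linear, bnb.nn.Linear8bitLt, bnb.nn.Linear4bit))
    )

print(summarize_linear_layers(model))
# Expect mostly Linear8bitLt (or Linear4bit) for the quantized runs,
# and plain Linear for the original fp16 model.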

wojiaoshihua commented 8 months ago

Have you made any progress on this issue?