ganliqiang opened this issue 8 months ago
Why is model inference with int8/int4 so much slower than with float16/float32? Aside from decreased memory consumption, shouldn't int8/int4 be faster?
Same question. These are my speeds:

[screenshots comparing 4-bit, 8-bit, and full-precision inference speed omitted]
I use 1×A800 for inference, with a merged LoRA.
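For anyone trying to reproduce this kind of comparison, here is a minimal sketch of how per-precision generation latency could be timed; the `time_generation` helper and its defaults are illustrative, not from the thread:

```python
import time

import torch


def time_generation(model, tokenizer, prompt, max_new_tokens=64):
    """Return wall-clock seconds for a single generate() call."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()  # flush pending kernels before starting the clock
    start = time.time()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()  # wait for all generation kernels to finish
    return time.time() - start
```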
@ganliqiang @gulegeji how did you two set the precision?
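For reference, this is a minimal sketch of the usual way 4-bit/8-bit precision is set when loading a model with Hugging Face transformers and bitsandbytes; the model ID and quantization settings are assumptions for illustration, not the configuration either poster used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical model, not from the thread

# 4-bit NF4 quantization; use BitsAndBytesConfig(load_in_8bit=True) for the
# 8-bit case, or drop quantization_config entirely for plain fp16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note that bitsandbytes dequantizes the 4-bit/8-bit weights on the fly inside each matmul, which is commonly cited as the reason quantized inference can be slower than fp16 on GPUs that are not memory-bound, such as an A800.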