ganliqiang opened this issue 8 months ago
Why is model inference with int8/int4 so much slower than with float16/float32? Aside from decreased memory consumption, shouldn't int8/int4 be faster?
Same question. These are my speeds:

[screenshots comparing 4-bit, 8-bit, and full-precision inference speed omitted]
I use 1×A800 for inference, with a merged LoRA.
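For anyone trying to reproduce this kind of comparison, here is a minimal sketch of how per-precision generation latency could be timed; the `time_generation` helper and its defaults are illustrative, not from the thread:

```python
import time

import torch


def time_generation(model, tokenizer, prompt, max_new_tokens=64):
    """Return wall-clock seconds for a single generate() call."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    torch.cuda.synchronize()  # flush pending kernels before starting the clock
    start = time.time()
    model.generate(**inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()  # wait for all generation kernels to finish
    return time.time() - start
```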
@ganliqiang @gulegeji how did you two set the precision?
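For reference, this is a minimal sketch of the usual way 4-bit/8-bit precision is set when loading a model with Hugging Face transformers and bitsandbytes; the model ID and quantization settings are assumptions for illustration, not the configuration either poster used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-2-7b-hf"  # hypothetical model, not from the thread

# 4-bit NF4 quantization; use BitsAndBytesConfig(load_in_8bit=True) for the
# 8-bit case, or drop quantization_config entirely for plain fp16.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note that bitsandbytes dequantizes the 4-bit/8-bit weights on the fly inside each matmul, which is commonly cited as the reason quantized inference can be slower than fp16 on GPUs that are not memory-bound, such as an A800.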