huggingface / transformers-bloom-inference

Fast Inference Solutions for BLOOM
Apache License 2.0

Why does ds-inference int8 run slower than ds-inference fp16? #79

Closed DominickZhang closed 1 year ago

DominickZhang commented 1 year ago

Hi,

I am confused about the performance of ds-inference int8. I would expect it to be at least as fast as ds-inference fp16, but it turns out to be slower. Could you help me with this problem? Many thanks!

(screenshot: benchmark results comparing ds-inference int8 and fp16)
mayank31398 commented 1 year ago

Hi, this is expected behaviour. This is because int8 on ZeRO-Quant (DeepSpeed) is not using CUDA kernels yet. You might want to ask about this support in the DeepSpeed repo :)
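To illustrate why int8 without fused CUDA kernels can be slower than fp16: the int8 weights halve memory, but without a fused int8 GEMM they must be dequantized back to floating point before every matmul, which is extra work on top of the fp16 path. Below is a minimal, self-contained Python sketch of symmetric int8 quantization and the on-the-fly dequantization step; it is purely illustrative and is not DeepSpeed's actual kernel code.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: one fp scale, int8 values in [-128, 127].
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # The extra per-inference step when no fused int8 GEMM kernel exists:
    # weights are expanded back to floating point before the matmul runs.
    return [x * scale for x in q]

weights = [0.5, -1.25, 0.03, 2.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Storage is halved (int8 vs fp16), but without kernels that compute
# directly on int8, dequantize-then-matmul costs more than plain fp16.
```

With fused int8 kernels, the dequantization is folded into the GEMM itself, which is where the speedup would come from.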

DominickZhang commented 1 year ago

> Hi, this is expected behaviour. This is because int8 on ZeRO-Quant (DeepSpeed) is not using CUDA kernels yet. You might want to ask about this support in the DeepSpeed repo :)

Thanks a lot! That makes it clear!

I just referred to an official website, where Mixture of Quantization (MoQ) is introduced. MoQ comes with high-performance INT8 inference kernels in DeepSpeed. I am wondering why ZeRO-Quant is used here instead of MoQ?

mayank31398 commented 1 year ago

Haven't looked into MoQ. Closing this for now :)