huggingface / transformers-bloom-inference

Fast Inference Solutions for BLOOM
Apache License 2.0

Why does ds-inference int8 run slower than ds-inference fp16? #79

Closed DominickZhang closed 1 year ago

DominickZhang commented 1 year ago

Hi,

I am confused about the performance of ds-inference int8. I would expect it to be at least as fast as ds-inference fp16, but it turns out to be slower. Could you help me with this problem? Many thanks!

(screenshot: benchmark results comparing ds-inference int8 and fp16)
mayank31398 commented 1 year ago

Hi, this is expected behaviour. This is because int8 on ZeRO-Quant (DeepSpeed) is not using CUDA kernels yet. You might want to ask about this support in the DeepSpeed repo :)
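To illustrate why int8 without fused CUDA kernels can be slower than fp16: the int8 weights halve memory, but without a fused int8 GEMM they must be dequantized back to floating point before every matmul, which is extra work on top of the fp16 path. Below is a minimal, self-contained Python sketch of symmetric int8 quantization and the on-the-fly dequantization step; it is purely illustrative and is not DeepSpeed's actual kernel code.

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: one fp scale, int8 values in [-128, 127].
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    # The extra per-inference step when no fused int8 GEMM kernel exists:
    # weights are expanded back to floating point before the matmul runs.
    return [x * scale for x in q]

weights = [0.5, -1.25, 0.03, 2.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Storage is halved (int8 vs fp16), but without kernels that compute
# directly on int8, dequantize-then-matmul costs more than plain fp16.
```

With fused int8 kernels, the dequantization is folded into the GEMM itself, which is where the speedup would come from.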

DominickZhang commented 1 year ago

> Hi, this is expected behaviour. This is because int8 on ZeRO-Quant (DeepSpeed) is not using CUDA kernels yet. You might want to ask about this support in the DeepSpeed repo :)

Thanks a lot! That makes it clear!

I just referred to an official website, where Mixture of Quantization (MoQ) is introduced. MoQ comes with high-performance INT8 inference kernels in DeepSpeed. I am wondering why ZeRO-Quant is used here instead of MoQ?

mayank31398 commented 1 year ago

Haven't looked into MoQ. Closing this for now :)