Hi,

I am confused about the performance of ds-inference int8. I think it should be at least as fast as ds-inference fp16, but it turns out to be slower. Could you help with my problem? Many thanks!
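A minimal sketch of the kind of comparison I mean (the model name, batch size, and generation lengths are placeholders, not my actual workload):

```python
import time

import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "bigscience/bloom-560m"  # placeholder checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
inputs = tokenizer("DeepSpeed inference benchmark", return_tensors="pt").to("cuda")

def time_engine(dtype):
    """Build a ds-inference engine at the given precision and time generation."""
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
    engine = deepspeed.init_inference(
        model,
        mp_size=1,                       # single GPU
        dtype=dtype,                     # torch.float16 vs torch.int8
        replace_with_kernel_inject=True,
    )
    for _ in range(3):                   # warm-up runs
        engine.module.generate(**inputs, max_new_tokens=32)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(10):
        engine.module.generate(**inputs, max_new_tokens=32)
    torch.cuda.synchronize()
    return (time.time() - start) / 10   # average seconds per call

print("fp16:", time_engine(torch.float16))
print("int8:", time_engine(torch.int8))  # slower than fp16, which surprised me
```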
Hi, this is expected behaviour. This is because int8 on ZeRO-Quant (DeepSpeed) is not using CUDA kernels yet. You might want to ask about this support in the DeepSpeed repo :)
Thanks a lot! Now it's clear to me!
I just read the official DeepSpeed documentation, which introduces Mixture of Quantization (MoQ). MoQ involves high-performance INT8 inference kernels in DeepSpeed. I am wondering: why do we use ZeRO-Quant here instead of MoQ?
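For context, the tutorial seems to enable MoQ through the DeepSpeed config. A rough sketch of what I understand that to look like (key names and values are from memory and may not match the current schema exactly):

```python
# Sketch of a MoQ-style DeepSpeed config; the quantize_training keys below
# are my reading of the MoQ tutorial, not a verified schema.
ds_config = {
    "train_batch_size": 8,  # placeholder
    "quantize_training": {
        "enabled": True,
        "quantizer_kernel": True,  # the INT8 kernels the docs mention
        "quantize_bits": {"start_bits": 16, "target_bits": 8},
        "quantize_schedule": {"quantize_period": 400, "schedule_offset": 0},
        "quantize_groups": 8,
    },
}

# MoQ quantizes during fine-tuning, so (as I understand it) this config would
# go to deepspeed.initialize rather than deepspeed.init_inference, e.g.:
# engine, _, _, _ = deepspeed.initialize(model=model, config=ds_config)
```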
Haven't looked into MoQ. Closing this for now :)