microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/

Is "High-performance INT8 inference kernels" released? #1833

Open · wanghaoshuang opened this issue 2 years ago

wanghaoshuang commented 2 years ago

https://www.microsoft.com/en-us/research/blog/deepspeed-accelerating-large-scale-model-inference-and-training-via-system-optimizations-and-compression/

High-performance INT8 inference kernels are extensions of generic and specialized Transformer kernels discussed earlier, designed to work together with INT8 parameters trained using MoQ. These kernels offer the same set of optimizations as the FP16 versions, but instead of loading FP16 parameters from main memory, they load INT8 parameters. Once the parameters are loaded to registers or shared memory, they are converted on-the-fly to FP16 before they are used in inference computation. Loading INT8 instead of FP16 reduces the data movement from main memory by half, resulting in up to 2x improvement in inference performance.
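For illustration, here is a minimal CUDA sketch of the dequantize-on-load pattern the quoted paragraph describes. This is not DeepSpeed's released kernel; the kernel name, layout, and per-tensor `scale` parameter are assumptions.

```cuda
#include <cstdint>
#include <cuda_fp16.h>

// Sketch only: each thread reads INT8 weights from global memory (half the
// bytes of FP16), converts them on the fly in registers, and performs the
// math in floating point. `scale` is an assumed per-tensor quantization scale.
__global__ void dequant_gemv_int8(const int8_t* __restrict__ W,  // [rows * cols] INT8 weights
                                  const __half* __restrict__ x,  // [cols] FP16 activations
                                  __half* __restrict__ y,        // [rows] FP16 output
                                  float scale, int rows, int cols)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    float acc = 0.f;
    for (int c = 0; c < cols; ++c) {
        // On-the-fly conversion: INT8 -> float in a register, then multiply-add.
        float w = static_cast<float>(W[row * cols + c]) * scale;
        acc += w * __half2float(x[c]);
    }
    y[row] = __float2half(acc);
}
```

Since the INT8 load moves half the bytes of an FP16 load, a memory-bandwidth-bound kernel like this is what makes the "up to 2x" figure in the blog plausible.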

Has the code for the "High-performance INT8 inference kernels" mentioned above been released in this repo?

RezaYazdaniAminabadi commented 2 years ago

Hi @wanghaoshuang

It is going to be released soon. Please stay tuned.

Thanks, Reza

wanghaoshuang commented 2 years ago

@RezaYazdaniAminabadi Thanks. How is the INT8 GEMM implemented? Is it built on the cuBLAS API, or is it a custom CUDA kernel rewritten to read INT8 weights?
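(For concreteness, the cuBLAS route mentioned here would look roughly like the sketch below: `cublasGemmEx` accepts INT8 inputs with INT32 accumulation. This is a sketch assuming per-tensor quantization and CUDA 11+; `int8_gemm` is a hypothetical helper, not a DeepSpeed API.)

```cuda
#include <cstdint>
#include <cublas_v2.h>

// Sketch only: column-major C = A * B with INT8 inputs and INT32 output.
// NB: cuBLAS restricts INT8 GEMM (supported transpose modes and 4-element
// alignment of dimensions/leading dimensions vary by version); check the
// cuBLAS docs for the exact constraints on your version.
cublasStatus_t int8_gemm(cublasHandle_t handle,
                         const int8_t* A, const int8_t* B, int32_t* C,
                         int m, int n, int k)
{
    const int32_t alpha = 1, beta = 0;  // scaling factors are INT32 for 32I compute
    return cublasGemmEx(handle,
                        CUBLAS_OP_N, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        A, CUDA_R_8I, m,    // lda
                        B, CUDA_R_8I, k,    // ldb
                        &beta,
                        C, CUDA_R_32I, m,   // ldc
                        CUBLAS_COMPUTE_32I,
                        CUBLAS_GEMM_DEFAULT);
}
```

The INT32 result would then be rescaled back to FP16 using the quantization scales; the alternative the question raises, a custom kernel that dequantizes weights as it loads them, is the pattern sketched earlier in the thread.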

gsujankumar commented 2 years ago

Hey @RezaYazdaniAminabadi, are the performance gains reported in the blog based on these unreleased INT8 kernels?