microsoft / DirectML

DirectML is a high-performance, hardware-accelerated DirectX 12 library for machine learning. DirectML provides GPU acceleration for common machine learning tasks across a broad range of supported hardware and drivers, including all DirectX 12-capable GPUs from vendors such as AMD, Intel, NVIDIA, and Qualcomm.

question about expected performance of quantized operations #282

Open carsonswope opened 2 years ago

carsonswope commented 2 years ago

Hi,

I'm running DirectML 1.9.0 w/ an NVIDIA GTX 1080ti GPU. I've been experimenting with the quantized operations provided by DirectML.

I have found that on my system, the DML_QUANTIZED_LINEAR_CONVOLUTION and DML_CONVOLUTION_INTEGER operators run about 10x slower than the standard DML_CONVOLUTION operator for equivalent convolutions, even with any quantize/dequantize processing steps removed. I know my GPU offers some hardware support for int8 computation, because I can run quantized models via TensorRT and see a speedup. Clearly, though, DirectML is not finding the 'fast' implementation.
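For reference, here's roughly how I'm setting up the integer convolution (a trimmed sketch with placeholder shapes; device creation, binding, and dispatch are omitted, and the float baseline is the same shape run through a regular DML_OPERATOR_CONVOLUTION):

```cpp
#include <d3d12.h>
#include <DirectML.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Trimmed sketch of a DML_CONVOLUTION_INTEGER setup (int8 in, int32 out).
// "dmlDevice" is assumed to have been created earlier via DMLCreateDevice.
void CreateIntegerConv(IDMLDevice* dmlDevice, ComPtr<IDMLOperator>& convOp)
{
    // Placeholder NCHW shapes: 1x64x256x256 input, 64x64x3x3 filter, stride 1, pad 1.
    static const UINT inputSizes[4]  = { 1, 64, 256, 256 };
    static const UINT filterSizes[4] = { 64, 64, 3, 3 };
    static const UINT outputSizes[4] = { 1, 64, 256, 256 };

    auto numElems = [](const UINT* s) { return UINT64(s[0]) * s[1] * s[2] * s[3]; };

    DML_BUFFER_TENSOR_DESC inputBuf = {};
    inputBuf.DataType = DML_TENSOR_DATA_TYPE_INT8;
    inputBuf.DimensionCount = 4;
    inputBuf.Sizes = inputSizes;
    inputBuf.TotalTensorSizeInBytes = numElems(inputSizes);       // 1 byte per element
    DML_TENSOR_DESC inputDesc = { DML_TENSOR_TYPE_BUFFER, &inputBuf };

    DML_BUFFER_TENSOR_DESC filterBuf = inputBuf;                  // also int8, packed
    filterBuf.Sizes = filterSizes;
    filterBuf.TotalTensorSizeInBytes = numElems(filterSizes);
    DML_TENSOR_DESC filterDesc = { DML_TENSOR_TYPE_BUFFER, &filterBuf };

    DML_BUFFER_TENSOR_DESC outputBuf = {};
    outputBuf.DataType = DML_TENSOR_DATA_TYPE_INT32;              // accumulates to int32
    outputBuf.DimensionCount = 4;
    outputBuf.Sizes = outputSizes;
    outputBuf.TotalTensorSizeInBytes = numElems(outputSizes) * sizeof(INT32);
    DML_TENSOR_DESC outputDesc = { DML_TENSOR_TYPE_BUFFER, &outputBuf };

    static const UINT strides[2]   = { 1, 1 };
    static const UINT dilations[2] = { 1, 1 };
    static const UINT padding[2]   = { 1, 1 };

    DML_CONVOLUTION_INTEGER_OPERATOR_DESC convDesc = {};
    convDesc.InputTensor    = &inputDesc;
    convDesc.FilterTensor   = &filterDesc;  // zero-point tensors left null (zero point = 0)
    convDesc.OutputTensor   = &outputDesc;
    convDesc.DimensionCount = 2;            // number of spatial dimensions
    convDesc.Strides        = strides;
    convDesc.Dilations      = dilations;
    convDesc.StartPadding   = padding;
    convDesc.EndPadding     = padding;
    convDesc.GroupCount     = 1;

    DML_OPERATOR_DESC opDesc = { DML_OPERATOR_CONVOLUTION_INTEGER, &convDesc };
    dmlDevice->CreateOperator(&opDesc, IID_PPV_ARGS(&convOp));
}
```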

Is this expected behavior for my hardware? Should I expect a speedup when running quantized operations on a newer NVIDIA GPU with better hardware support for IMMA?

Thanks!

--

Looking at the NVIDIA hardware support table, my GPU (compute capability 6.1) supports int8 but not int8 Tensor Cores, which only arrive in the next generation.

adtsai commented 2 years ago

Hi, quantization support on GPUs is still maturing, and improving our integer performance (INT8/UINT8 in particular) is something we're still working on. In particular, our fast-path quantized operators rely on a feature introduced in Shader Model 6.4, which isn't supported by all GPUs and drivers yet.
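As a quick sanity check, here's one way to query the highest shader model your device and driver report (a minimal sketch; `device` is assumed to be an ID3D12Device you've already created):

```cpp
#include <d3d12.h>

// Minimal sketch: ask the runtime whether this device/driver pair supports
// Shader Model 6.4. "device" is assumed to be an existing ID3D12Device*.
D3D12_FEATURE_DATA_SHADER_MODEL shaderModel = { D3D_SHADER_MODEL_6_4 };
HRESULT hr = device->CheckFeatureSupport(
    D3D12_FEATURE_SHADER_MODEL, &shaderModel, sizeof(shaderModel));
// The runtime lowers HighestShaderModel to what's actually supported, and the
// call fails outright if the OS doesn't recognize the requested model.
bool hasSm64 = SUCCEEDED(hr) && shaderModel.HighestShaderModel >= D3D_SHADER_MODEL_6_4;
```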

If you have a particular scenario in mind, we'd love to hear about your use case if you're comfortable sharing it. This'll help us figure out what to optimize for as we continue to work on our integer performance.

carsonswope commented 2 years ago

Hey, thanks for the quick response @adtsai.

My use case is deploying ML models for video and image processing as part of plugins for popular video editing tools. DirectML is appealing because it works across more than just NVIDIA GPUs and has a much smaller distributable size than the cuDNN + TensorRT libraries. However, execution time matters, especially when working with 4K video. I've achieved significant speedups using quantization with TensorRT but, as I said, haven't been able to reproduce them with DirectML, at least on the hardware I have.

So, basically, I'm looking for fast execution of quantized convolution and matrix multiply operations. If you're curious about specifics, one model I'm working with right now is FastDVDNet, a CNN-based video denoiser. I'm also looking at some transformer-based image processing, such as Dense Prediction Transformers for monocular depth estimation.

FYI: it seems my GPU does support Shader Model 6.4 (https://www.techpowerup.com/gpu-specs/geforce-gtx-1080-ti.c2877). Is it possible to get some kind of log of the decision-making process of the DirectML 'compiler' for a given graph? It would be super helpful to have a little more insight into why it might be missing the fast path.
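For anyone else reading: the only introspection I've found so far is the DirectML debug layer (which, as far as I can tell, surfaces validation messages rather than fast-path decisions) and taking a GPU capture in PIX to see which shaders and metacommands actually execute. A minimal sketch of enabling the debug layer, in case it's useful:

```cpp
#include <d3d12.h>
#include <DirectML.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

// Minimal sketch: create the DML device with debug output enabled. This routes
// DirectML diagnostics through the D3D12 info queue and requires the separate
// DirectML debug layer binary (DirectML.Debug.dll) to be present.
// "d3dDevice" is assumed to be an existing ComPtr<ID3D12Device>.
ComPtr<IDMLDevice> dmlDevice;
HRESULT hr = DMLCreateDevice(
    d3dDevice.Get(),
    DML_CREATE_DEVICE_FLAG_DEBUG,
    IID_PPV_ARGS(&dmlDevice));
```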

Thanks, hope this is helpful.

wunianqing commented 1 year ago

I've run into the same situation. Is there any update here?

daiyicun commented 1 year ago

We're having the same problem. A year later, does DirectML now support fast int8 quantization? Thanks.

yuriymus commented 1 week ago

+1