microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] QuantizeLinear/DequantizeLinear node does not support per-channel on GPU? #15260

Open Kelang-Tian opened 1 year ago

Kelang-Tian commented 1 year ago

Describe the feature request

QuantizeLinear/DequantizeLinear CUDA kernels do not support per-channel

Describe scenario use case

In order to fit a larger model into limited GPU memory without losing accuracy, I want to apply int8 quantization to the weights only.
However, when I run inference, the DequantizeLinear nodes take a lot of time (as shown in the figure below) because the op is not run by the CUDAExecutionProvider. Since the QuantizeLinear/DequantizeLinear CUDA kernels do not have a per-channel implementation, I would like to contribute a per-channel version of the Q/DQ kernels to the community. Would such a contribution be suitable?

Q/DQ implementation: onnxruntime/core/providers/cuda/tensor/quantize_linear.cc
[profiling screenshot attached]
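For context, the per-channel variant of DequantizeLinear differs from the per-tensor one only in that `x_scale` and `x_zero_point` are 1-D tensors broadcast along an `axis` attribute. Below is a minimal NumPy sketch of that semantics (the function name and shapes are illustrative, not ORT code):

```python
import numpy as np

def dequantize_linear_per_channel(x: np.ndarray,
                                  scale: np.ndarray,
                                  zero_point: np.ndarray,
                                  axis: int = 0) -> np.ndarray:
    """y = (x - x_zero_point) * x_scale, with 1-D scale/zero_point broadcast along `axis`."""
    shape = [1] * x.ndim
    shape[axis] = x.shape[axis]
    s = scale.reshape(shape).astype(np.float32)
    zp = zero_point.reshape(shape).astype(np.int32)
    return (x.astype(np.int32) - zp).astype(np.float32) * s

# Example: int8 weights of shape (out_channels, in_channels), quantized per output channel.
w_q = np.random.randint(-128, 128, size=(4, 8), dtype=np.int8)
w_scale = np.random.rand(4).astype(np.float32)
w_zp = np.zeros(4, dtype=np.int8)
w = dequantize_linear_per_channel(w_q, w_scale, w_zp, axis=0)
```

A CUDA kernel would implement the same element-wise formula, with each element indexing into the scale/zero-point entry of its channel.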

elephantpanda commented 1 year ago

Is this why my CUDA timings for the quantized models are so slow compared to DirectML? (#15328) Is there a way to make it faster that I missed?

Kelang-Tian commented 1 year ago

> Is this why my CUDA timings for the quantized models are so slow compared to DirectML? (#15328) Is there a way to make it faster that I missed?

It is very likely that your model is slow because it uses per-channel quantization. That method has no GPU implementation, so those nodes fall back to the CPU. Switching to per-tensor quantization should be much faster, but per-tensor quantization may reduce accuracy.
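To confirm which execution provider actually ran the Q/DQ nodes, ONNX Runtime's built-in profiler records a provider for each node event. A rough sketch, assuming a quantized model at `model_int8.onnx` (a hypothetical path) and float32 inputs; the exact JSON field names can vary between ORT versions:

```python
import json
import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.enable_profiling = True  # writes a per-node profile JSON after the run

sess = ort.InferenceSession(
    "model_int8.onnx",  # hypothetical path to the quantized model
    sess_options=opts,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

# Build dummy inputs (assumes float32 inputs; unknown dims set to 1).
feed = {
    i.name: np.zeros([d if isinstance(d, int) else 1 for d in i.shape], dtype=np.float32)
    for i in sess.get_inputs()
}
sess.run(None, feed)
profile_path = sess.end_profiling()

# List DequantizeLinear nodes that did not run on CUDA.
with open(profile_path) as f:
    events = json.load(f)
for ev in events:
    args = ev.get("args", {})
    if args.get("op_name") == "DequantizeLinear" and args.get("provider") != "CUDAExecutionProvider":
        print(ev.get("name"), "->", args.get("provider"))
```

If the fallback is confirmed and the accuracy hit is acceptable, re-running the quantization tool with per-tensor settings (e.g. `per_channel=False` in onnxruntime.quantization, where available) keeps the Q/DQ nodes on the GPU.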