Open Kelang-Tian opened 1 year ago
Is this why my CUDA timings for the quantized models are so slow compared to DirectML? #15328 Is there a way to make them faster that I missed?
It is very likely that your model takes a long time because your quantized model uses per-channel quantization. That method has no GPU implementation, so it may be much faster to switch to per-tensor quantization (though per-tensor quantization can affect accuracy).
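For reference, the switch to per-tensor scales could look roughly like this with onnxruntime's Python quantization tooling. This is only a sketch: the model paths, the input name, and the random calibration samples are placeholder assumptions, not taken from the issue.

```python
import numpy as np
from onnxruntime.quantization import (
    CalibrationDataReader, QuantFormat, QuantType, quantize_static,
)

# Placeholder calibration reader: the input name "input" and the shape are
# assumptions; substitute your model's real inputs and representative data.
class DummyReader(CalibrationDataReader):
    def __init__(self, num_samples=8):
        self._data = iter(
            [{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
             for _ in range(num_samples)]
        )

    def get_next(self):
        return next(self._data, None)

quantize_static(
    "model_fp32.onnx",                   # placeholder path to the float model
    "model_int8.onnx",                   # placeholder output path
    calibration_data_reader=DummyReader(),
    quant_format=QuantFormat.QDQ,
    per_channel=False,                   # per-tensor scales: existing CUDA Q/DQ kernels apply
    weight_type=QuantType.QInt8,
)
```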
Describe the feature request
QuantizeLinear/DequantizeLinear CUDA kernels do not support per-channel quantization.
Describe scenario use case
In order to fit a larger model without losing accuracy when GPU memory is limited, I want to perform int8 quantization on the weights only. However, when I run inference on the model, I find that the DequantizeLinear node takes a lot of time (as shown in the figure below) because the op is not run by the CUDAExecutionProvider. Since the QuantizeLinear/DequantizeLinear CUDA kernels do not support a per-channel implementation, I intend to contribute a per-channel version of Q/DQ to the community (the expected per-channel semantics are sketched after the link below). Would that be suitable?
Q/DQ implementation: `onnxruntime/core/providers/cuda/tensor/quantize_linear.cc`
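For clarity, here is a minimal NumPy reference of the per-channel DequantizeLinear semantics from the ONNX operator spec, i.e. what a per-channel CUDA kernel would need to compute. The function and variable names are illustrative, not the actual kernel code.

```python
import numpy as np

def dequantize_linear_per_channel(x, x_scale, x_zero_point, axis=1):
    """Reference per-channel DequantizeLinear: y = (x - zero_point) * scale,
    where scale and zero_point are 1-D tensors broadcast along `axis`."""
    # Reshape the 1-D scale/zero-point so they broadcast along the channel axis.
    shape = [1] * x.ndim
    shape[axis] = x_scale.shape[0]
    scale = x_scale.reshape(shape).astype(np.float32)
    zp = x_zero_point.reshape(shape).astype(np.int32)
    return (x.astype(np.int32) - zp).astype(np.float32) * scale

# Example: int8 weights of shape (out_channels, in_channels, kH, kW),
# quantized per output channel (axis=0).
w_q = np.random.randint(-128, 127, size=(4, 3, 3, 3), dtype=np.int8)
scales = np.random.rand(4).astype(np.float32)
zps = np.zeros(4, dtype=np.int8)
w_fp = dequantize_linear_per_channel(w_q, scales, zps, axis=0)
print(w_fp.shape)  # (4, 3, 3, 3)
```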