microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

After 8-bit quantization, the GPU inference speed is very slow #8330

Open tianjialai opened 3 years ago

tianjialai commented 3 years ago

After 8-bit quantization, the GPU inference speed is very slow.
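
For reference, a minimal sketch of the kind of quantization being described, using dynamic quantization from `onnxruntime.quantization` (the model paths are placeholders, and dynamic quantization is only one of the methods the reporter may have tried):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the FP32 model's weights to 8-bit integers (dynamic quantization).
# "model.onnx" and "model_int8.onnx" are placeholder paths.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```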

faxu commented 3 years ago

More information is needed to debug. Can you share your model, method of quantization, version of ORT used, etc?

tianjialai commented 3 years ago

onnxruntime-gpu==1.1.2

I tried every quantization method.

The GPU version is even twice as slow as the CPU version.
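
A minimal timing sketch for reproducing this kind of comparison, assuming a quantized model `model_int8.onnx` with an input named `input` of shape (1, 3, 224, 224) (both hypothetical), and using the `providers` argument available in more recent ORT releases:

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical quantized model and input; adjust the name and shape to the real model.
feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

for providers in (["CPUExecutionProvider"], ["CUDAExecutionProvider"]):
    sess = ort.InferenceSession("model_int8.onnx", providers=providers)
    sess.run(None, feed)  # warm-up run to exclude one-time initialization cost
    start = time.perf_counter()
    for _ in range(100):
        sess.run(None, feed)
    print(providers[0], (time.perf_counter() - start) / 100, "s/run")
```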

yufenglee commented 3 years ago

@tianjialai, the CUDA EP doesn't support quantization currently. Please try the TensorRT EP for quantization on GPU. Please refer to the docs here: https://www.onnxruntime.ai/docs/how-to/quantization.html#quantization-on-gpu
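
A minimal sketch of selecting the TensorRT EP as suggested above, assuming an `onnxruntime-gpu` build with TensorRT support and the same placeholder model and input names as before:

```python
import numpy as np
import onnxruntime as ort

# Prefer TensorRT, fall back to CUDA, then CPU if a provider is unavailable.
sess = ort.InferenceSession(
    "model_int8.onnx",  # placeholder path to the quantized model
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
outputs = sess.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
```

Listing several providers in priority order lets the same script run on machines without TensorRT; `sess.get_providers()` shows which EPs were actually enabled.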