microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

After 8-bit quantization, the GPU inference speed is very slow #8330

Open tianjialai opened 3 years ago

tianjialai commented 3 years ago

After 8-bit quantization, the GPU inference speed is very slow.
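
For reference, a minimal sketch of the kind of quantization being described, using dynamic quantization from `onnxruntime.quantization` (the model paths are placeholders, and dynamic quantization is only one of the methods the reporter may have tried):

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize the FP32 model's weights to 8-bit integers (dynamic quantization).
# "model.onnx" and "model_int8.onnx" are placeholder paths.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```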

faxu commented 3 years ago

More information is needed to debug. Can you share your model, method of quantization, version of ORT used, etc?

tianjialai commented 3 years ago

onnxruntime-gpu==1.1.2

I tried every quantization method.

The GPU version is even twice as slow as the CPU version.
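
A minimal timing sketch for reproducing this kind of comparison, assuming a quantized model `model_int8.onnx` with an input named `input` of shape (1, 3, 224, 224) (both hypothetical), and using the `providers` argument available in more recent ORT releases:

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical quantized model and input; adjust the name and shape to the real model.
feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}

for providers in (["CPUExecutionProvider"], ["CUDAExecutionProvider"]):
    sess = ort.InferenceSession("model_int8.onnx", providers=providers)
    sess.run(None, feed)  # warm-up run to exclude one-time initialization cost
    start = time.perf_counter()
    for _ in range(100):
        sess.run(None, feed)
    print(providers[0], (time.perf_counter() - start) / 100, "s/run")
```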

yufenglee commented 3 years ago

@tianjialai, the CUDA EP doesn't support quantization currently. Please try the TensorRT EP for quantization on GPU. Please refer to the docs here: https://www.onnxruntime.ai/docs/how-to/quantization.html#quantization-on-gpu
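
A minimal sketch of selecting the TensorRT EP as suggested above, assuming an `onnxruntime-gpu` build with TensorRT support and the same placeholder model and input names as before:

```python
import numpy as np
import onnxruntime as ort

# Prefer TensorRT, fall back to CUDA, then CPU if a provider is unavailable.
sess = ort.InferenceSession(
    "model_int8.onnx",  # placeholder path to the quantized model
    providers=[
        "TensorrtExecutionProvider",
        "CUDAExecutionProvider",
        "CPUExecutionProvider",
    ],
)
outputs = sess.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
```

Listing several providers in priority order lets the same script run on machines without TensorRT; `sess.get_providers()` shows which EPs were actually enabled.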