tianjialai opened this issue 3 years ago
After 8-bit quantization, GPU inference speed is very slow.
More information is needed to debug. Can you share your model, method of quantization, version of ORT used, etc?
onnxruntime-gpu==1.1.2
Every quantization method I tried.
The GPU version is even twice as slow as the CPU version.
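For context, a minimal sketch of how an 8-bit dynamically quantized model is typically produced with onnxruntime's quantization tooling. The model paths are placeholders, and the API shown is from recent onnxruntime releases; the 1.1.2 build mentioned above used an older interface.

```python
# Minimal sketch of 8-bit dynamic quantization (assumed workflow; paths
# are placeholders, API is from recent onnxruntime releases).
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input="model_fp32.onnx",   # placeholder: path to the FP32 model
    model_output="model_int8.onnx",  # placeholder: output path for the INT8 model
    weight_type=QuantType.QInt8,     # quantize weights to signed 8-bit integers
)
```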
@tianjialai, the CUDA EP doesn't currently support quantization. Please try the TensorRT EP for quantized models on GPU. Please refer to the docs here: https://www.onnxruntime.ai/docs/how-to/quantization.html#quantization-on-gpu
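A hedged sketch of what running the quantized model through the TensorRT EP looks like, following the linked docs. It assumes an onnxruntime-gpu build compiled with TensorRT support; the model path, provider fallback order, and input shape are illustrative assumptions.

```python
# Sketch: run a quantized model via the TensorRT EP, falling back to CUDA.
# Assumes an onnxruntime-gpu build with TensorRT support; the model path
# and dummy input shape below are placeholders.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model_int8.onnx",  # placeholder: the quantized model from above
    providers=[
        "TensorrtExecutionProvider",  # preferred: executes the INT8 ops on GPU
        "CUDAExecutionProvider",      # fallback for ops TensorRT can't handle
    ],
)

# Placeholder input; real names and shapes come from session.get_inputs().
input_meta = session.get_inputs()[0]
dummy = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_meta.name: dummy})
```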