yeliang2258 opened 1 year ago
FP16 acceleration needs either of the following two conditions: (1) a GPU like P100, V100, T4, A100, etc. whose FP16 TFLOPS are higher than its FP32 TFLOPS, and a model whose computation is dominated by MatMul, Gemm, Conv, etc.; or (2) the model is I/O bound, so that using FP16 inputs/outputs speeds up I/O. This depends on GPU memory bandwidth, input and output sizes, and compute latency.
Please try optimizing the FP32 model first, then convert the optimized model to FP16. Otherwise, some optimizations might not be applied to the FP16 model.
One way to do that is to use a session option that saves the optimized model, like: https://github.com/microsoft/onnxruntime/blob/a30b57da6e1d985a5d6ecf433206c212cc469f8c/onnxruntime/python/tools/transformers/optimizer.py#L107
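For reference, a minimal sketch of that workflow, assuming onnxruntime-gpu and onnxconverter-common are installed (file paths are placeholders):

```python
import onnx
import onnxruntime as ort
from onnxconverter_common import float16

# 1. Let ONNX Runtime optimize the FP32 graph and save the optimized model.
#    (BASIC/EXTENDED may be preferable for offline saving, since higher levels
#    can insert provider-specific fused ops.)
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "model_fp32_optimized.onnx"  # placeholder
_ = ort.InferenceSession("model_fp32.onnx", sess_options,
                         providers=["CUDAExecutionProvider"])

# 2. Convert the optimized FP32 model to FP16.
model = onnx.load("model_fp32_optimized.onnx")
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "model_fp16.onnx")
```

For transformer models, the onnxruntime.transformers optimizer linked above can do both the graph optimization and the FP16 conversion in one pass.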
@tianleiwu Not at all. My RTX 2060 didn't get a speedup either; even worse, FP16 is slower than FP32, although my GPU has Tensor Cores and SMs for FP16 computation.
My test results are similar to yours; most of the models did not gain a speedup on T4.
Describe the issue
Hello, I used the float16 tool to convert FP32 models to FP16 and ran inference with ONNX Runtime GPU 1.13.1. I found that many models do not obtain any inference acceleration. I want to know what kinds of FP32 ONNX models can obtain inference acceleration from FP16 on GPU. Looking forward to your answer, thank you.
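A minimal sketch of the kind of latency comparison being described, assuming the float16 converter produced model_fp16.onnx and the model takes a single image-like input (input name and shape are placeholders; if the FP16 model's inputs were also converted, feed float16 data instead):

```python
import time
import numpy as np
import onnxruntime as ort

def measure(model_path, feed, n_iters=100):
    """Average latency in milliseconds over n_iters runs, after one warm-up run."""
    sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
    sess.run(None, feed)  # warm-up
    start = time.perf_counter()
    for _ in range(n_iters):
        sess.run(None, feed)
    return (time.perf_counter() - start) / n_iters * 1e3

# Placeholder input name and shape; adjust to the actual model.
feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
print("FP32:", measure("model_fp32.onnx", feed), "ms")
print("FP16:", measure("model_fp16.onnx", feed), "ms")
```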
To reproduce
None
Urgency
No response
Platform
Linux
OS Version
Ubuntu 16.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.13.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
No response
Model File
No response
Is this a quantized model?
No