microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] FP16 model can not get acceleration on GPU with ONNXRuntime-GPU #15534

Open yeliang2258 opened 1 year ago

yeliang2258 commented 1 year ago

Describe the issue

Hello, I used the float16 tool to convert an FP32 model to FP16 and ran inference with ONNXRuntime-GPU 1.13.1. I found that many models do not get any inference acceleration. What kinds of ONNX FP32 models can get inference acceleration from FP16 on a GPU? Looking forward to your answer, thank you.
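
For reference, the conversion was presumably done along these lines, assuming the float16 tool from onnxconverter-common; the file names are placeholders:

```python
# Hedged sketch of the FP32 -> FP16 conversion described above.
# "model_fp32.onnx" / "model_fp16.onnx" are placeholder paths.
import onnx
from onnxconverter_common import float16  # pip install onnxconverter-common

model = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model_fp16.onnx")
```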

To reproduce

None

Urgency

No response

Platform

Linux

OS Version

Ubuntu 16.04

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.13.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

No response

Model File

No response

Is this a quantized model?

No

tianleiwu commented 1 year ago

FP16 acceleration needs one of the following two conditions: (1) a GPU like P100, V100, T4, or A100, whose FP16 TFLOPS are higher than its FP32 TFLOPS, and a model whose computation is dominated by MatMul, Gemm, Conv, etc.; or (2) a model that is I/O bound, so that using FP16 inputs/outputs speeds up I/O. This depends on GPU memory bandwidth, input and output sizes, and compute latency.
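
A quick way to check which regime a model falls into is to compare the average latency of the FP32 and FP16 sessions directly. A minimal sketch, assuming placeholder file names and an input named "input" with shape (1, 3, 224, 224):

```python
# Minimal latency comparison between an FP32 and an FP16 model on the CUDA EP.
# File names, the input name "input", and the input shape are placeholders.
import time
import numpy as np
import onnxruntime as ort

def bench(model_path, feed, runs=100):
    sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
    for _ in range(10):                      # warm-up runs
        sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs * 1000.0  # ms per inference

feed = {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}
print("FP32: %.2f ms" % bench("model_fp32.onnx", feed))
# Works unchanged if the FP16 model was converted with keep_io_types=True;
# otherwise cast the input to np.float16 first.
print("FP16: %.2f ms" % bench("model_fp16.onnx", feed))
```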

Please try optimizing the FP32 model first, then convert the optimized model to FP16. Otherwise, some optimizations might not be applied to the FP16 model.

One way to do that is the session option that saves the optimized model, as used here: https://github.com/microsoft/onnxruntime/blob/a30b57da6e1d985a5d6ecf433206c212cc469f8c/onnxruntime/python/tools/transformers/optimizer.py#L107
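
A minimal sketch of that workflow, with placeholder file names: save the ORT-optimized FP32 graph via SessionOptions, then convert the saved graph to FP16.

```python
# Hedged sketch: run ORT graph optimizations on the FP32 model, save the
# optimized graph, then convert that graph to FP16. File names are placeholders.
import onnx
import onnxruntime as ort
from onnxconverter_common import float16

# 1) Optimize the FP32 model and save the optimized graph to disk.
#    ORT_ENABLE_BASIC avoids baking EP-specific fused ops into the saved model.
so = ort.SessionOptions()
so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
so.optimized_model_filepath = "model_fp32_optimized.onnx"
ort.InferenceSession("model_fp32.onnx", so, providers=["CUDAExecutionProvider"])

# 2) Convert the optimized FP32 graph to FP16; keep_io_types=True keeps
#    FP32 inputs/outputs so existing feeds still work.
model = onnx.load("model_fp32_optimized.onnx")
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)
onnx.save(model_fp16, "model_fp16.onnx")
```

For transformer models, the linked optimizer script (onnxruntime.transformers.optimizer) also performs operator fusion and exposes its own convert_float_to_float16 step on the returned model.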

lucasjinreal commented 1 year ago

@tianleiwu Not at all. My RTX 2060 didn't get a speedup either; even worse, FP16 is slower than FP32, even though my GPU has Tensor Cores and SMs for FP16 computation.

yeliang2258 commented 1 year ago

> @tianleiwu Not at all. My RTX 2060 didn't get a speedup either; even worse, FP16 is slower than FP32, even though my GPU has Tensor Cores and SMs for FP16 computation.

My test results are similar to yours: most of the models did not get a speedup on a T4.