microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

slow fp16 performance #10919

Open soundarthiaga opened 2 years ago

soundarthiaga commented 2 years ago

Is your feature request related to a problem? Please describe.
For fp16 models, the graph is not optimized the same way as for fp32 models, so fp16 inference ends up slower than fp32.

System information

Describe the solution you'd like
Add the Cast nodes before applying the graph transformers; switching the order of these optimization steps solves the problem and makes fp16 inference faster.
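
One way to see the difference described here is to dump the graph ONNX Runtime actually executes for each model and compare operator counts; if fusions are lost in the fp16 path, that shows up directly. The sketch below is only a diagnostic, not the proposed fix, and the model paths are placeholders; it relies on SessionOptions.optimized_model_filepath, which writes out the optimized graph.

from collections import Counter

import onnx
import onnxruntime as ort

def optimized_op_counts(model_path, dump_path):
    so = ort.SessionOptions()
    so.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    so.optimized_model_filepath = dump_path  # ORT writes the optimized graph here
    ort.InferenceSession(model_path, so, providers=["CPUExecutionProvider"])
    return Counter(node.op_type for node in onnx.load(dump_path).graph.node)

fp32_ops = optimized_op_counts("model_fp32.onnx", "model_fp32.opt.onnx")  # placeholder paths
fp16_ops = optimized_op_counts("model_fp16.onnx", "model_fp16.opt.onnx")
print("fp32 graph:", fp32_ops.most_common(10))
print("fp16 graph:", fp16_ops.most_common(10))
print("Cast nodes in the fp16 graph:", fp16_ops["Cast"])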

faxu commented 2 years ago

Are you using onnxmltools to convert fp32 to fp16?

prashantskit commented 2 years ago

@soundarthiaga, the solution you mention is specific to transformers. Can you suggest what I should do to rectify my problem?

Hi, I am also experiencing the same issue. On CPU, fp16 is slower than fp32 by a significant amount.

Description: I am converting a Tacotron2 model to ONNX following this guide: onnx-export
Framework: PyTorch 1.10.2
Exporting method: torch.onnx.export
Opset: 11
onnx version: 1.11.0
OS: Ubuntu 20.04
Processor: Intel Xeon
Inference is done with onnxruntime, version 1.11.1

I used the convert_float_to_float16 function from onnxconverter_common for the conversion; the input model is ONNX float32 and the expected output is float16.

import onnx
from onnxconverter_common.float16 import convert_float_to_float16

model_fp32_decoder = 'tacotron2/outdir/working-onnx/decoder_iter.onnx'
model_quant_decoder = 'tacotron2/outdir/working-onnx/decoder_iter.fp16.onnx'
model = onnx.load_model(model_fp32_decoder)
new_model = convert_float_to_float16(model)

onnx.save_model(new_model, model_quant_decoder) 
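
For reference, convert_float_to_float16 also accepts keyword arguments that can reduce some of the Cast overhead, though they will not make fp16 fast on a CPU that computes fp16 via fp32. A minimal variant of the conversion above, assuming a recent onnxconverter_common version where keep_io_types and op_block_list are available; the blocked op below is only an example:

import onnx
from onnxconverter_common.float16 import convert_float_to_float16

model = onnx.load_model('tacotron2/outdir/working-onnx/decoder_iter.onnx')

# keep_io_types leaves graph inputs/outputs in float32 and only casts at the
# boundary; op_block_list keeps the listed ops in float32 (example op only).
new_model = convert_float_to_float16(
    model,
    keep_io_types=True,
    op_block_list=['Resize'],
)

onnx.save_model(new_model, 'tacotron2/outdir/working-onnx/decoder_iter.fp16.onnx')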

In one of the issues I read that fp16 is not supported on ARM, but I am using Intel. Can somebody help with this issue? Let me know if more information is needed.

EmreOzkose commented 1 year ago

Any improvement on this issue?

yufenglee commented 1 year ago

This is expected. ORT currently computes fp16 with fp32 operators on CPU. Intel CPUs don't actually support fp16 arithmetic; AVX512-FP16 is introduced in the SPR (Sapphire Rapids) microarchitecture, which will be released to the public early next year. We are investigating adding support for it.
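
A quick way to check whether a given machine has the extension mentioned above is to look for the avx512_fp16 CPU flag. A Linux-only sketch (the flag name is the one the kernel reports in /proc/cpuinfo):

def has_avx512_fp16(cpuinfo_path="/proc/cpuinfo"):
    # Returns True if the kernel reports the avx512_fp16 flag (Linux only).
    try:
        with open(cpuinfo_path) as f:
            return "avx512_fp16" in f.read()
    except OSError:
        return False

print("avx512_fp16 supported:", has_avx512_fp16())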

EmreOzkose commented 1 year ago

Thank you for the detailed explanation.

ShimaShahfar commented 1 year ago

Any update on this issue?

yufenglee commented 1 year ago

We are adding fp16 support on ARM64, targeting heavy operators (MatMul, Conv) and operators used in popular models (MobileNet and other CNN models) for 1.15.
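
To see whether a model is actually hitting fp16 kernels or spending its time in Cast nodes and fp32 fallbacks, ORT's built-in profiler gives per-node timings. A minimal sketch; the model path, input name, shape, and dtype are placeholders and must match your model:

import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True  # write a JSON trace with per-node timings
sess = ort.InferenceSession("model_fp16.onnx", so,
                            providers=["CPUExecutionProvider"])

# Placeholder input; replace with your model's real input name/shape/dtype.
feeds = {"input": np.random.rand(1, 80, 100).astype(np.float16)}
sess.run(None, feeds)

trace_file = sess.end_profiling()
print("Per-node timings written to:", trace_file)
# The trace shows which kernels ran and whether Cast/fp32 fallbacks dominate.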

lucasjinreal commented 1 year ago

Any updates on this in 2023? I got CUDA fp16 running slower than fp32, which is really not what I expected!

➜ python .\scripts\test_ort.py .\models\body3d_full_fp16.onnx
0.046171860694885256
➜ python .\scripts\test_ort.py .\models\body3d_full.onnx
0.04198937177658081

The inputs were already converted to fp16 accordingly, but fp16 is still slower than fp32. I am using the latest 1.14.1 Python package.

Same behavior on the C++ side. Please take a look! @microsoft @faxu
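
On the CUDA EP, timing differences this small can easily be dominated by measurement artifacts: missing warm-up runs (cuDNN autotuning, lazy initialization) and per-call host-to-device copies. A rough benchmarking sketch using IO binding so the input stays on the GPU; the model path, input/output names, shape, and dtype are placeholders:

import time

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("models/body3d_full_fp16.onnx",
                            providers=["CUDAExecutionProvider"])

x = np.random.rand(1, 3, 256, 256).astype(np.float16)  # placeholder shape/dtype
x_gpu = ort.OrtValue.ortvalue_from_numpy(x, "cuda", 0)  # copy input to GPU once

io = sess.io_binding()
io.bind_ortvalue_input("input", x_gpu)  # placeholder input name
io.bind_output("output", "cuda")        # let ORT allocate the output on GPU

for _ in range(10):                     # warm-up runs
    sess.run_with_iobinding(io)

start = time.perf_counter()
for _ in range(100):
    sess.run_with_iobinding(io)
print("mean latency (s):", (time.perf_counter() - start) / 100)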

jasonsi1993 commented 1 year ago

Is the ARM64 fp16 support targeted for 1.15 done already? I am experiencing fp16 slower than fp32 on ARM devices. Do you know what might be causing it?

polodealvarado commented 9 months ago

Hello everyone. Any update on this issue? I got results similar to yours, @lucasjinreal.

spoorgholi74 commented 3 months ago

Same question: my float16 model is optimized using ONNX and is even slightly slower than float32.

DakeQQ commented 3 months ago

The same issue occurs with the arm64-v8a Android ONNX Runtime. The tool onnxruntime.tools.convert_onnx_models_to_ort automatically adds Cast ops that convert all calculations from FP16 to FP32, resulting in worse performance. This appears to originate from line 1305 in inference_session.cc (CastFloat16Transformer). Does anyone have ideas on how to solve this?
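
One way to confirm that the Cast ops come from ORT's own graph transformation rather than from the exported model is to dump the optimized graph from a desktop CPU EP session and count Cast nodes before and after; a small diagnostic sketch with placeholder paths, which should reflect the same transformation the ORT-format conversion applies:

import onnx
import onnxruntime as ort

def count_casts(path):
    return sum(1 for n in onnx.load(path).graph.node if n.op_type == "Cast")

so = ort.SessionOptions()
so.optimized_model_filepath = "model_fp16.opt.onnx"  # placeholder output path
ort.InferenceSession("model_fp16.onnx", so, providers=["CPUExecutionProvider"])

print("Cast nodes before ORT optimization:", count_casts("model_fp16.onnx"))
print("Cast nodes after ORT optimization: ", count_casts("model_fp16.opt.onnx"))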