microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

Inference speed is very slow when using FP16, while FP32 is normal #15170

Open cqray1990 opened 1 year ago

cqray1990 commented 1 year ago

Describe the issue

Inference speed is very slow when using FP16, while the FP32 model runs at normal speed.

To reproduce

Run the same model in FP16 and in FP32 on the CUDA execution provider and compare latency; the FP16 version is much slower. A minimal timing sketch is given below.
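A sketch along these lines can reproduce the comparison; the model paths, input shape handling, and run count are placeholders, since the actual model was not shared in this issue:

```python
import time
import numpy as np
import onnxruntime as ort

# Hypothetical paths; the real model was not attached to the issue.
FP32_MODEL = "model_fp32.onnx"
FP16_MODEL = "model_fp16.onnx"

def benchmark(model_path, dtype, runs=100):
    sess = ort.InferenceSession(
        model_path, providers=["CUDAExecutionProvider", "CPUExecutionProvider"]
    )
    inp = sess.get_inputs()[0]
    # Replace dynamic dimensions with 1 for illustration; adjust to the real model.
    shape = [d if isinstance(d, int) else 1 for d in inp.shape]
    x = np.random.rand(*shape).astype(dtype)
    # Warm up once so kernel selection and memory allocation are not timed.
    sess.run(None, {inp.name: x})
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, {inp.name: x})
    return (time.perf_counter() - start) / runs

print("fp32 avg latency:", benchmark(FP32_MODEL, np.float32))
print("fp16 avg latency:", benchmark(FP16_MODEL, np.float16))
```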

Urgency

Inference speed is very slow when using FP16.

Platform

Linux

OS Version

18

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

onnxruntime-gpu 1.7.0

ONNX Runtime API

Python

Architecture

X86

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.0

Model File

No response

Is this a quantized model?

No

wangyems commented 1 year ago

Would it be possible for you to upgrade to the latest ORT and share the model?

One possible reason is that some ops do not have FP16 type support in the CUDA execution provider, so they fall back to the CPU. A way to check for that is sketched below.
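One way to check for such fallbacks is to enable verbose session logging and look at which execution provider each node is assigned to; a minimal sketch, assuming a hypothetical model_fp16.onnx path:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Verbose logging prints, among other things, the execution provider each
# node is placed on, so FP16 ops falling back to CPU become visible.
so.log_severity_level = 0  # VERBOSE
so.log_verbosity_level = 1

sess = ort.InferenceSession(
    "model_fp16.onnx",  # placeholder path, not from the original issue
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# Confirm the CUDA EP was actually registered for this session.
print(sess.get_providers())
```

Nodes reported under CPUExecutionProvider in the verbose log are candidates for missing FP16 CUDA kernels.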

cqray1990 commented 1 year ago

> Would it be possible for you to upgrade to the latest ORT and share the model?
>
> One possible reason is that some ops do not have FP16 type support in the CUDA execution provider, so they fall back to the CPU.

I also tried onnxruntime-gpu 1.13; the FP16 model is still slower than the FP32 model.

lucasjinreal commented 1 year ago

Same here, 1.13