Open cqray1990 opened 1 year ago
Would it be possible for you to upgrade to the latest ORT and share the model?
One possible reason is that some ops do not have fp16 support on the CUDA EP, so they fall back to the CPU.
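One way to check for that fallback is to create the session with verbose logging, which prints the execution provider each node was assigned to. A minimal sketch, assuming a placeholder model path:

```python
import onnxruntime as ort

so = ort.SessionOptions()
# Severity 0 = VERBOSE; the session-creation log then reports which
# execution provider each node was placed on.
so.log_severity_level = 0

sess = ort.InferenceSession(
    "model_fp16.onnx",  # placeholder path for your fp16 model
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
# Nodes logged as assigned to CPUExecutionProvider are the ones
# falling back off the GPU, which can dominate fp16 latency.
```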
Also, I'm using onnxruntime-gpu 1.13, and fp16 is still slower than the fp32 model.
Same here, 1.13
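For anyone hitting this on 1.13, profiling a run shows which ops dominate the time. A minimal sketch, assuming a placeholder model path, input name, and shape:

```python
import numpy as np
import onnxruntime as ort

so = ort.SessionOptions()
so.enable_profiling = True  # writes a JSON trace of per-node run times

sess = ort.InferenceSession(
    "model_fp16.onnx",  # placeholder path
    sess_options=so,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
x = np.random.rand(1, 3, 224, 224).astype(np.float16)  # assumed shape
sess.run(None, {"input": x})  # "input" is an assumed input name

# end_profiling() returns the trace file path; open it in a trace
# viewer (e.g. chrome://tracing) to see where the time goes.
print(sess.end_profiling())
```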
Describe the issue
Inference speed is very slow when using fp16, while fp32 is normal.
To reproduce
Inference speed is very slow when using fp16, while fp32 is normal. A sketch of the comparison is below.
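A minimal benchmark along these lines, assuming placeholder model paths, input name, and shape:

```python
import time
import numpy as np
import onnxruntime as ort

def bench(model_path, feed, n=50):
    sess = ort.InferenceSession(
        model_path, providers=["CUDAExecutionProvider"])
    sess.run(None, feed)  # warm-up run
    t0 = time.perf_counter()
    for _ in range(n):
        sess.run(None, feed)
    return (time.perf_counter() - t0) / n  # average seconds per run

# Placeholder input name and shape; adjust to the model under test.
x32 = np.random.rand(1, 3, 224, 224).astype(np.float32)
x16 = x32.astype(np.float16)

print("fp32 avg s/run:", bench("model_fp32.onnx", {"input": x32}))
print("fp16 avg s/run:", bench("model_fp16.onnx", {"input": x16}))
```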
Urgency
Inference speed is very slow when using fp16.
Platform
Linux
OS Version
18
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
onnxruntime-gpu 1.7.0
ONNX Runtime API
Python
Architecture
X86
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.0
Model File
No response
Is this a quantized model?
No