Open CapJunkrat opened 3 weeks ago
Results can't be bit-identical because of parallelism. The same operations are performed in CUDA and on the CPU, but not necessarily in the same order, and floating-point addition is not associative. The difference should be much smaller for very small tensors, where there is little parallelism, but the operations may still run in a slightly different order.
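The point above can be seen in plain Python: floating-point addition is not associative, so two evaluation orders of the same sum (e.g. the reduction inside a convolution) can round differently. A minimal sketch:

```python
import numpy as np

# IEEE-754 addition is not associative: the grouping changes the result.
a = (0.1 + 0.2) + 0.3   # one accumulation order
b = 0.1 + (0.2 + 0.3)   # a different accumulation order
print(a == b)           # False: the two orders round differently
print(a, b)

# The same effect at tensor scale: a strict left-to-right float32 sum
# versus NumPy's (differently ordered) reduction of the same data.
x = np.random.default_rng(0).standard_normal(100_000).astype(np.float32)
serial = np.float32(0)
for v in x:             # strictly sequential accumulation
    serial += v
print(serial, x.sum())  # typically differ in the low-order digits
```

A parallel (CPU-threaded or CUDA) reduction partitions the sum differently again, which is why the CPU and CUDA providers agree only to within a few ulps, not bitwise.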
Describe the issue
I tried using CPUExecutionProvider and CUDAExecutionProvider to run inference on the same model containing a single Conv node, and it turns out the results do not match beyond 4 decimal places. I'm wondering if this is expected and normal.
To reproduce
To generate this model:
To run inference and compare:
Urgency
No response
Platform
Linux
OS Version
CentOS 7 & Ubuntu 22.04
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.16.3
ONNX Runtime API
Python
Architecture
X64
Execution Provider
Default CPU, CUDA
Execution Provider Library Version
CUDA 11.8