CC @hariharans29 @snnn @skottmckay. Can you help here?
Hello Team. This ticket has been open for a while now. Can someone share some insights on it? :)
CC'd a few folks who might be able to help here: @tianleiwu @souptc @jslhcl
@fj-y-saito, you can build from source and enable node output dumping: https://onnxruntime.ai/docs/build/inferencing.html#debugnodeinputsoutputs By comparing node outputs (redirect stdout to a file, then use Visual Studio Code to compare the files from the two hosts), you can find the first node whose output differs.
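A minimal sketch of dumping node outputs from the Python API, assuming onnxruntime was built with node I/O debugging enabled per the link above (the model path and input shape are placeholders):

```python
import os

# Requires a build with node input/output debugging enabled
# (see the debugnodeinputsoutputs section of the build docs above).
os.environ["ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA"] = "1"

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("yolox.onnx", providers=["CUDAExecutionProvider"])

# Node outputs are printed to stdout during run(); redirect to a file
# (e.g. `python repro.py > x86_dump.txt`) and diff the dumps from both hosts.
x = np.zeros((1, 3, 640, 640), dtype=np.float32)  # placeholder input
sess.run(None, {sess.get_inputs()[0].name: x})
```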
There are many reasons that could cause different outputs; for example, convolution benchmarking might choose different algorithms in different sessions.
You can evaluate with relevance metrics (like classification precision/recall) to see whether a difference at the fifth decimal digit has an actual impact on relevance.
Thank you for your reply.
I tried using the ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA environment variable to check the output from each node, and I found that the differing output starts at a convolution layer. So I think the difference comes from the point you mentioned:
"convolution benchmarking might choose different algorithms in different sessions."
I'm curious how onnxruntime chooses different convolution algorithms in different sessions. If you know, could you explain it to me?
@tianleiwu Is there any update on this?
@fj-y-saito, convolution benchmarking means that ORT runs different convolution algorithms and chooses the fastest one. If two algorithms are very close in performance, the one chosen is not deterministic. Many factors could impact the result, including the cuDNN version, CUDA driver version, CPU, OS, or even the temperature.
Another issue is that some cuDNN algorithms are not deterministic. For example, PyTorch has a flag, torch.backends.cudnn.deterministic, to filter out non-deterministic algorithms. ORT currently does not have such a filtering flag.
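For reference, a small sketch of how that PyTorch flag is used (not applicable to ORT, just to illustrate the kind of filtering that is missing):

```python
import torch

# Restrict cuDNN to deterministic algorithms and disable the benchmarking
# search; onnxruntime has no equivalent filtering flag at the moment.
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
```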
You may try setting cudnn_conv_algo_search to DEFAULT to see whether it helps: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#cudnn_conv_algo_search See the Examples section on that page for example code, and the sketch below. Since DEFAULT tries fewer algorithms, it might exclude some that are non-deterministic, but it could also be slower.
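A minimal sketch of setting this provider option through the Python API (the model path is a placeholder):

```python
import onnxruntime as ort

# DEFAULT restricts the cuDNN algorithm search to a fixed choice instead of
# the EXHAUSTIVE benchmarking search; fewer candidate algorithms may mean
# fewer non-deterministic ones, at a possible cost in speed.
providers = [
    ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"}),
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("yolox.onnx", providers=providers)
```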
@tianleiwu I tried setting cudnn_conv_algo_search to DEFAULT and got the result below.
According to this, I think the difference is caused by CUDA. I want to write a simple CUDA convolution program to check whether this problem comes from onnxruntime or from CUDA, so I would like to know which CUDA APIs onnxruntime uses. If possible, please point me to the source code lines that configure this algorithm through the CUDA API in onnxruntime.
@fj-y-saito, you can use Nsight profiling to see which CUDA kernels are used. By comparing x86 and ARM profiling results, you can find out whether a different kernel is called.
The convolution code can be found here. It is essentially front-end code; the back end is cuDNN.
I think profiling is your best bet.
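A sketch of one way to drive the profiler, assuming Nsight Systems (nsys) is installed and repro.py is your reproduction script; run it on both hosts and compare which cuDNN kernels appear in the two reports:

```python
import subprocess

# Wrap the reproduction script with Nsight Systems and write a named report;
# the output name and `repro.py` are placeholders.
subprocess.run(
    ["nsys", "profile", "-o", "x86_report", "python", "repro.py"],
    check=True,
)
```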
@tianleiwu I ran Nsight profiling and found that the convolution algorithm differs between x86 and ARM.
So this problem probably comes from CUDA choosing different algorithms.
Thank you very much for your support. I will close this issue.
Describe the issue
When running YOLOX on onnxruntime, different values are output from two systems. One system consists of an x86 CPU + A100 GPU; the other consists of an Arm CPU + A100 GPU.
Query 1: The mismatch between the two results starts at most at the fifth digit after the decimal point. Is this margin of error acceptable for onnxruntime, even though we are using the exact same GPU for inference and only a different CPU?
Query 2: Should this output mismatch be fixed?
To reproduce
This is one of the return values in my environment.
x86-CPU + A100-GPU
Arm-CPU + A100-GPU
This is the reproduction program.
The test program uses the following models: YOLOX-ONNX-TFLite-Sample/yolox and YOLOX-ONNX-TFLite-Sample/model. Run the attached Python program in the same directory. https://github.com/Kazuhito00/YOLOX-ONNX-TFLite-Sample
tp.zip
Reproduction Environment
Run in a docker container set up in the following environment
Machines Used
Other Environment
The Python version is 3.10.12. The Docker image is nvcr.io/nvidia/pytorch:23.06-py3.
onnxruntime Build Options
In order to use the A100, the following was added to CMakeLists.txt (https://github.com/microsoft/onnxruntime/blob/v1.15.1/cmake/CMakeLists.txt#L1284).
I checked out the v1.15.1 tag of onnxruntime and specified the following build options:
[Arm-CPU + A100-GPU]
[x86-CPU + A100-GPU]
Urgency
This is blocking for us.
Platform
Linux
OS Version
Ubuntu 22.04.3 LTS
ONNX Runtime Installation
Built from Source
ONNX Runtime Version or Commit ID
baeece44ba075009c6bfe95891a8c1b3d4571cb3
ONNX Runtime API
Python
Architecture
ARM64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 12.1