microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

ONNXRT produces different outputs with CUDA EP on A100 GPU when used with different hosts (x86 CPU and ARM CPU). #18725

Closed fj-y-saito closed 7 months ago

fj-y-saito commented 9 months ago

Describe the issue

When running YOLOX on onnxruntime, different values are output by the two systems. One system consists of an x86 CPU + A100 GPU. The other system consists of an Arm CPU + A100 GPU.

Query 1: The mismatch between the outputs starts at the fifth digit after the decimal point at most. Is this margin of error acceptable for onnxruntime even though we are running inference on the exact same GPU and only the CPU differs?

Query 2: Can this output mismatch issue be fixed?

To reproduce

This is one of the return values in my environment.

x86-CPU + A100

========== http://images.cocodataset.org/val2017/000000039769.jpg ==========
# bboxes

[[322.23031616  32.08716965 610.45410156 383.88278198]
 [  0.          51.8800354  332.015625   404.17974854]
 [  0.           1.69433606 637.11358643 464.43606567]]

# score

[0.52616191 0.52030879 0.50786453]

Arm-CPU + A100-GPU

========== http://images.cocodataset.org/val2017/000000039769.jpg ==========
# bboxes

[[322.22915649  32.08049011 610.46142578 383.88363647]
 [  0.          51.87286377 332.02368164 404.17492676]
 [  0.           1.69342053 637.13439941 464.42224121]]

# score

[0.52607542 0.51986152 0.50808865]
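For reference, the size of the mismatch can be quantified with a short NumPy check (a sketch only, not part of the attached program; the values are copied from the bbox outputs above):

import numpy as np

bboxes_x86 = np.array([[322.23031616, 32.08716965, 610.45410156, 383.88278198],
                       [0.0, 51.8800354, 332.015625, 404.17974854],
                       [0.0, 1.69433606, 637.11358643, 464.43606567]])
bboxes_arm = np.array([[322.22915649, 32.08049011, 610.46142578, 383.88363647],
                       [0.0, 51.87286377, 332.02368164, 404.17492676],
                       [0.0, 1.69342053, 637.13439941, 464.42224121]])

diff = np.abs(bboxes_x86 - bboxes_arm)
print("max abs diff:", diff.max())  # largest absolute deviation between the two runs
print("max rel diff:", (diff / np.maximum(np.abs(bboxes_x86), 1e-6)).max())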

This is the reproduction program.

The test program uses the following models: YOLOX-ONNX-TFLite-Sample/yolox and YOLOX-ONNX-TFLite-Sample/model. Run the attached Python program in the same directory. https://github.com/Kazuhito00/YOLOX-ONNX-TFLite-Sample

tp.zip

Reproduction Environment

Run in a Docker container set up in the following environment.

Machines Used

CPU                                        GPU
Neoverse-N1                                NVIDIA A100
Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz   NVIDIA A100

Other Environment

The Python version is 3.10.12. The Docker image is nvcr.io/nvidia/pytorch:23.06-py3.

onnxruntime Build Options

In order to use A100, the following was added to CMakeLists.txt. (https://github.com/microsoft/onnxruntime/blob/v1.15.1/cmake/CMakeLists.txt#L1284)

set(CMAKE_CUDA_FLAGS "${CMAKE_CUDA_FLAGS} -gencode=arch=compute_80,code=sm_80") # for A100

Checked out and used the v1.15.1 tag of onnxruntime. I specified the following build options:

[Arm-CPU + A100-GPU]

--build_wheel
--use_cuda
--cudnn_home /usr/lib/aarch64-linux-gnu
--cuda_home /usr/local/cuda
--parallel
--config Release
--cmake_extra_defines CMAKE_OSX_ARCHITECTURES=arm64
--allow_running_as_root
--skip_tests

[x86-CPU + A100-GPU]

--build_wheel
--use_cuda
--cudnn_home /usr/lib/x86_64-linux-gnu
--cuda_home /usr/local/cuda
--parallel
--config Release
--allow_running_as_root
--skip_tests

Urgency

This is blocking for us.

Platform

Linux

OS Version

Ubuntu 22.04.3 LTS

ONNX Runtime Installation

Built from Source

ONNX Runtime Version or Commit ID

baeece44ba075009c6bfe95891a8c1b3d4571cb3

ONNX Runtime API

Python

Architecture

ARM64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 12.1

maajidkhann commented 9 months ago

CC @hariharans29 @snnn @skottmckay. Can you help here?

maajidkhann commented 8 months ago

Hello team. This ticket has been open for a while now. Can someone share some insights on it? :)

CC'd a few folks who might help here: @tianleiwu @souptc @jslhcl

tianleiwu commented 8 months ago

@fj-y-saito, you can build from source and enable node output dumping: https://onnxruntime.ai/docs/build/inferencing.html#debugnodeinputsoutputs By comparing node outputs (redirect stdout to a file, then use Visual Studio Code to compare the files from the two hosts), you can find the first node whose output differs.
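A minimal sketch of that workflow from Python, assuming onnxruntime was built with the debug node I/O option described in the link above; the model filename and dummy input are placeholders, and only the ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA variable name comes from this thread:

import os
os.environ["ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA"] = "1"  # dump each node's output tensors

import numpy as np
import onnxruntime as ort

# "yolox_nano.onnx" is a placeholder for whichever model you actually run.
sess = ort.InferenceSession("yolox_nano.onnx", providers=["CUDAExecutionProvider"])
inp = sess.get_inputs()[0]
dummy = np.zeros([d if isinstance(d, int) else 1 for d in inp.shape], dtype=np.float32)
sess.run(None, {inp.name: dummy})

# Run this once on each host, redirecting stdout to a file
# (e.g. "python dump_nodes.py > x86.txt" and "> arm.txt"),
# then diff the two files to find the first node whose output differs.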

There are many reasons that could cause different outputs. For example, convolution benchmarking might choose different algorithms in different sessions.

You can also evaluate with relevance metrics (like classification precision/recall) to see whether a difference at the fifth digit has any actual impact on relevance.
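As a quick illustration of such a check (a sketch only; the 0.5 confidence threshold is an assumed value, and the scores are copied from the outputs posted above):

import numpy as np

scores_x86 = np.array([0.52616191, 0.52030879, 0.50786453])
scores_arm = np.array([0.52607542, 0.51986152, 0.50808865])
threshold = 0.5  # assumed confidence threshold

# If the kept/discarded decisions match, the small numeric difference has no
# effect on which detections survive thresholding.
print("same detections kept:", np.array_equal(scores_x86 >= threshold,
                                              scores_arm >= threshold))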

fj-y-saito commented 8 months ago

Thank you for your reply.

I tried using the ORT_DEBUG_NODE_IO_DUMP_OUTPUT_DATA environment variable and checked the output of each node. I found that the differing output starts at a convolution layer, so I think the difference is caused by the point you mentioned:

convolution benchmarking might choose different algorithms in different sessions.

I'm curious how onnxruntime ends up choosing different convolution algorithms in different sessions. If you know, could you explain it to me?

fj-y-saito commented 8 months ago

@tianleiwu Is there any update on this?

tianleiwu commented 7 months ago

@fj-y-saito, convolution benchmarking means that it runs different convolution algorithms and chooses the fastest one. If two algorithms are very close in performance, the one chosen is not deterministic. There are many factors that could impact the results, including the cuDNN version, CUDA driver version, CPU, OS, or even the temperature.

Another issue is that some cuDNN algorithms are not deterministic. For example, PyTorch has a flag torch.backends.cudnn.deterministic to filter out non-deterministic algorithms. ORT currently does not have such a filtering flag.
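For reference, the PyTorch flag mentioned above is set like this (PyTorch-only; shown just to illustrate what such a filter does, it does not affect ORT):

import torch

torch.backends.cudnn.deterministic = True  # restrict cuDNN to deterministic algorithms
torch.backends.cudnn.benchmark = False     # disable benchmarking-based algorithm selection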

You may try setting cudnn_conv_algo_search to DEFAULT to see whether it helps: https://onnxruntime.ai/docs/execution-providers/CUDA-ExecutionProvider.html#cudnn_conv_algo_search See the Examples section on that page for example code. Since DEFAULT considers fewer algorithms, it might exclude some that are non-deterministic, but it could also be slower.
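A minimal sketch of setting that option through the Python API, following the Examples section of the linked page (the model path is a placeholder):

import onnxruntime as ort

providers = [
    ("CUDAExecutionProvider", {"cudnn_conv_algo_search": "DEFAULT"}),  # skip exhaustive benchmarking
    "CPUExecutionProvider",
]
sess = ort.InferenceSession("yolox_nano.onnx", providers=providers)  # placeholder model path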

fj-y-saito commented 7 months ago

@tianleiwu I tried setting cudnn_conv_algo_search to DEFAULT and got the results below:

  1. There are still differences between x86 and ARM results despite using the same GPU
  2. With this option set, Onnxruntime now always returns the same value on x86

According to this, I think the difference is caused by CUDA. I want to write a simple CUDA convolution program to check whether this problem comes from onnxruntime or from CUDA, so I would like to know which CUDA API onnxruntime uses. If possible, please point me to the source code lines in onnxruntime that configure this algorithm through the CUDA API.

tianleiwu commented 7 months ago

@fj-y-saito, you can use Nsight profiling to see which CUDA kernel is used. By comparing the x86 and ARM profiling results, you can find out whether a different kernel is called.

The convolution code can be found here. It is essentially front-end code; the back end is cuDNN.

I think profiling is your best bet.

fj-y-saito commented 7 months ago

@tianleiwu I ran Nsight profiling and found that the convolution algorithm differs between x86 and ARM.

So this problem probably comes from CUDA choosing a different algorithm.

Thank you very much for your support. I will close this issue.