microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

GPU inference result not stable #13178

Open xiaowuhu opened 1 year ago

xiaowuhu commented 1 year ago

Describe the issue

The first-party model is provided in the internal email.

Running inference on both GPU and CPU produces the following output:

GPU inference 0= [array([[0.46005446, 0.53994554]], dtype=float32)]
GPU inference 1= [array([[0.46167108, 0.53832895]], dtype=float32)]

CPU inference 0= [array([[0.45498496, 0.545015 ]], dtype=float32)]
CPU inference 1= [array([[0.45498496, 0.545015 ]], dtype=float32)]

Expected: the two GPU inference results should be identical, just like the CPU results.
Actual: the difference between the two GPU runs is large.

To reproduce

import numpy as np
import onnxruntime as ort

def inference(model_path, input_feed, providers=None):
    # Create a fresh session per call so every run starts from the same state.
    sess = ort.InferenceSession(model_path, None, providers=providers)
    output = sess.run(None, input_feed)
    return output

# batch_size=1 inputs. Feeds must be NumPy arrays; int64 is assumed here,
# the usual dtype for transformer tokenizer outputs.
ort_inputs_bs1 = {
    "input_ids": np.array([
        [0,1069,2858,264,37610,45667,67,2,2,2858,264,37610,45667,67,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
    ], dtype=np.int64),
    "attention_mask": np.array([
        [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    ], dtype=np.int64),
    "token_type_ids": np.array([
        [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    ], dtype=np.int64),
}

# Run the same input twice on the GPU, then twice on the CPU.
for i in range(2):
    res = inference("model/graph.onnx", ort_inputs_bs1, ['CUDAExecutionProvider'])
    print("GPU inference %d=" % i, res)

for i in range(2):
    res = inference("model/graph.onnx", ort_inputs_bs1, ['CPUExecutionProvider'])
    print("CPU inference %d=" % i, res)

Urgency

ASAP

Platform

Linux

OS Version

Ubuntu 20

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.4

yetingqiaqia commented 1 year ago

Hi ORT team, is there any update? This is blocking our users. Thanks.

Here is the model path: https://drive.google.com/file/d/1MbmTOLvr5U-RbZ08rJxf6E16Eg-4GUx_/view?usp=sharing
• To test, simply run "python fp16_convert.py".
• To test with a different batch_size, change the inputs within fp16_convert.py. In the test file, I created 4 inputs with different batch sizes: batch_size = 1, 2, 8, and 32.
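The exact contents of fp16_convert.py are not included here. For context, a float16 conversion of an ONNX model is commonly done with onnxconverter_common, roughly as sketched below; the input and output paths are placeholders, and this is only an assumption about what the script does:

import onnx
from onnxconverter_common import float16

# Load the float32 model and convert its weights/ops to float16.
model = onnx.load("model/graph.onnx")            # placeholder path
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model/graph_fp16.onnx")   # placeholder path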