microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

GPU inference result not stable #13178

Open xiaowuhu opened 1 year ago

xiaowuhu commented 1 year ago

Describe the issue

The first-party model is provided in the internal email.

Running inference on both GPU and CPU produces the following output:

GPU inference 0= [array([[0.46005446, 0.53994554]], dtype=float32)]
GPU inference 1= [array([[0.46167108, 0.53832895]], dtype=float32)]

CPU inference 0= [array([[0.45498496, 0.545015 ]], dtype=float32)]
CPU inference 1= [array([[0.45498496, 0.545015 ]], dtype=float32)]

Expected: the two GPU inference results should be identical, just like the CPU results.
Actual: the difference between the two GPU runs is large.

To reproduce

import numpy as np
import onnxruntime as ort

def inference(model_path, input_feed, providers=None):
    # Create a fresh session per call so every run starts from the same state.
    sess = ort.InferenceSession(model_path, None, providers=providers)
    output = sess.run(None, input_feed)
    return output

# batch_size=1 inputs. Feeds must be NumPy arrays; int64 is assumed here,
# the usual dtype for transformer tokenizer outputs.
ort_inputs_bs1 = {
    "input_ids": np.array([
        [0,1069,2858,264,37610,45667,67,2,2,2858,264,37610,45667,67,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1]
    ], dtype=np.int64),
    "attention_mask": np.array([
        [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    ], dtype=np.int64),
    "token_type_ids": np.array([
        [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
    ], dtype=np.int64),
}

# Run the same input twice on the GPU, then twice on the CPU.
for i in range(2):
    res = inference("model/graph.onnx", ort_inputs_bs1, ['CUDAExecutionProvider'])
    print("GPU inference %d=" % i, res)

for i in range(2):
    res = inference("model/graph.onnx", ort_inputs_bs1, ['CPUExecutionProvider'])
    print("CPU inference %d=" % i, res)

Urgency

ASAP

Platform

Linux

OS Version

Ubuntu 20

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.4

yetingqiaqia commented 1 year ago

Hi ORT team, is there any update? This is blocking our users. Thanks.

Here is the model path: https://drive.google.com/file/d/1MbmTOLvr5U-RbZ08rJxf6E16Eg-4GUx_/view?usp=sharing
• To test, simply run "python fp16_convert.py".
• To test with a different batch_size, change the inputs within fp16_convert.py. In the test file, I created 4 inputs with different batch sizes: batch_size = 1, 2, 8, and 32.
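The exact contents of fp16_convert.py are not included here. For context, a float16 conversion of an ONNX model is commonly done with onnxconverter_common, roughly as sketched below; the input and output paths are placeholders, and this is only an assumption about what the script does:

import onnx
from onnxconverter_common import float16

# Load the float32 model and convert its weights/ops to float16.
model = onnx.load("model/graph.onnx")            # placeholder path
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "model/graph_fp16.onnx")   # placeholder path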