microsoft / onnxruntime

ONNX Runtime: cross-platform, high performance ML inferencing and training accelerator
https://onnxruntime.ai
MIT License

[Performance] inference problems with io_binding: unexpected shape or unexpected data type #14998

Open koukan3 opened 1 year ago

koukan3 commented 1 year ago

Describe the issue

Hi, I split the bart model into two modules, an encoder and a decoder, and exported each to an ONNX model. The encoder's InferenceSession runs with IO binding; the code snippet is:

encoder_session_binding.bind_cpu_input('input_ids', input_ids)
encoder_session_binding.bind_output('hidden_states', device)
encoder_session.run_with_iobinding(encoder_session_binding)
ret = encoder_session_binding.get_outputs()

The service is launched with Flask, and everything is fine when requests are sent one at a time. When I use JMeter for performance stress testing, once the number of threads increases, unexpected errors occur randomly on some input data.

File "/opt/lib/python3.8/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (((null))) , expected: ((tensor(float16)))

[E:onnxruntime:, sequential_executor.cc:368 Execute] Non-zero status code returned while running Add node. Name:'Add_234' Status Message: Add_234: Left operand cannot broadcast on dim 3 LeftShape: {1,16,1,7}, RightShape: {1,1,1,12}

When I test the failing inputs one by one, there are no errors and the inference outputs are correct.

I read the answers in a related issue; it seems that IO binding is a blocking call. If cross-device copies have not completed, will unexpected null data be returned? Why does the exception only happen with multiple threads?

To reproduce

encoder_session_binding.bind_cpu_input('input_ids', input_ids)
encoder_session_binding.bind_output('hidden_states', device)
encoder_session.run_with_iobinding(encoder_session_binding)
ret = encoder_session_binding.get_outputs()

Urgency

Yes

Platform

Linux

OS Version

2.0

ONNX Runtime Installation

Released Package

ONNX Runtime Version or Commit ID

1.12.1

ONNX Runtime API

Python

Architecture

X64

Execution Provider

CUDA

Execution Provider Library Version

CUDA 11.4

Model File

No response

Is this a quantized model?

No

tianleiwu commented 1 year ago

In https://github.com/microsoft/onnxruntime/issues/11133#issuecomment-1335999268 there is a comment that says: if you have inputs on the CPU and want them to be on the GPU prior to calling Run, you need to bind each input and then call SynchronizeBoundInputs. Otherwise, you will encounter a data race.

@pranavsharma, @faxu, we need to update the API documentation to include this in the examples. The first example in the "Data on device" section of https://onnxruntime.ai/docs/api/python/api_summary.html does not synchronize inputs. The IOBinding API documentation also lacks descriptions of synchronize_inputs and other newer functions that exist in the source, such as get_outputs_as_ortvaluevector and clear_binding_inputs.

pranavsharma commented 1 year ago

Also check out https://onnxruntime.ai/docs/performance/tune-performance.html#iobinding