Open koukan3 opened 1 year ago
In [microsoft/onnxruntime#11133](https://github.com/microsoft/onnxruntime/issues/11133#issuecomment-1335999268) there is a comment stating that if you have inputs on the CPU and want them on the GPU prior to calling `Run`, you need to bind each input and then call `SynchronizeBoundInputs`. Otherwise, you will run into a data race.
@pranavsharma, @faxu, we need to update the API documentation to reflect this in the examples. The first example in the "Data on device" section of https://onnxruntime.ai/docs/api/python/api_summary.html does not synchronize inputs, and the `IOBinding` API reference has no descriptions of `synchronize_inputs` or the other newer functions that exist in the source, such as `get_outputs_as_ortvaluevector` and `clear_binding_inputs`.
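For reference, here is a sketch of what a synchronized version of that docs example could look like. This is an illustration, not the official example: it assumes an existing CUDA `InferenceSession` (`session`), a NumPy `input_ids` array, and the `synchronize_inputs`/`synchronize_outputs` methods available on `IOBinding` in recent releases; the tensor names are taken from the snippet below.

```python
def run_with_sync(session, input_ids):
    """Hypothetical IOBinding flow with explicit synchronization."""
    binding = session.io_binding()

    # Stage the CPU input; the host-to-device copy may be asynchronous.
    binding.bind_cpu_input('input_ids', input_ids)
    # Let ORT allocate the output on the CUDA device.
    binding.bind_output('hidden_states', 'cuda')

    # Wait for pending input copies to finish before launching the run,
    # avoiding the data race described in the linked comment.
    binding.synchronize_inputs()
    session.run_with_iobinding(binding)
    # Wait for device-side work on the outputs before reading them.
    binding.synchronize_outputs()

    return binding.get_outputs()
```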
Describe the issue
Hi, I split the BART model into two modules, an encoder and a decoder, and exported both to ONNX. The encoder's InferenceSession runs with IOBinding; the code snippet is as follows:
```python
encoder_session_binding.bind_cpu_input('input_ids', input_ids)
encoder_session_binding.bind_output('hidden_states', device)
encoder_session.run_with_iobinding(encoder_session_binding)
ret = encoder_session_binding.get_outputs()
```
The service is served with Flask, and everything works when requests are sent one at a time. However, when I use JMeter for performance stress testing and increase the number of threads, unexpected errors occur randomly on some inputs. Re-testing those same inputs one by one produces no errors, and the inference outputs are correct.
Reading the answers in the linked issue, it seems the cross-device copies behind IOBinding are not blocking: when those copies have not completed, can unexpected null data be returned? And why does the exception only happen with multiple threads?
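One possible explanation for the multi-thread-only failures (an assumption on my part, not confirmed in the thread): the snippet above appears to reuse a single module-level `encoder_session_binding` across all Flask worker threads. The `InferenceSession` itself can be shared, but an `IOBinding` holds mutable per-call state, so concurrent requests can rebind each other's inputs and outputs mid-run. A hedged sketch of a per-request binding, with the synchronization calls added; `encode` and its parameters are illustrative names:

```python
def encode(encoder_session, input_ids, device='cuda'):
    """Hypothetical request-local IOBinding: each call gets fresh state,
    so concurrent Flask threads cannot clobber each other's bindings."""
    binding = encoder_session.io_binding()   # fresh, request-local
    binding.bind_cpu_input('input_ids', input_ids)
    binding.bind_output('hidden_states', device)
    binding.synchronize_inputs()             # finish async H2D copies
    encoder_session.run_with_iobinding(binding)
    binding.synchronize_outputs()            # finish before reading
    return binding.get_outputs()
```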
To reproduce
```python
encoder_session_binding.bind_cpu_input('input_ids', input_ids)
encoder_session_binding.bind_output('hidden_states', device)
encoder_session.run_with_iobinding(encoder_session_binding)
ret = encoder_session_binding.get_outputs()
```
Urgency
Yes
Platform
Linux
OS Version
2.0
ONNX Runtime Installation
Released Package
ONNX Runtime Version or Commit ID
1.12.1
ONNX Runtime API
Python
Architecture
X64
Execution Provider
CUDA
Execution Provider Library Version
CUDA 11.4
Model File
No response
Is this a quantized model?
No