triton-inference-server / onnxruntime_backend

The Triton backend for the ONNX Runtime.
BSD 3-Clause "New" or "Revised" License

[Question] Multiple model inputs and GPU allocations #269

Open msyulia opened 2 months ago

msyulia commented 2 months ago

Hi!

I wasn't sure whether to file this as a bug report or whether this is working as intended.

I'm currently facing an issue where a model with up to a hundred inputs, deployed via the Triton ONNX backend, shows a relatively high nv_inference_compute_input_duration_us. From my understanding, this metric also includes the time spent copying tensor data to the GPU. Is it possible that each input results in a separate GPU allocator call?

From what I see in ModelInstanceState::SetInputTensors (https://github.com/triton-inference-server/onnxruntime_backend/blob/main/src/onnxruntime.cc#L2273), inputs are processed sequentially and each input results in a call to CreateTensorWithDataAsOrtValue. Is it possible that this leads to separate GPU allocations and copies per input, and therefore a long nv_inference_compute_input_duration_us? Or does copying tensor data to the GPU happen before a request is passed to the ONNX backend?
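
For reference, here is a minimal, self-contained sketch of the per-input pattern I'm asking about, using the ONNX Runtime C++ API (where Ort::Value::CreateTensor wraps CreateTensorWithDataAsOrtValue). This is not the backend's actual code; the model path, input/output names, shapes, and buffers are hypothetical placeholders, and it only illustrates one tensor-wrapping call per input inside a sequential loop:

```cpp
// Sketch only: one OrtValue per input, created in a loop, mirroring the
// structure of ModelInstanceState::SetInputTensors. Names, shapes, and the
// model path are hypothetical.
#include <onnxruntime_cxx_api.h>

#include <string>
#include <vector>

int main() {
  Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "sketch");
  Ort::SessionOptions opts;
  Ort::Session session(env, "model.onnx", opts);  // hypothetical model path

  // Hypothetical: many inputs, each with its own host-side buffer.
  const std::vector<std::string> input_names = {"input_0", "input_1" /* ... */};
  const std::vector<int64_t> shape = {1, 16};
  std::vector<std::vector<float>> host_buffers(
      input_names.size(), std::vector<float>(16, 0.0f));

  // One CreateTensor (i.e. CreateTensorWithDataAsOrtValue) call per input.
  // Wrapping CPU memory here means any copy to the GPU is left to the
  // runtime at Run() time.
  Ort::MemoryInfo cpu_info =
      Ort::MemoryInfo::CreateCpu(OrtArenaAllocator, OrtMemTypeDefault);
  std::vector<Ort::Value> input_values;
  std::vector<const char*> input_name_ptrs;
  for (size_t i = 0; i < input_names.size(); ++i) {
    input_values.push_back(Ort::Value::CreateTensor<float>(
        cpu_info, host_buffers[i].data(), host_buffers[i].size(),
        shape.data(), shape.size()));
    input_name_ptrs.push_back(input_names[i].c_str());
  }

  const char* output_names[] = {"output"};  // hypothetical output name
  auto outputs = session.Run(
      Ort::RunOptions{nullptr}, input_name_ptrs.data(), input_values.data(),
      input_values.size(), output_names, 1);
  return 0;
}
```

My question is essentially whether, in the backend's version of this loop, each of those per-input calls can translate into its own GPU allocation and host-to-device copy.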