I wasn't sure whether to file this as a bug or whether it works as intended.
I'm currently facing an issue where a model with up to a hundred inputs, deployed via the Triton ONNX backend, shows a relatively high nv_inference_compute_input_duration_us, which, as I understand it, also includes copying tensor data to the GPU. Is it possible that each input results in a separate GPU allocator call?
From what I see in ModelInstanceState::SetInputTensors (https://github.com/triton-inference-server/onnxruntime_backend/blob/main/src/onnxruntime.cc#L2273), inputs are processed sequentially and each input results in a call to CreateTensorWithDataAsOrtValue. Is it possible that this leads to separate GPU allocations and copies per input, and therefore a long nv_inference_compute_input_duration_us? Or does the copy of tensor data to the GPU happen before the request is passed to the ONNX backend?
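For context, here is a minimal sketch of the per-input pattern I'm describing, not the actual backend code: a sequential loop where each input buffer is wrapped into its own OrtValue via the ONNX Runtime C API. The buffer/shape parameters and the assumption that the data already sits in a GPU buffer are mine, for illustration only.

```cpp
// Simplified illustration (not the real SetInputTensors implementation) of
// binding N inputs one at a time with CreateTensorWithDataAsOrtValue.
#include <onnxruntime_c_api.h>

#include <cstdint>
#include <vector>

void BindInputsSequentially(
    const OrtApi* ort_api, const OrtMemoryInfo* gpu_memory_info,
    const std::vector<void*>& input_buffers,  // assumed: one device buffer per input
    const std::vector<std::vector<int64_t>>& input_shapes,
    const std::vector<size_t>& input_byte_sizes,
    std::vector<OrtValue*>& ort_values /* out */)
{
  // With ~100 inputs this loop runs ~100 times, so any per-input copy or
  // allocation cost would add up in nv_inference_compute_input_duration_us.
  for (size_t i = 0; i < input_buffers.size(); ++i) {
    OrtValue* value = nullptr;
    // CreateTensorWithDataAsOrtValue wraps an existing buffer as an OrtValue;
    // whether the gather/copy to GPU that precedes this happens once per
    // input is exactly what I'm asking about.
    OrtStatus* status = ort_api->CreateTensorWithDataAsOrtValue(
        gpu_memory_info, input_buffers[i], input_byte_sizes[i],
        input_shapes[i].data(), input_shapes[i].size(),
        ONNX_TENSOR_ELEMENT_DATA_TYPE_FLOAT, &value);
    if (status != nullptr) {
      ort_api->ReleaseStatus(status);
      return;  // error handling elided in this sketch
    }
    ort_values.push_back(value);
  }
}
```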