triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Handling Unsupported Input and Ensuring GPU Processing in Triton Inference Server #7365

Open Bycqg opened 1 month ago

Bycqg commented 1 month ago

I have configured an ensemble model in Triton Inference Server, which includes DALI preprocessing and TensorRT inference. When I uploaded a GIF image from the client, the Triton server crashed with the error "current pipeline object is no longer valid. killed" because DALI does not support GIF decoding. How can I prevent Triton from shutting down, and instead catch the exception and return a proper error response?
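For the first question, the Python backend lets a model return a per-request error instead of raising: a minimal sketch of a `model.py` that wraps preprocessing in try/except and answers failed requests with `pb_utils.TritonError`. The `decode_and_preprocess()` helper is hypothetical (it stands in for the DALI work), and the `pb_utils` import is guarded so the sketch can be read outside a Triton container.

```python
# Sketch: per-request error handling in a Triton python_backend model.py.
# decode_and_preprocess() is a hypothetical helper standing in for the
# actual DALI preprocessing; pb_utils only exists inside the Triton
# container, so the import is guarded here.
try:
    import triton_python_backend_utils as pb_utils
except ImportError:
    pb_utils = None  # running outside Triton; sketch only

class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            try:
                output = decode_and_preprocess(request)  # hypothetical helper
                responses.append(
                    pb_utils.InferenceResponse(output_tensors=[output])
                )
            except Exception as exc:
                # Return an error response for this request only, instead of
                # letting the exception take down the server process.
                responses.append(
                    pb_utils.InferenceResponse(
                        output_tensors=[],
                        error=pb_utils.TritonError(
                            f"preprocessing failed: {exc}"
                        ),
                    )
                )
        return responses
```

With this pattern an undecodable GIF produces an error response for that one request while the server keeps serving the batch.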

On a related note, I want to ensure that all data processing remains on the GPU throughout the entire pipeline (i.e., data processed by DALI on the GPU should not be transferred back to the CPU before being passed to TensorRT for inference). I believe that keeping the data on the GPU will be more efficient. Is this possible, and if so, how can it be achieved?
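On the GPU question, the Python backend documents a `FORCE_CPU_ONLY_INPUT_TENSORS` model parameter; setting it to `"no"` allows Triton to deliver input tensors to the Python model on the GPU rather than copying them to the CPU first. A minimal `config.pbtxt` fragment (the model name is a placeholder):

```
name: "preprocess"   # placeholder model name
backend: "python"
parameters: {
  key: "FORCE_CPU_ONLY_INPUT_TENSORS"
  value: { string_value: "no" }
}
```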

The version I am using is: Release 2.29.0 corresponding to NGC container 22.12.

Bycqg commented 1 month ago

Initially, I used dali.py and configured a dali_backend model in the ensemble model to preprocess images. However, with this configuration, if I uploaded a file in an incorrect format (e.g., GIF), DALI could not decode it, leading to the Triton service being killed.

Now, I have switched to using python_backend and wrote DALI preprocessing in model.py, using try...except to handle exceptions and prevent the Triton service from being killed. However, in this process, I found that the pb_utils.Tensor() function only supports parameters of type (str, numpy.ndarray). This forces me to transfer DALI GPU data back to the CPU. My intention was to directly transfer data from DALI GPU to TensorRT GPU, as I believe this would be more efficient. I would like to ask if DALI GPU data must be transferred back to the CPU before being passed on. If not, how can I achieve this (preferably with code examples or documentation)?