triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html

Feature Questions #7244

Open cha-noong opened 1 month ago

cha-noong commented 1 month ago

Since Jetson supports Triton Inference Server, I am considering adopting it and have a few questions.

  1. In an environment where multiple AI models run on Jetson, is there any advantage to using Triton Inference Server over running them individually with TensorRT? (Triton Inference Server's queuing/scheduling optimizations vs. the gRPC communication latency added over localhost)

  2. It appears that system shared memory and CUDA shared memory are both offered as ways to reduce localhost communication latency. What is the difference between the two? (The linked document seems to describe them as the same feature: https://docs.nvidia.com/deeplearning/triton-inference-server/archives/triton_inference_server_1140/user-guide/docs/client_example.html)

  3. System shared memory has been confirmed to work, but CUDA shared memory produces an error like the one in the issue below. Does Jetson currently support CUDA shared memory? (https://github.com/triton-inference-server/server/issues/5798)

GuanLuo commented 1 month ago
  1. In the case of serving multiple models, Triton provides the benefit of serving those models concurrently, and you can configure each model separately depending on your use case (see the config sketch after this list). Triton also supports popular machine learning frameworks in case your models are not all in TensorRT. Another benefit is that serving a model with TensorRT directly requires writing additional code against the TensorRT APIs, which Triton already does for you, so deploying a model through Triton should take less effort.
  2. System shared memory is for sharing CPU memory between processes, and CUDA shared memory is for GPU memory. Usually you want the data to be stored close to the device the model runs on, so you would look at CUDA shared memory if your model is deployed on a GPU (a client-side sketch of the shared memory flow is below).
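
For concreteness, here is a minimal per-model configuration sketch showing the knobs mentioned in point 1; the model name `detector_trt`, the instance count, and the queue delay are illustrative placeholders, not recommendations:

```
# config.pbtxt for one TensorRT model served by Triton
name: "detector_trt"
platform: "tensorrt_plan"
max_batch_size: 8

# Run two execution instances of this model on the GPU so requests
# can be processed concurrently.
instance_group [
  {
    kind: KIND_GPU
    count: 2
  }
]

# Let Triton queue and batch incoming requests for up to 100 us
# before launching an inference.
dynamic_batching {
  max_queue_delay_microseconds: 100
}
```

With a config like this, Triton keeps two execution instances of the model resident and queues/batches incoming requests, which is the scheduling behavior referred to in question 1.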
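
And a minimal sketch of the client-side system shared memory flow with the Triton Python HTTP client (`tritonclient[http]`); the model name `my_model`, the tensor names `INPUT0`/`OUTPUT0`, and the tensor shape are placeholder assumptions. CUDA shared memory follows the same pattern through `tritonclient.utils.cuda_shared_memory`, registering the region with `register_cuda_shared_memory` instead:

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
from tritonclient.utils import np_to_triton_dtype

client = httpclient.InferenceServerClient(url="localhost:8000")

# Create a system shared memory region and copy the input tensor into it.
input_data = np.ones((1, 16), dtype=np.float32)
byte_size = input_data.size * input_data.itemsize
shm_handle = shm.create_shared_memory_region("input_region", "/input_shm", byte_size)
shm.set_shared_memory_region(shm_handle, [input_data])

# Register the region with Triton, then point the request input at it instead
# of sending the tensor bytes over the localhost connection.
client.register_system_shared_memory("input_region", "/input_shm", byte_size)
infer_input = httpclient.InferInput(
    "INPUT0", list(input_data.shape), np_to_triton_dtype(input_data.dtype)
)
infer_input.set_shared_memory("input_region", byte_size)

response = client.infer("my_model", inputs=[infer_input])
print(response.as_numpy("OUTPUT0"))

# Clean up the region when done.
client.unregister_system_shared_memory("input_region")
shm.destroy_shared_memory_region(shm_handle)
```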