triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Is there any plan to open source Inflight Batching for LLM Serving? #6358

Open liuyang-my opened 10 months ago

liuyang-my commented 10 months ago

We are using Triton Inference Server for model inference and are currently facing throughput bottlenecks with LLM inference. I saw in a public video that NVIDIA has optimized LLM serving by supporting in-flight batching on Triton Inference Server. I would like to know whether there is a possibility of open sourcing this technology. If there is no plan to open source it, how can we achieve similar functionality in the existing Triton Python backend?


Additionally, both TGI and vLLM have implemented continuous batching, and we have also considered implementing continuous batching in the Triton Python backend. However, the cost of making such changes is high. So, what are the best practices for implementing LLM batching on Triton?

nnshah1 commented 10 months ago

A few points here:

1) We have a tutorial for using vLLM with Triton (https://github.com/triton-inference-server/tutorials/blob/main/Quick_Deploy/vLLM/README.md). The integration supports continuous batching. We plan further enhancements to the tutorial as well.

2) We are in the process of developing a backend for TRT-LLM (https://developer.nvidia.com/blog/nvidia-tensorrt-llm-supercharges-large-language-model-inference-on-nvidia-h100-gpus/) which will have native inflight batching support.

3) In the existing Python backend you can implement continuous batching strategies by using decoupled mode and spawning an internal execution thread that manages the in-flight requests while new requests continue to be accepted. This is similar to the strategy used in the vLLM integration (a minimal sketch of this pattern follows this list).

4) We also welcome feedback / design thoughts on implementing additional batching strategies in the core. We are actively investigating how to extend the existing schedulers to support such strategies but don't have a plan to share yet.
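
To make point 3 concrete, here is a minimal sketch of that pattern, assuming a model configured with `model_transaction_policy { decoupled: true }`. The tensor names (`prompt`, `text_output`) and the `_step` generation routine are placeholders for illustration, not part of the Triton API; only the `pb_utils` calls and the `TritonPythonModel` entry points are real.

```python
# model.py -- sketch of continuous batching in the Python backend (decoupled mode).
import threading
import queue

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        self._new_requests = queue.Queue()
        self._shutdown = threading.Event()
        # Background thread that owns the in-flight batch and keeps stepping it
        # while execute() continues to accept new requests.
        self._worker = threading.Thread(target=self._batching_loop, daemon=True)
        self._worker.start()

    def execute(self, requests):
        # Decoupled mode: hand each request and its response sender to the
        # scheduler thread and return immediately without a response list.
        for request in requests:
            self._new_requests.put((request, request.get_response_sender()))
        return None

    def _batching_loop(self):
        inflight = []  # (sender, state) pairs currently being decoded
        while not self._shutdown.is_set():
            # 1) Admit newly arrived requests into the in-flight batch.
            while True:
                try:
                    request, sender = self._new_requests.get_nowait()
                except queue.Empty:
                    break
                prompt = pb_utils.get_input_tensor_by_name(request, "prompt")
                inflight.append((sender, {"prompt": prompt.as_numpy(), "step": 0}))

            if not inflight:
                self._shutdown.wait(timeout=0.01)
                continue

            # 2) Run one decoding step for every in-flight sequence and stream
            #    the new token back through each request's response sender.
            still_running = []
            for sender, state in inflight:
                token, finished = self._step(state)  # placeholder generation step
                out = pb_utils.Tensor("text_output", np.array([token], dtype=object))
                response = pb_utils.InferenceResponse(output_tensors=[out])
                if finished:
                    # Final response for this sequence; closes the stream.
                    sender.send(response, flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
                else:
                    sender.send(response)
                    still_running.append((sender, state))
            inflight = still_running

    def _step(self, state):
        # Placeholder for a real model forward pass; emits a fake token and
        # stops after a fixed number of steps.
        state["step"] += 1
        return f"token_{state['step']}".encode(), state["step"] >= 8

    def finalize(self):
        self._shutdown.set()
        self._worker.join()
```

In a real deployment the per-step work would be a single model forward pass over the whole in-flight batch rather than a per-request loop; batching that forward pass is where the throughput gain of continuous batching comes from.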

nnshah1 commented 3 months ago

An update I wanted to share: we now have support for iterative sequences within the sequence batcher. This provides basic support for batching new and in-progress sequences together at each iteration of a generative sequence:

https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_7-iterative_scheduling

https://github.com/triton-inference-server/server/blob/44bc109df0e780d050856bb58fbbfba9476e9f26/docs/user_guide/model_configuration.md#iterative-sequences
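
For reference, a minimal model-configuration fragment for enabling this, based on the linked model configuration guide; treat the exact field names as an assumption and confirm them against the guide:

```
# config.pbtxt fragment (sketch; field names per the iterative-sequences section of the linked guide)
sequence_batching {
  iterative_sequence : true
}
```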

Any feedback is welcome as we are still experimenting with the best ways to generalize and showcase how iterative sequences can be used for generative models.