triton-inference-server / tensorrtllm_backend

The Triton TensorRT-LLM Backend
Apache License 2.0

Inflight Batching via Python Client #306

Closed hackassin closed 2 months ago

hackassin commented 8 months ago

Hi Team,

Any updates on Inflight Batching support with Triton via Python client?

Thanks!

byshiue commented 8 months ago

It is not supported yet. We are working on it. There will also be a standard solution to integrate TRT-LLM with the Python backend of Triton soon.

lyc728 commented 8 months ago

> It is not supported yet. We are working on it. There will also be a standard solution to integrate TRT-LLM with the Python backend of Triton soon.

Could you tell us when it will be available? I'm looking forward to it. Thanks!

pcastonguay commented 7 months ago

Just to be clear, you can deploy the C++ TRT-LLM backend (with in-flight batching capabilities) in Triton and then use a Python client to send requests to it. Example clients can be found in https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/inflight_batcher_llm/client.
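As a rough illustration of that pattern, here is a minimal sketch of a Python client that streams a request to the in-flight batching `tensorrt_llm` model over gRPC using `tritonclient`. The tensor names (`input_ids`, `input_lengths`, `request_output_len`), the model name, and the server URL are assumptions based on the example model configuration in this repo; the real example clients linked above are the authoritative reference.

```python
# Hedged sketch of a streaming Triton client for the in-flight batching
# tensorrt_llm model. Assumes the text is already tokenized and that the
# model exposes the tensor names used in this repo's example config.
import queue

import numpy as np


def build_inputs(input_ids, output_len):
    """Pack token IDs into (name, array) pairs for the tensorrt_llm model.

    Tensor names and shapes are assumptions taken from the example
    model configuration, not a guaranteed interface.
    """
    ids = np.array([input_ids], dtype=np.int32)            # [1, seq_len]
    lengths = np.array([[len(input_ids)]], dtype=np.int32)  # [1, 1]
    out_len = np.array([[output_len]], dtype=np.int32)      # [1, 1]
    return [
        ("input_ids", ids),
        ("input_lengths", lengths),
        ("request_output_len", out_len),
    ]


def stream_request(url, input_ids, output_len=64):
    """Send one request over Triton's decoupled/streaming gRPC API and
    collect the responses. Requires `pip install tritonclient[grpc]` and
    a running Triton server with the tensorrt_llm model loaded."""
    import tritonclient.grpc as grpcclient  # imported lazily; optional dep

    responses = queue.Queue()

    def on_result(result, error):
        # The server may return multiple responses per request when
        # streaming tokens; collect them (or the error) as they arrive.
        responses.put(error if error is not None else result)

    inputs = []
    for name, arr in build_inputs(input_ids, output_len):
        t = grpcclient.InferInput(name, arr.shape, "INT32")
        t.set_data_from_numpy(arr)
        inputs.append(t)

    with grpcclient.InferenceServerClient(url) as client:
        client.start_stream(callback=on_result)
        client.async_stream_infer("tensorrt_llm", inputs)
        client.stop_stream()  # blocks until all responses have arrived

    return list(responses.queue)
```

Because the server batches in flight, many such clients can stream requests concurrently and the backend interleaves them; each client only sees its own responses via its callback.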

We have also added a Python BLS backend that can be used to implement more complex logic when orchestrating the preprocessor, the TRT-LLM backend, and the postprocessor. See https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm_bls/1/model.py

If you are asking about a Python backend with the same functionality as the C++ TRT-LLM backend in https://github.com/triton-inference-server/tensorrtllm_backend/tree/main/all_models/inflight_batcher_llm/tensorrt_llm, that is still under development. We are actively working on it but cannot commit to a date yet.

lly-zero-one commented 3 months ago

Hi, any update on the Python backend? It could be a blocker for adopting some JSON-formatting packages.

pcastonguay commented 3 months ago

The Python backend for in-flight batching can be found here: https://github.com/triton-inference-server/tensorrtllm_backend/blob/main/all_models/inflight_batcher_llm/tensorrt_llm/1/model.py

It uses the Python bindings to the C++ executor API.