triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

[RFE] HandleGenerate equivalent for sagemaker_server.cc #7151

Open billcai opened 2 months ago

billcai commented 2 months ago

Is your feature request related to a problem? Please describe. At present, text generation is supported by http_server.cc but not by sagemaker_server.cc; this was verified using the vLLM backend with Triton server. http_server.cc supports it by implementing HandleGenerate, which allows the use of decoupled models (which vLLM backend models are).

Describe the solution you'd like Implement the equivalent of HandleGenerate for sagemaker_server.cc

Describe alternatives you've considered Using alternative servers (like DJLServing) with vLLM/TensorRT-LLM or different stacks (e.g. HuggingFace TGI)

Elaborating on this further: certain backends (e.g. vLLM) currently run only with the decoupled model transaction policy. The inference function in sagemaker_server.cc checks for this and fails any call to a model that uses the decoupled model transaction policy.
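For illustration, a check of roughly this shape is what rejects such requests. This is only a sketch written against the public tritonserver.h C API, not code copied from sagemaker_server.cc, and CheckNotDecoupled is a hypothetical helper name:

```cpp
// Illustrative sketch only (based on the public tritonserver.h C API), not the
// actual code in sagemaker_server.cc or http_server.cc.
#include "triton/core/tritonserver.h"

// Hypothetical helper: returns an error if the model uses the decoupled
// transaction policy, mirroring the check described above.
TRITONSERVER_Error*
CheckNotDecoupled(
    TRITONSERVER_Server* server, const char* model_name, int64_t model_version)
{
  uint32_t txn_flags = 0;
  TRITONSERVER_Error* err = TRITONSERVER_ServerModelTransactionProperties(
      server, model_name, model_version, &txn_flags, nullptr /* voidp */);
  if (err != nullptr) {
    return err;
  }
  if ((txn_flags & TRITONSERVER_TXN_DECOUPLED) != 0) {
    // This is the rejection path that decoupled (e.g. vLLM) models hit today.
    return TRITONSERVER_ErrorNew(
        TRITONSERVER_ERROR_UNSUPPORTED,
        "inference on models with decoupled transaction policy is not supported");
  }
  return nullptr;  // model uses the default one-to-one transaction policy
}
```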

http_server.cc, on the other hand, has several inference handlers. HandleInfer performs the same decoupled transaction policy check and fails if the model uses that policy, whereas HandleGenerate does not, since it is designed for text generation. Hence, I am seeking advice/assistance on implementing a HandleGenerate equivalent for sagemaker_server.cc.
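To make the difference concrete, below is a minimal hedged sketch of the piece a HandleGenerate-style handler needs and HandleInfer lacks: a response callback that accepts multiple responses per request and only completes the request on the FINAL flag. The GenerateState struct and the surrounding handling are hypothetical placeholders, not code from the Triton repository; only the callback signature and the completion flag come from the tritonserver.h C API.

```cpp
#include <iostream>
#include <string>

#include "triton/core/tritonserver.h"

// Hypothetical per-request state; a real handler would hold the HTTP
// connection and output buffers needed to build the SageMaker response.
struct GenerateState {
  std::string accumulated_output;
};

// Signature matches TRITONSERVER_InferenceResponseCompleteFn_t, which is what
// TRITONSERVER_InferenceRequestSetResponseCallback expects.
static void
GenerateResponseComplete(
    TRITONSERVER_InferenceResponse* response, const uint32_t flags, void* userp)
{
  auto* state = reinterpret_cast<GenerateState*>(userp);

  // Decoupled backends such as vLLM may invoke this callback many times per
  // request, and the final invocation may carry a null response.
  if (response != nullptr) {
    // A real handler would read the output tensors here and append the decoded
    // text to state->accumulated_output before releasing the response.
    TRITONSERVER_InferenceResponseDelete(response);
  }

  // Only the invocation carrying the FINAL flag completes the request. A
  // one-to-one handler like HandleInfer assumes exactly one response, which is
  // why it rejects decoupled models up front.
  if ((flags & TRITONSERVER_RESPONSE_COMPLETE_FINAL) != 0) {
    std::cout << "request complete, reply with: " << state->accumulated_output
              << std::endl;
    delete state;
  }
}
```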

rmccorm4 commented 2 months ago

Hi @billcai, thanks for raising this request! CC @nskool

kayalvizhi-kandasamy commented 1 month ago

One of our customers is interested in adopting this integration, and we would like to know whether it has been tagged for any milestone or release. Thanks.