triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

vLLM/OpenAI Compatible Endpoint #6968

Open Elsayed91 opened 6 months ago

Elsayed91 commented 6 months ago

Is your feature request related to a problem? Please describe.
The vLLM backend works well and is easy to set up, compared to TensorRT, which had me pulling my hair out.

However, it lacks the OpenAI-compatible endpoint that ships with vLLM itself.

The /generate endpoint on its own requires work to set up for chat applications (work that I honestly don't know how to do).
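
For context, this is roughly what a client has to do against the raw generate endpoint today. A minimal sketch, not an official example: the model name `vllm_model` and the chat template string are made-up placeholders, since prompt formatting is model-specific and entirely up to the caller:

```python
import requests

# The raw generate endpoint takes a flat text prompt, so the client must apply
# the model's chat template itself. This template string is a hypothetical
# example; real templates vary per model.
prompt = "<|user|>\nWhat is Triton?\n<|assistant|>\n"

resp = requests.post(
    "http://localhost:8000/v2/models/vllm_model/generate",  # model name is a placeholder
    json={"text_input": prompt, "parameters": {"stream": False, "max_tokens": 128}},
)
print(resp.json()["text_output"])
```

vLLM's own server accepts structured chat messages at /v1/chat/completions and applies the template for you, which is exactly the part that is missing here.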

In essence, just by adopting Triton's vLLM backend instead of vLLM itself, you have to develop classes and interfaces for all of these things yourself.

Not to mention that LangChain has no Triton LLM implementation, and LlamaIndex's is a bit primitive, undocumented, and buggy.

Describe the solution you'd like
Expose vLLM's OpenAI-compatible endpoint alongside Triton's existing endpoints.

Additional context
It would be wonderful if this existed as a feature for all backends, but for now, using vLLM's implementation as a reference is probably the best starting point. Relevant references:

- https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/serving_chat.py
- https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/openai/api_server.py
- https://github.com/npuichigo/openai_trtllm/tree/main
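
To make the request concrete, here is a minimal sketch of the kind of shim being asked for (similar in spirit to openai_trtllm): a small FastAPI proxy that accepts OpenAI-style chat requests and forwards them to Triton's generate endpoint. Everything here is illustrative rather than a proposed implementation; the model name, port, and naive prompt assembly are assumptions:

```python
from fastapi import FastAPI
from pydantic import BaseModel
import httpx

app = FastAPI()
# Placeholder Triton generate endpoint; the model name is an assumption.
TRITON_URL = "http://localhost:8000/v2/models/vllm_model/generate"

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    model: str
    messages: list[Message]
    max_tokens: int = 256

@app.post("/v1/chat/completions")
async def chat_completions(req: ChatRequest):
    # Naive prompt assembly; a real shim would apply the model's chat template
    # and support streaming, stop sequences, usage accounting, etc.
    prompt = "\n".join(f"{m.role}: {m.content}" for m in req.messages) + "\nassistant:"
    async with httpx.AsyncClient() as client:
        r = await client.post(
            TRITON_URL,
            json={"text_input": prompt, "parameters": {"max_tokens": req.max_tokens}},
        )
    text = r.json()["text_output"]
    return {
        "object": "chat.completion",
        "model": req.model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }
```

Having this logic live in Triton itself (ideally for all backends) is the actual feature request; the shim above just shows how much glue every user currently has to write on their own.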

lkomali commented 5 months ago

@Elsayed91 I filed a feature request with the team: DLIS-6323.

gongyifeiisme commented 5 months ago

The lack of OpenAI-style support made me abandon it outright.

panpan0000 commented 4 months ago

Any update or progress on this?

nnshah1 commented 4 months ago

@panpan0000, @Elsayed91: is the goal improved integration with LlamaIndex/LangChain, or direct support for an OpenAI-compatible endpoint?

Would support via the Python in-process API be sufficient, or is a C/C++ implementation required?
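
For context on what either option would unlock: once an OpenAI-compatible endpoint exists, off-the-shelf clients (and therefore LangChain and LlamaIndex, which already speak the OpenAI protocol) can point at Triton with no custom integration. A sketch with the openai Python client; the port and model name are placeholder assumptions:

```python
from openai import OpenAI

# Point the standard OpenAI client at a hypothetical Triton-hosted endpoint.
client = OpenAI(base_url="http://localhost:9000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="vllm_model",  # placeholder model name
    messages=[{"role": "user", "content": "What is Triton?"}],
)
print(resp.choices[0].message.content)
```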

panpan0000 commented 3 months ago

> @panpan0000, @Elsayed91: is the goal improved integration with LlamaIndex/LangChain, or direct support for an OpenAI-compatible endpoint?
>
> Would support via the Python in-process API be sufficient, or is a C/C++ implementation required?

Sorry @nnshah1, I don't quite understand what you mean. This similar issue may help clarify: https://github.com/triton-inference-server/server/issues/6583