Open leiwen83 opened 7 months ago
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
What do you think about the option of spinning up multiple vLLM instances, with a router like LiteLLM that orchestrates the vLLM endpoints behind a single, fully OpenAI-compatible API?
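A minimal sketch of that setup, assuming two vLLM replicas of one model fronted by LiteLLM's `Router` (the model name, ports, and routing strategy below are placeholders, not anything vLLM ships):

```python
from litellm import Router

# Two replica entries share one public alias ("llama-3-8b"); each api_base
# points at a separate vLLM OpenAI-compatible server started beforehand, e.g.
#   python -m vllm.entrypoints.openai.api_server --model <model> --port 8001
model_list = [
    {
        "model_name": "llama-3-8b",  # alias clients will request
        "litellm_params": {
            "model": "openai/meta-llama/Meta-Llama-3-8B-Instruct",
            "api_base": "http://localhost:8001/v1",
            "api_key": "dummy",  # placeholder; vLLM requires no key by default
        },
    },
    {
        "model_name": "llama-3-8b",  # second replica of the same alias
        "litellm_params": {
            "model": "openai/meta-llama/Meta-Llama-3-8B-Instruct",
            "api_base": "http://localhost:8002/v1",
            "api_key": "dummy",
        },
    },
]

router = Router(model_list=model_list, routing_strategy="simple-shuffle")

resp = router.completion(
    model="llama-3-8b",
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```

The same `model_list` can also be put in a YAML config and served with LiteLLM's standalone proxy, which gives clients a single OpenAI-compatible URL.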
🚀 The feature, motivation and pitch
In production scenarios, multi-model registration is a needed feature: it supports auto-scaling, model updates, and centralized service dispatch behind a fixed URL. Previously we used FastChat with vLLM, and it served our purpose well.
But vLLM is now expanding rapidly in its LLM support (images, video, etc.), and its engine args keep growing to cover various needs; FastChat's OpenAI-compatible interface does not seem able to keep pace with the changes on the vLLM side.
So shall we consider hosting something like FastChat's controller feature, where model workers are loosely coupled with the controller, can dynamically register with and leave the controller's backend, and the controller chooses the best route for a given prompt request?
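To make the idea concrete, here is a rough sketch of what such a controller could look like, assuming each worker is an ordinary vLLM OpenAI-compatible server; the `/register` and `/unregister` routes, their payloads, and the round-robin policy are invented for illustration and are not an existing vLLM or FastChat API.

```python
import itertools
from collections import defaultdict

import httpx
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

# model name -> list of worker base URLs, e.g. "http://10.0.0.5:8001/v1"
workers: dict[str, list[str]] = defaultdict(list)
_rr = defaultdict(itertools.count)  # per-model round-robin counters


class WorkerInfo(BaseModel):
    model: str
    url: str


@app.post("/register")
def register(info: WorkerInfo):
    # Called by a worker on startup (or periodically as a heartbeat).
    if info.url not in workers[info.model]:
        workers[info.model].append(info.url)
    return {"ok": True}


@app.post("/unregister")
def unregister(info: WorkerInfo):
    # Called on graceful shutdown; a real controller would also evict
    # workers whose heartbeats stop arriving.
    if info.url in workers[info.model]:
        workers[info.model].remove(info.url)
    return {"ok": True}


@app.post("/v1/chat/completions")
async def chat_completions(request: dict):
    # Pick a worker for the requested model (round-robin here; a "best
    # route" policy could look at queue depth, KV-cache usage, etc.)
    # and forward the unmodified OpenAI-style request body.
    model = request.get("model")
    backends = workers.get(model)
    if not backends:
        raise HTTPException(status_code=404, detail=f"no worker serves {model!r}")
    target = backends[next(_rr[model]) % len(backends)]
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(f"{target}/chat/completions", json=request)
    return resp.json()
```

A worker would POST its model name and base URL on startup (and again as a heartbeat), and the controller forwards client requests to whichever backend serves the requested model; streaming, health checks, and smarter routing are left out of this sketch.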
Alternatives
I'm not sure whether there is another OpenAI API server that handles this loosely coupled controller/worker mode well while also keeping in sync with vLLM's rapidly changing API.
Additional context
No response