vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Support Multiple Models #299

Closed · aldrinc closed this issue 3 hours ago

aldrinc commented 1 year ago
zhuohan123 commented 1 year ago

For the first and second feature requests, why don't you just kill the old server and start a new one with the new model?

aldrinc commented 1 year ago

That's what we are doing now, but it takes a long time to download and load large models (33B/65B). For a train-to-deploy pipeline, it isn't possible to have zero downtime without multiple servers and a blue-green deployment strategy.

zhuohan123 commented 1 year ago

The downloading should only happen the very first time a model is used. However, the loading cost is unavoidable. Are you looking for something that can swap the model with zero downtime?

gesanqiu commented 1 year ago

I think multi-model support is important for business logic such as ensemble models and LangChain applications. Do you have any ideas I can reference? Then I will try to implement it in vLLM.

ft-algo commented 11 months ago

Is there any progress on this feature?

shixianc commented 10 months ago

+1

Ki6an commented 10 months ago

+1

corticalstack commented 7 months ago

+1. For enterprises, instead of one monolithic, API-based LLM like GPT-4, the strategy may be a collection of SLMs dedicated to or fine-tuned for specific tasks. This is why they will want to serve multiple models simultaneously, e.g. phi-2/Mistral-7B/Yi-34B. Could we have an update on this feature request, please?

capybarahero commented 6 months ago

+1

chanchimin commented 4 months ago

+1. What is the progress on this feature?

lenartgolob commented 3 months ago

+1

Shamepoo commented 3 months ago

+1

mjtechguy commented 2 months ago

+1

jvlinsta commented 2 months ago

+1

mohittalele commented 2 months ago

+1

ptrmayer commented 2 months ago

+1

tarson96 commented 2 months ago

++++1

ptrmayer commented 2 months ago

+1

Luffyzm3D2Y commented 2 months ago

+1

srzer commented 2 months ago

+1 desperately need multiple models

lizhipengpeng commented 1 month ago

+1

amitm02 commented 1 month ago

+1

naturomics commented 1 month ago

+1

servient-ashwin commented 1 month ago

What is the general thought process or strategy for implementing something like this, if it is open for contributions? Does the vLLM team, or anyone else, have a roadmap or implementation idea that someone can pick up?

This would be highly useful. All I can quickly think of is how TensorFlow lets you specify the fraction of memory used, similar to the GPU utilization setting here in vLLM, or using subprocesses within Python (although I must admit I do not understand much about this topic in general).
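
For illustration, here is a minimal sketch of that subprocess idea: one OpenAI-compatible vLLM server per model on the same GPU, with `--gpu-memory-utilization` capping each engine's share of VRAM (roughly the TensorFlow memory-fraction analogy). The model names, ports, and memory fractions below are illustrative assumptions, not recommendations.

```python
# Sketch only: launch one vLLM OpenAI-compatible server per model and split the
# GPU memory between them. Model names, ports, and fractions are assumptions.
import subprocess
import sys

MODELS = [
    ("microsoft/phi-2", 8000, 0.45),
    ("mistralai/Mistral-7B-Instruct-v0.2", 8001, 0.45),
]

procs = []
for model, port, mem_frac in MODELS:
    procs.append(subprocess.Popen([
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(mem_frac),  # fraction of GPU memory this engine may use
    ]))

# Each server exposes the usual /v1 endpoints on its own port.
for p in procs:
    p.wait()
```

Whether two engines actually fit side by side depends on the models and the GPU, so treat the fractions as a starting point.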

zr-idol commented 4 weeks ago

+1

Sk4467 commented 4 weeks ago

One workaround that vLLM already enables is using Docker containers: use Docker Compose to serve multiple models, so that two or more models can be served at a time, depending on your memory capacity.
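
A compose file is the natural way to express this; as a rough Python equivalent (a swapped-in technique, using the Docker SDK for Python rather than Docker Compose), something like the following could start one container per model. The image tag, model names, ports, and memory fractions are assumptions.

```python
# Sketch only: one vLLM container per model via the Docker SDK for Python.
# Image tag, model names, ports, and memory fractions are assumptions.
import docker

client = docker.from_env()

MODELS = [
    ("microsoft/phi-2", 8000, 0.45),
    ("mistralai/Mistral-7B-Instruct-v0.2", 8001, 0.45),
]

for model, host_port, mem_frac in MODELS:
    client.containers.run(
        "vllm/vllm-openai:latest",
        command=[
            "--model", model,
            "--gpu-memory-utilization", str(mem_frac),
        ],
        ports={"8000/tcp": host_port},  # the server listens on 8000 inside the container
        device_requests=[
            docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]]),  # expose all GPUs
        ],
        detach=True,
    )
```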

ws-chenc commented 3 weeks ago

+++1

chen-j-ing commented 3 weeks ago

+1+1

ArtificialEU commented 3 weeks ago

So I can have a complex vLLM setup with batching and all, but can't change the current model except by spinning up a new container? 😅

pingbowen23 commented 3 weeks ago

+1

ismaslov commented 3 weeks ago

+1

JoursBleu commented 1 week ago

+1

nathan-weinberg commented 1 week ago

+1

kglmcodes commented 4 hours ago

+1

youkaichao commented 3 hours ago

vLLM does not support multiple models.

For anyone who is interested, please spin up multiple vLLM instances and use a router like LiteLLM (https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model) to route among the vLLM instances.
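
As a concrete illustration of that suggestion, here is a minimal sketch using LiteLLM's Python `Router` (the proxy-config approach in the linked docs achieves the same thing) to route requests across two locally running vLLM OpenAI-compatible servers. The ports, model names, and aliases are assumptions; vLLM ignores the API key unless the servers were started with `--api-key`.

```python
# Sketch only: route across two vLLM OpenAI-compatible servers with LiteLLM's Router.
# Ports, model names, and aliases are assumptions.
from litellm import Router

router = Router(model_list=[
    {
        "model_name": "phi-2",  # alias that clients will request
        "litellm_params": {
            "model": "openai/microsoft/phi-2",  # treat the backend as OpenAI-compatible
            "api_base": "http://localhost:8000/v1",
            "api_key": "dummy",
        },
    },
    {
        "model_name": "mistral-7b",
        "litellm_params": {
            "model": "openai/mistralai/Mistral-7B-Instruct-v0.2",
            "api_base": "http://localhost:8001/v1",
            "api_key": "dummy",
        },
    },
])

response = router.completion(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Reusing the same `model_name` for several entries makes the Router load-balance across them, which is the pattern described in the linked LiteLLM docs.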