vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Support Multiple Models #299

Closed · aldrinc closed this issue 3 hours ago

aldrinc commented 1 year ago
zhuohan123 commented 1 year ago

For the first and second feature requests, why don't you just kill the old server and start a new one with the new model?

aldrinc commented 1 year ago

That's what we are doing now, but it takes a long time to download and load large models (33B/65B). For a train-to-deploy pipeline, it isn't possible to have zero downtime without multiple servers and a blue-green deployment strategy.

zhuohan123 commented 1 year ago

The downloading should only happen the very first time a model is used. However, the loading cost is unavoidable. Are you looking for something that can swap the model with zero downtime?

gesanqiu commented 1 year ago

I think multi-model support is important for business logic such as ensemble models and LangChain applications. Do you have any ideas I can reference? Then I will try to implement it in vLLM.

ft-algo commented 11 months ago

Is there any progress on this feature?

shixianc commented 10 months ago

+1

Ki6an commented 10 months ago

+1

corticalstack commented 7 months ago

+1. For enterprises, instead of one monolithic, API-based LLM like GPT-4, the strategy may be a collection of SLMs dedicated to or fine-tuned for specific tasks. This is why they will want to serve multiple models simultaneously, e.g. phi-2/Mistral-7B/Yi-34B. Could we have an update on this feature request, please?

capybarahero commented 6 months ago

+1

chanchimin commented 4 months ago

+1. What is the progress on this feature?

lenartgolob commented 3 months ago

+1

Shamepoo commented 3 months ago

+1

mjtechguy commented 2 months ago

+1

jvlinsta commented 2 months ago

+1

mohittalele commented 2 months ago

+1

ptrmayer commented 2 months ago

+1

tarson96 commented 2 months ago

++++1

ptrmayer commented 2 months ago

+1

Luffyzm3D2Y commented 2 months ago

+1

srzer commented 2 months ago

+1 desperately need multiple models

lizhipengpeng commented 1 month ago

+1

amitm02 commented 1 month ago

+1

naturomics commented 1 month ago

+1

servient-ashwin commented 1 month ago

What is the general thought process or strategy for implementing something like this, if it is open for contributions? Does the vLLM team, or anyone else, have a roadmap or implementation idea that someone can pick up?

This would be highly useful. All I can quickly think of is how TensorFlow lets you specify the fraction of memory used, similar to the GPU utilization setting here in vLLM, or using subprocesses within Python (although I must admit I do not understand much about this topic in general).
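
For illustration, here is a minimal sketch of that subprocess idea: one OpenAI-compatible vLLM server per model on the same GPU, with `--gpu-memory-utilization` capping each engine's share of VRAM (roughly the TensorFlow memory-fraction analogy). The model names, ports, and memory fractions below are illustrative assumptions, not recommendations.

```python
# Sketch only: launch one vLLM OpenAI-compatible server per model and split the
# GPU memory between them. Model names, ports, and fractions are assumptions.
import subprocess
import sys

MODELS = [
    ("microsoft/phi-2", 8000, 0.45),
    ("mistralai/Mistral-7B-Instruct-v0.2", 8001, 0.45),
]

procs = []
for model, port, mem_frac in MODELS:
    procs.append(subprocess.Popen([
        sys.executable, "-m", "vllm.entrypoints.openai.api_server",
        "--model", model,
        "--port", str(port),
        "--gpu-memory-utilization", str(mem_frac),  # fraction of GPU memory this engine may use
    ]))

# Each server exposes the usual /v1 endpoints on its own port.
for p in procs:
    p.wait()
```

Whether two engines actually fit side by side depends on the models and the GPU, so treat the fractions as a starting point.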

zr-idol commented 4 weeks ago

+1

Sk4467 commented 4 weeks ago

One workaround that vLLM already enables is using Docker containers: use Docker Compose to serve multiple models, so that two or more models can be served at a time, depending on your memory capacity.
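
A compose file is the natural way to express this; as a rough Python equivalent (a swapped-in technique, using the Docker SDK for Python rather than Docker Compose), something like the following could start one container per model. The image tag, model names, ports, and memory fractions are assumptions.

```python
# Sketch only: one vLLM container per model via the Docker SDK for Python.
# Image tag, model names, ports, and memory fractions are assumptions.
import docker

client = docker.from_env()

MODELS = [
    ("microsoft/phi-2", 8000, 0.45),
    ("mistralai/Mistral-7B-Instruct-v0.2", 8001, 0.45),
]

for model, host_port, mem_frac in MODELS:
    client.containers.run(
        "vllm/vllm-openai:latest",
        command=[
            "--model", model,
            "--gpu-memory-utilization", str(mem_frac),
        ],
        ports={"8000/tcp": host_port},  # the server listens on 8000 inside the container
        device_requests=[
            docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]]),  # expose all GPUs
        ],
        detach=True,
    )
```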

ws-chenc commented 3 weeks ago

+++1

chen-j-ing commented 3 weeks ago

+1+1

ArtificialEU commented 3 weeks ago

So I can have a complex vLLM setup with batching and all, but can't change the current model except by spinning up a new container? 😅

pingbowen23 commented 3 weeks ago

+1

ismaslov commented 3 weeks ago

+1

JoursBleu commented 1 week ago

+1

nathan-weinberg commented 1 week ago

+1

kglmcodes commented 4 hours ago

+1

youkaichao commented 3 hours ago

vLLM does not support multiple models.

For anyone who is interested, please spin up multiple vLLM instances and use a router like LiteLLM (https://docs.litellm.ai/docs/simple_proxy#load-balancing---multiple-instances-of-1-model) to route among the vLLM instances.
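
As a concrete illustration of that suggestion, here is a minimal sketch using LiteLLM's Python `Router` (the proxy-config approach in the linked docs achieves the same thing) to route requests across two locally running vLLM OpenAI-compatible servers. The ports, model names, and aliases are assumptions; vLLM ignores the API key unless the servers were started with `--api-key`.

```python
# Sketch only: route across two vLLM OpenAI-compatible servers with LiteLLM's Router.
# Ports, model names, and aliases are assumptions.
from litellm import Router

router = Router(model_list=[
    {
        "model_name": "phi-2",  # alias that clients will request
        "litellm_params": {
            "model": "openai/microsoft/phi-2",  # treat the backend as OpenAI-compatible
            "api_base": "http://localhost:8000/v1",
            "api_key": "dummy",
        },
    },
    {
        "model_name": "mistral-7b",
        "litellm_params": {
            "model": "openai/mistralai/Mistral-7B-Instruct-v0.2",
            "api_base": "http://localhost:8001/v1",
            "api_key": "dummy",
        },
    },
])

response = router.completion(
    model="mistral-7b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```

Reusing the same `model_name` for several entries makes the Router load-balance across them, which is the pattern described in the linked LiteLLM docs.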