vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: load/unload API to run multiple LLMs in a single GPU instance #5491

Open lizzzcai opened 4 months ago

lizzzcai commented 4 months ago

🚀 The feature, motivation and pitch

The feature request is to add support for a load/unload endpoint/API in vLLM to dynamically load and unload multiple LLMs within a single GPU instance. This feature aims to enhance resource utilization and scalability by allowing concurrent operation of multiple LLMs on the same GPU.

A load/unload endpoint in vLLM would make it possible to bring models up on demand and release their GPU memory once they are no longer needed, so that several models can share a single GPU without restarting the server.
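For illustration only, here is a sketch of what client-side usage could look like. The route names, payload fields, and port are hypothetical; vLLM does not expose any load/unload endpoints today.

```python
# Hypothetical client calls against the proposed endpoints.
# /v1/models/load and /v1/models/unload and their payload fields are
# invented here for illustration; they do not exist in vLLM.
import requests

BASE_URL = "http://localhost:8000"

# Ask the running server to load an additional model onto the same GPU.
requests.post(
    f"{BASE_URL}/v1/models/load",
    json={"model": "Qwen/Qwen2-7B-Instruct", "gpu_memory_utilization": 0.4},
)

# Later, release that model's GPU memory without restarting the server.
requests.post(
    f"{BASE_URL}/v1/models/unload",
    json={"model": "Qwen/Qwen2-7B-Instruct"},
)
```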

Alternatives

Alternatively, an API for manually unloading models would give finer-grained control over resource management.
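For reference, the closest thing available today is to tear a model down manually with the offline LLM class. The sketch below shows the commonly-cited cleanup sequence; it is a workaround, not a supported unload API, and the destroy_model_parallel import path has moved between vLLM versions. The model names and memory fractions are just examples.

```python
import gc
import torch
from vllm import LLM

# Load the first model and use it; cap its memory so a second model can fit later.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", gpu_memory_utilization=0.45)
print(llm.generate(["Hello"])[0].outputs[0].text)

# Commonly-cited manual teardown (a workaround, not a supported unload API).
# The import path below may differ in older vLLM releases.
from vllm.distributed.parallel_state import destroy_model_parallel
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()

# The freed memory can now host a different model in the same process.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", gpu_memory_utilization=0.45)
```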

Additional context

xansar commented 3 months ago

When TP=2, reloading a model via vLLM can trigger: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
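A sketch of the usual workaround, assuming the driver script controls process startup: force the 'spawn' start method before any CUDA work happens, and (in recent vLLM releases) ask vLLM's multiprocessing workers to spawn as well via the VLLM_WORKER_MULTIPROC_METHOD environment variable. The model name is only an example.

```python
import multiprocessing as mp
import os

# Ask vLLM's worker processes to use 'spawn' instead of 'fork'
# (VLLM_WORKER_MULTIPROC_METHOD is read by recent vLLM releases).
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def main() -> None:
    # Import vLLM (and touch CUDA) only inside the spawned entry point.
    from vllm import LLM

    llm = LLM(model="Qwen/Qwen2-7B-Instruct", tensor_parallel_size=2)
    print(llm.generate(["Hello"])[0].outputs[0].text)


if __name__ == "__main__":
    # Force 'spawn' so no child process inherits an initialized CUDA context.
    mp.set_start_method("spawn", force=True)
    main()
```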

bhavnicksm commented 2 months ago

When TP=2, reloading a model via vLLM can trigger: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I am getting this error without any reload; the very first load with TP=2 already fails with it on vLLM v0.5.3.post1.

icowan commented 2 months ago

When TP=2, reloading a model via vLLM can trigger: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I am getting this error without any reload; the very first load with TP=2 already fails with it on vLLM v0.5.3.post1.

Same here.

GPUs: A100 x 4
vLLM version: v0.5.3.post1
Model: Qwen2-7B-Instruct
CPU: 64 cores
Memory: 512 GB

etemiz commented 2 weeks ago

I will be another happy user if this is implemented.