vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Feature]: load/unload API to run multiple LLMs in a single GPU instance #5491

Open lizzzcai opened 4 months ago

lizzzcai commented 4 months ago

🚀 The feature, motivation and pitch

The feature request is to add support for a load/unload endpoint/API in vLLM to dynamically load and unload multiple LLMs within a single GPU instance. This feature aims to enhance resource utilization and scalability by allowing concurrent operation of multiple LLMs on the same GPU.

A load/unload endpoint in vLLM would make it possible to bring models up on demand and release their GPU memory once they are no longer needed, so that several models can share a single GPU without restarting the server.
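For illustration only, here is a sketch of what client-side usage could look like. The route names, payload fields, and port are hypothetical; vLLM does not expose any load/unload endpoints today.

```python
# Hypothetical client calls against the proposed endpoints.
# /v1/models/load and /v1/models/unload and their payload fields are
# invented here for illustration; they do not exist in vLLM.
import requests

BASE_URL = "http://localhost:8000"

# Ask the running server to load an additional model onto the same GPU.
requests.post(
    f"{BASE_URL}/v1/models/load",
    json={"model": "Qwen/Qwen2-7B-Instruct", "gpu_memory_utilization": 0.4},
)

# Later, release that model's GPU memory without restarting the server.
requests.post(
    f"{BASE_URL}/v1/models/unload",
    json={"model": "Qwen/Qwen2-7B-Instruct"},
)
```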

Alternatives

Alternatively, an API for manually unloading models would give finer-grained control over resource management.
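For reference, the closest thing available today is to tear a model down manually with the offline LLM class. The sketch below shows the commonly-cited cleanup sequence; it is a workaround, not a supported unload API, and the destroy_model_parallel import path has moved between vLLM versions. The model names and memory fractions are just examples.

```python
import gc
import torch
from vllm import LLM

# Load the first model and use it; cap its memory so a second model can fit later.
llm = LLM(model="Qwen/Qwen2-7B-Instruct", gpu_memory_utilization=0.45)
print(llm.generate(["Hello"])[0].outputs[0].text)

# Commonly-cited manual teardown (a workaround, not a supported unload API).
# The import path below may differ in older vLLM releases.
from vllm.distributed.parallel_state import destroy_model_parallel
destroy_model_parallel()
del llm
gc.collect()
torch.cuda.empty_cache()

# The freed memory can now host a different model in the same process.
llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct", gpu_memory_utilization=0.45)
```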

Additional context

xansar commented 3 months ago

When TP=2, reloading a model via vLLM can trigger: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
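A sketch of the usual workaround, assuming the driver script controls process startup: force the 'spawn' start method before any CUDA work happens, and (in recent vLLM releases) ask vLLM's multiprocessing workers to spawn as well via the VLLM_WORKER_MULTIPROC_METHOD environment variable. The model name is only an example.

```python
import multiprocessing as mp
import os

# Ask vLLM's worker processes to use 'spawn' instead of 'fork'
# (VLLM_WORKER_MULTIPROC_METHOD is read by recent vLLM releases).
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"


def main() -> None:
    # Import vLLM (and touch CUDA) only inside the spawned entry point.
    from vllm import LLM

    llm = LLM(model="Qwen/Qwen2-7B-Instruct", tensor_parallel_size=2)
    print(llm.generate(["Hello"])[0].outputs[0].text)


if __name__ == "__main__":
    # Force 'spawn' so no child process inherits an initialized CUDA context.
    mp.set_start_method("spawn", force=True)
    main()
```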

bhavnicksm commented 2 months ago

When TP=2, reloading a model via vLLM can trigger: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I am getting this error without any reload; the very first load with TP=2 already fails with it on vLLM v0.5.3.post1.

icowan commented 2 months ago

When TP=2, reloading a model via vLLM can trigger: RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method

I am getting this error without any reload; the very first load with TP=2 already fails with it on vLLM v0.5.3.post1.

Same here.

GPUs: A100 x 4
vLLM version: v0.5.3.post1
Model: Qwen2-7B-Instruct
CPU: 64 cores
Memory: 512 GB

etemiz commented 2 weeks ago

I will be another happy user if this is implemented.