Open lizzzcai opened 4 months ago
When TP=2, reloading a model via vLLM may trigger: `RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method`.
I am getting this error even without a reload: the very first load with TP=2 fails the same way on vLLM v0.5.3.post1.
Hitting this too.

gpus: A100 x 4
vllm version: v0.5.3.post1
model: Qwen2-7B-Instruct
cpu: 64 cores
memory: 512 GB
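For reference, the commonly suggested workaround for this error is to force the "spawn" start method before anything initializes CUDA. A minimal sketch, assuming a vLLM version that honors the `VLLM_WORKER_MULTIPROC_METHOD` environment variable (treat that as an assumption for older releases):

```python
# Sketch of the usual workaround: select "spawn" before any CUDA work.
# Whether this also fixes the reload case may depend on the vLLM version.
import os
import multiprocessing

# Recent vLLM versions read this variable to choose the worker start method.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

if __name__ == "__main__":
    multiprocessing.set_start_method("spawn", force=True)
    from vllm import LLM

    llm = LLM(model="Qwen/Qwen2-7B-Instruct", tensor_parallel_size=2)
    print(llm.generate(["Hello"]))
```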
I will be another happy user if this is implemented.
🚀 The feature, motivation and pitch
The feature request is to add support for a load/unload endpoint/API in vLLM to dynamically load and unload multiple LLMs within a single GPU instance. This feature aims to enhance resource utilization and scalability by allowing concurrent operation of multiple LLMs on the same GPU.
A load/unload endpoint in vLLM would facilitate:
Increased Resource Utilization: Enables concurrent operation of multiple LLMs on a single GPU, optimizing computational resources and system efficiency.
Enhanced Scalability: Allows dynamic model loading and unloading based on demand, adapting to varying workloads and user requirements.
Improved Cost-effectiveness: Maximizes throughput and performance without additional hardware investments, ideal for organizations with budget constraints.
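As a rough illustration of what the requested endpoint could look like, here is a hypothetical sketch. Nothing below exists in vLLM today; the route names, the `loaded_models` registry, and the request schemas are all invented for the example, and sharing one GPU relies on each model reserving only a fraction of memory via `gpu_memory_utilization`:

```python
# Hypothetical sketch only: vLLM does not currently expose such endpoints.
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()
loaded_models: dict = {}  # model name -> LLM instance (invented registry)

class LoadRequest(BaseModel):
    model: str
    gpu_memory_utilization: float = 0.3  # fraction of GPU memory per model

class UnloadRequest(BaseModel):
    model: str

@app.post("/v1/models/load")
def load_model(req: LoadRequest):
    from vllm import LLM
    if req.model in loaded_models:
        raise HTTPException(409, "model already loaded")
    loaded_models[req.model] = LLM(
        model=req.model,
        gpu_memory_utilization=req.gpu_memory_utilization,
    )
    return {"status": "loaded", "model": req.model}

@app.post("/v1/models/unload")
def unload_model(req: UnloadRequest):
    import gc
    import torch
    llm = loaded_models.pop(req.model, None)
    if llm is None:
        raise HTTPException(404, "model not loaded")
    del llm
    gc.collect()
    torch.cuda.empty_cache()  # best-effort release of cached GPU memory
    return {"status": "unloaded", "model": req.model}
```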
Alternatives
Alternatively, providing an API for manual model unloading would offer finer control over resource management.
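For the single-model, in-process case, something like the following best-effort teardown is often suggested today. This is a sketch, not an official vLLM API; fully releasing GPU memory this way is not guaranteed, particularly with tensor parallelism:

```python
# Best-effort manual unload of an in-process engine (sketch only).
import gc
import torch
from vllm import LLM

llm = LLM(model="Qwen/Qwen2-7B-Instruct")
outputs = llm.generate(["Hello"])

del llm
gc.collect()
torch.cuda.empty_cache()  # returns cached blocks to the driver
```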
Additional context