vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: initializing multiple LLM classes simultaneously on the same GPU get an error #9198

Open TonyUSTC opened 1 month ago

TonyUSTC commented 1 month ago

Your current environment

python: 3.8 cuda: 11.8 vllm: 0.5.5+cu118

Model Input Dumps

No response

🐛 Describe the bug

My model is Qwen2 1.5B, so I want to initialize multiple workers on a single GPU (T4, 16 GB memory) for higher throughput. However, initializing multiple LLM classes simultaneously on the same GPU results in an error: ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine. If they are initialized one by one sequentially, there is no problem. My code is below; with gpu_memory_utilization=0.4, I expected to be able to initialize 2 workers.

self.llm = LLM(model=model_path, tensor_parallel_size=1, enforce_eager=True, gpu_memory_utilization=0.4)
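A rough repro sketch of what I mean (the model path is a placeholder, and I'm showing the two workers as separate processes started at the same time):

```python
import multiprocessing as mp
from vllm import LLM

def init_worker(model_path):
    # When both workers profile the GPU at the same time, this raises:
    #   ValueError: No available memory for the cache blocks. ...
    # It works if the second worker starts only after the first has finished.
    LLM(
        model=model_path,
        tensor_parallel_size=1,
        enforce_eager=True,
        gpu_memory_utilization=0.4,
    )

if __name__ == "__main__":
    mp.set_start_method("spawn")
    procs = [
        mp.Process(target=init_worker, args=("Qwen/Qwen2-1.5B",))  # placeholder path
        for _ in range(2)
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```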


DarkLight1337 commented 1 month ago

vLLM doesn't support simultaneous usage of multiple LLM instances on the same GPU. We recommend running those LLM instances in separate processes, each with its own GPU.
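A minimal sketch of that setup, assuming two GPUs are available and using CUDA_VISIBLE_DEVICES to pin each process to one device (the model path and prompts are placeholders):

```python
import multiprocessing as mp
import os

def run_instance(gpu_id, prompts):
    # Pin this process to a single GPU before vLLM initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM  # import after setting the env var
    llm = LLM(model="Qwen/Qwen2-1.5B", enforce_eager=True)  # placeholder path
    for output in llm.generate(prompts):
        print(gpu_id, output.outputs[0].text)

if __name__ == "__main__":
    mp.set_start_method("spawn")  # avoid inheriting CUDA state via fork
    procs = [
        mp.Process(target=run_instance, args=(i, ["Hello, world!"]))
        for i in range(2)  # one process per GPU
    ]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```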

DarkLight1337 commented 1 month ago

There should be no need to use different LLM instances on the same GPU, because even when the model is small, we can utilize the GPU for other optimizations such as request batching and KV caching.
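For example, a single instance already batches a list of prompts internally, so one LLM object per GPU is enough (a minimal sketch; the model path and sampling settings are placeholders):

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2-1.5B", enforce_eager=True)  # placeholder path
params = SamplingParams(max_tokens=64)

# One instance handles all prompts; the scheduler batches them on the GPU,
# so there is no need for multiple LLM objects on the same device.
prompts = [f"Question {i}: what is vLLM?" for i in range(32)]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)
```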

TonyUSTC commented 1 month ago

> There should be no need to use different LLM instances on the same GPU, because even when the model is small, we can utilize the GPU for other optimizations such as request batching and KV caching.

Thanks for your reply. I noticed the data_parallel_size argument. What does this argument mean? Is it similar to creating multiple instances on a single GPU?

DarkLight1337 commented 1 month ago

> > There should be no need to use different LLM instances on the same GPU, because even when the model is small, we can utilize the GPU for other optimizations such as request batching and KV caching.
>
> Thanks for your reply. I noticed the data_parallel_size argument. What does this argument mean? Is it similar to creating multiple instances on a single GPU?

This is a WIP feature. Please refer to the linked PR above for more details, and ask questions there if you are still unsure.