Open · TonyUSTC opened this issue 1 month ago
vLLM doesn't support simultaneous usage of multiple LLM class instances. We recommend running those LLM instances in separate processes, each with its own GPU. There should be no need to use different LLM instances on the same GPU: even when the model is small, vLLM can make full use of the GPU through other optimizations such as request batching and the KV cache.
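A minimal sketch of that recommended pattern, one process per GPU (the model path, prompt, and spawn-based launcher are placeholders, not from this thread):

```python
import os
import multiprocessing as mp

def run_worker(gpu_id: int, model_path: str, prompts):
    # Pin this child process to a single GPU before vLLM initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path, tensor_parallel_size=1)
    for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
        print(f"[GPU {gpu_id}] {out.outputs[0].text!r}")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # each child gets its own clean CUDA context
    model_path = "Qwen/Qwen2-1.5B-Instruct"  # hypothetical path/name
    prompts = ["Hello, how are you?"]
    workers = [mp.Process(target=run_worker, args=(i, model_path, prompts))
               for i in range(2)]  # one process per GPU
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```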
Thanks for your reply. I noticed the data_parallel_size argument. What does this argument mean? Is it similar to creating multiple instances on a single GPU?
This is a WIP feature. Please refer to the linked PR above for more details, and ask questions there if you are still unsure.
Your current environment
python: 3.8
cuda: 11.8
vllm: 0.5.5+cu118
Model Input Dumps
No response
🐛 Describe the bug
My model is Qwen2 1.5B, so I want to initialize multiple workers on a single GPU (T4, 16 GB memory) for higher throughput. However, initializing multiple LLM classes simultaneously on the same GPU results in an error:

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

If the instances are initialized one by one sequentially, there is no problem. My code is below; with gpu_memory_utilization=0.4 I can initialize 2 workers.

self.llm = LLM(model=model_path, tensor_parallel_size=1, enforce_eager=True, gpu_memory_utilization=0.4)
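For contrast, a hedged sketch of the single-instance pattern the maintainers describe above: one LLM engine that claims most of the T4's memory and a single batched generate call, letting vLLM's request batching provide the throughput (model_path and prompts are placeholders, not from this issue):

```python
from vllm import LLM, SamplingParams

# One engine owns most of the GPU memory; vLLM batches incoming requests itself.
llm = LLM(
    model=model_path,            # assumed local Qwen2-1.5B checkpoint path
    tensor_parallel_size=1,
    enforce_eager=True,
    gpu_memory_utilization=0.9,  # a single instance can safely use more memory
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(prompts, sampling)  # prompts: list[str], batched internally
for out in outputs:
    print(out.outputs[0].text)
```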
Before submitting a new issue...