Open · TonyUSTC opened this issue 1 month ago
vLLM doesn't support simultaneous usage of multiple LLM class instances. We recommend running those LLM instances in separate processes, each with its own GPU. There should be no need to use different LLM instances on the same GPU: even when the model is small, vLLM can make full use of the GPU through other optimizations such as request batching and the KV cache.
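A minimal sketch of that recommended pattern, one process per GPU (the model path, prompt, and spawn-based launcher are placeholders, not from this thread):

```python
import os
import multiprocessing as mp

def run_worker(gpu_id: int, model_path: str, prompts):
    # Pin this child process to a single GPU before vLLM initializes CUDA.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_path, tensor_parallel_size=1)
    for out in llm.generate(prompts, SamplingParams(max_tokens=64)):
        print(f"[GPU {gpu_id}] {out.outputs[0].text!r}")

if __name__ == "__main__":
    mp.set_start_method("spawn")  # each child gets its own clean CUDA context
    model_path = "Qwen/Qwen2-1.5B-Instruct"  # hypothetical path/name
    prompts = ["Hello, how are you?"]
    workers = [mp.Process(target=run_worker, args=(i, model_path, prompts))
               for i in range(2)]  # one process per GPU
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```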
Thanks for your reply. I noticed the data_parallel_size argument. What does this argument mean? Is it similar to creating multiple instances on a single GPU?
This is a WIP feature. Please refer to the linked PR above for more details, and ask questions there if you are still unsure.
Your current environment
python: 3.8
cuda: 11.8
vllm: 0.5.5+cu118
Model Input Dumps
No response
🐛 Describe the bug
My model is Qwen2 1.5B, so I want to initialize multiple workers on a single GPU (T4, 16 GB memory) for higher throughput. However, initializing multiple LLM classes simultaneously on the same GPU results in an error:

ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine.

If the instances are initialized one by one sequentially, there is no problem. My code is below; with gpu_memory_utilization=0.4 I can initialize 2 workers.

self.llm = LLM(model=model_path, tensor_parallel_size=1, enforce_eager=True, gpu_memory_utilization=0.4)
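For contrast, a hedged sketch of the single-instance pattern the maintainers describe above: one LLM engine that claims most of the T4's memory and a single batched generate call, letting vLLM's request batching provide the throughput (model_path and prompts are placeholders, not from this issue):

```python
from vllm import LLM, SamplingParams

# One engine owns most of the GPU memory; vLLM batches incoming requests itself.
llm = LLM(
    model=model_path,            # assumed local Qwen2-1.5B checkpoint path
    tensor_parallel_size=1,
    enforce_eager=True,
    gpu_memory_utilization=0.9,  # a single instance can safely use more memory
)

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(prompts, sampling)  # prompts: list[str], batched internally
for out in outputs:
    print(out.outputs[0].text)
```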
Before submitting a new issue...