[Open] hahmad2008 opened this issue 6 months ago
This is the info printed from the allocated cache blocks, for starting two models sequentially on an A10 with 23G of GPU RAM, with gpu_memory_utilization: 0.2
and @serve.deployment(ray_actor_options={"num_gpus": 0.2},)
I printed the values for starting model 1 (tinyllama 1b):
(ServeReplica:model1:MyModel pid=654872) free_gpu_memory: 20509491200 total_gpu_memory: 23609475072
(ServeReplica:model1:MyModel pid=654872) peak_memory: 3099983872
(ServeReplica:model1:MyModel pid=654872) head_size: 64 num_heads: 4 num_layers: 22
(ServeReplica:model1:MyModel pid=654872) cache_block_size: 360448
(ServeReplica:model1:MyModel pid=654872) num_gpu_blocks: 4499
(ServeReplica:model1:MyModel pid=654872) total_gpu_memory: 23609475072, gpu_memory_utilization: 0.2, peak_memory: 3099983872, cache_block_size: 360448
(ServeReplica:model1:MyModel pid=654872) INFO 04-22 16:24:51 llm_engine.py:322] # GPU blocks: 4499, # CPU blocks: 11915
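As a sanity check, the cache_block_size in the log is consistent with the usual KV-cache block sizing. Below is a minimal sketch of that arithmetic, assuming vLLM's default block_size of 16 tokens and an fp16 KV cache; it is not copied from vLLM's source:

```python
block_size = 16      # tokens per KV-cache block (assumed vLLM default)
num_heads = 4        # values from the log above
head_size = 64
num_layers = 22
dtype_bytes = 2      # fp16
kv_factor = 2        # one key block and one value block per layer

cache_block_size = block_size * num_heads * head_size * num_layers * kv_factor * dtype_bytes
print(cache_block_size)  # 360448, matching the log
```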
And this is for starting model 2:
(ServeReplica:model2:MyModel pid=658303) free_gpu_memory: 16070606848 total_gpu_memory: 23609475072
(ServeReplica:model2:MyModel pid=658303) peak_memory: 7538868224
(ServeReplica:model2:MyModel pid=658303) head_size: 64 num_heads: 4 num_layers: 22
(ServeReplica:model2:MyModel pid=658303) cache_block_size: 360448
(ServeReplica:model2:MyModel pid=658303) num_gpu_blocks: -7816
(ServeReplica:model1:MyModel pid=654872) total_gpu_memory: 23609475072, gpu_memory_utilization: 0.2, peak_memory: 7538868224, cache_block_size: 360448
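The negative block count follows from how the profiler budgets the KV cache: it takes gpu_memory_utilization × total_gpu_memory and subtracts the peak memory observed during profiling. Here is a simplified sketch of that calculation (paraphrased from vLLM ~0.3.0's worker profiling logic, not a verbatim copy), plugged with the numbers from both logs:

```python
def estimate_num_gpu_blocks(total_gpu_memory, free_gpu_memory,
                            gpu_memory_utilization, cache_block_size):
    # Peak memory used during the profiling forward pass, measured device-wide.
    peak_memory = total_gpu_memory - free_gpu_memory
    # KV-cache budget: a fraction of the *whole* device minus what is already in use.
    return int((total_gpu_memory * gpu_memory_utilization - peak_memory) // cache_block_size)

# Model 1: the GPU is still mostly free, so the budget is positive.
print(estimate_num_gpu_blocks(23609475072, 20509491200, 0.2, 360448))  # 4499

# Model 2: model 1 already holds ~7 GB, so peak_memory (7538868224) exceeds
# the 0.2 * 23609475072 ≈ 4.7 GB budget and the result goes negative.
print(estimate_num_gpu_blocks(23609475072, 16070606848, 0.2, 360448))  # -7816
```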
@rkooo567 @ywang96 Could you please check this issue?
Hmm, not sure if sharing 1 GPU between 2 models is supported by vLLM. At least I don't know of any test related to it. @simon-mo @ywang96, do you guys know?
I found that vLLM gets the total memory from free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info(),
which doesn't match what the replica was allocated (0.2 of the GPU memory)! It still sees the entire memory size!
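For illustration, here is what that call does; it queries the CUDA driver for device-wide numbers, so a Ray replica assigned num_gpus: 0.2 still sees the full card (a minimal demo, not vLLM code):

```python
import torch

# torch.cuda.mem_get_info() returns (free, total) for the whole device,
# straight from the CUDA driver. It knows nothing about Ray's fractional
# "num_gpus" bookkeeping, so every replica sees the full ~23.6 GB A10.
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
print(f"free: {free_gpu_memory}  total: {total_gpu_memory}")
```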
@rkooo567
I just modified the code after starting the first model. I edited
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
so that it reported
(ServeReplica:model2:MyModel pid=658303) free_gpu_memory: 13923123200 total_gpu_memory: 16070606848 (values I set for the replica)
instead of the generated:
(ServeReplica:model2:MyModel pid=658303) free_gpu_memory: 16070606848 total_gpu_memory: 23609475072
And it works fine! The problem is in total_gpu_memory,
which returns the entire memory of the GPU, which is wrong! It should return the memory that this replica can see, based on its allocation.
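A rough sketch of the kind of override being described (hypothetical; the two values are hard-coded from the modified log line above, whereas a real fix would derive the replica's share programmatically):

```python
import torch

# Original call: reports the whole device, ignoring the replica's 0.2 share.
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()

# Hypothetical override for the second replica: treat "total" as the memory
# that was free when this replica started, and "free" as what is free now.
# With these numbers, 0.2 * total minus the profiling peak stays positive,
# so the computed num_gpu_blocks is > 0 and the engine starts.
total_gpu_memory = 16070606848
free_gpu_memory = 13923123200
```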
@hahmad2008, could you please clarify what you mean by "I just modify the code after starting the first model"?
@oandreeva-nv I start the first model without changing anything in the code; then, before starting the second model, I change the values returned by free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info().
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!
Your current environment
vllm 0.3.0 ray 2.9.2
🐛 Describe the bug
I am trying to serve two models (tinyllama 1b) on the same GPU. I have a cluster with an A10 GPU (22G RAM), so I use
@serve.deployment(ray_actor_options={"num_gpus": 0.4},)
and ENGINE_ARGS = AsyncEngineArgs(gpu_memory_utilization=0.4, model=model_path, max_model_len=128, enforce_eager=True)
I can start only one model on a replica with 40% of the GPU; that model reserves 10G of the 22G of GPU RAM. However, when I tried to start the second model I got this error, although it created another replica and the cluster's GPU usage went to 0.8/1.
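For completeness, a minimal sketch of the setup being described (assumptions: the deployment class name MyModel comes from the replica names in the logs, the engine is wrapped with AsyncLLMEngine.from_engine_args, and model_path points at a TinyLlama checkpoint; this is not the exact code from the issue):

```python
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

model_path = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # assumed; any small model works

ENGINE_ARGS = AsyncEngineArgs(
    model=model_path,
    gpu_memory_utilization=0.4,
    max_model_len=128,
    enforce_eager=True,
)

@serve.deployment(ray_actor_options={"num_gpus": 0.4})
class MyModel:
    def __init__(self):
        # Each replica builds its own engine. Both replicas land on the same A10,
        # and each one profiles the *whole* device via torch.cuda.mem_get_info(),
        # which is where the second engine's KV-cache budget goes negative.
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

app = MyModel.bind()
# serve.run(app, name="model1")  # deploying a second copy as "model2" triggers the failure
```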