vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Load multiple models on the same GPU in the cluster #4242

Open hahmad2008 opened 6 months ago

hahmad2008 commented 6 months ago

Your current environment

vllm 0.3.0 ray 2.9.2

πŸ› Describe the bug

I am trying to serve two models (TinyLlama 1.1B) on the same GPU. I have a cluster with an A10 GPU (22G RAM), so I use @serve.deployment(ray_actor_options={"num_gpus": 0.4}) and ENGINE_ARGS = AsyncEngineArgs(gpu_memory_utilization=0.4, model=model_path, max_model_len=128, enforce_eager=True).
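
Roughly, the deployment looks like this (a simplified sketch of my python_script_serving.py, not the exact script):

```python
from ray import serve
from vllm.engine.arg_utils import AsyncEngineArgs
from vllm.engine.async_llm_engine import AsyncLLMEngine

model_path = "TinyLlama/TinyLlama-1.1B-Chat-v0.1"

ENGINE_ARGS = AsyncEngineArgs(
    model=model_path,
    gpu_memory_utilization=0.4,  # each engine should only use 40% of the GPU
    max_model_len=128,
    enforce_eager=True,
)

@serve.deployment(ray_actor_options={"num_gpus": 0.4})  # fractional GPU per replica
class MyModel:
    def __init__(self):
        # fails for the second application with "No available memory for the cache blocks"
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

# each model is deployed as its own Serve application, e.g.:
# serve.run(MyModel.bind(), name="model1", route_prefix="/model1")
# serve.run(MyModel.bind(), name="model2", route_prefix="/model2")
```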

I can start the first model on a replica with 40% of the GPU, and the model reserves 10G/22G of GPU RAM. However, when I try to start the second model I get the error below, even though a second replica is created and the cluster's GPU usage is now 0.8/1.

2024-04-21 15:12:15,310 INFO worker.py:1540 -- Connecting to existing Ray cluster at address: 10.5.8.112:6379...
2024-04-21 15:12:15,317 INFO worker.py:1715 -- Connected to Ray cluster. View the dashboard at 127.0.0.1:8265 
(ServeController pid=11766) INFO 2024-04-21 15:12:15,440 controller 11766 deployment_state.py:1545 - Deploying new version of deployment MyModel in application 'model2'. Setting initial target number of replicas to 1.
(ServeController pid=11766) INFO 2024-04-21 15:12:15,541 controller 11766 deployment_state.py:1829 - Adding 1 replica to deployment MyModel in application 'model2'.
(ServeReplica:model2:MyModel pid=66303) INFO 04-21 15:12:18 llm_engine.py:72] Initializing an LLM engine with config: model='TinyLlama/TinyLlama-1.1B-Chat-v0.1', tokenizer='TinyLlama/TinyLlama-1.1B-Chat-v0.1', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=128, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, seed=0)
(ServeReplica:model2:MyModel pid=66303) INFO 04-21 15:12:21 weight_utils.py:164] Using model weights format ['*.safetensors']
(ServeController pid=11766) ERROR 2024-04-21 15:12:24,184 controller 11766 deployment_state.py:658 - Exception in replica 'model2#MyModel#hbNcQm', the replica will be stopped.
(ServeController pid=11766) Traceback (most recent call last):
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/ray/serve/_private/deployment_state.py", line 656, in check_ready
(ServeController pid=11766)     _, self._version = ray.get(self._ready_obj_ref)
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
(ServeController pid=11766)     return fn(*args, **kwargs)
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
(ServeController pid=11766)     return func(*args, **kwargs)
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/ray/_private/worker.py", line 2624, in get
(ServeController pid=11766)     raise value.as_instanceof_cause()
(ServeController pid=11766) ray.exceptions.RayTaskError(RuntimeError): ray::ServeReplica:model2:MyModel.initialize_and_get_metadata() (pid=66303, ip=10.5.8.112, actor_id=e6ee395511fd55b8b5457d7501000000, repr=<ray.serve._private.replica.ServeReplica:model2:MyModel object at 0x7fc28de7a1c0>)
(ServeController pid=11766)   File "/myenv/lib/python3.9/concurrent/futures/_base.py", line 439, in result
(ServeController pid=11766)     return self.__get_result()
(ServeController pid=11766)   File "/myenv/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
(ServeController pid=11766)     raise self._exception
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 455, in initialize_and_get_metadata
(ServeController pid=11766)     raise RuntimeError(traceback.format_exc()) from None
(ServeController pid=11766) RuntimeError: Traceback (most recent call last):
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 443, in initialize_and_get_metadata
(ServeController pid=11766)     await self._initialize_replica()
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/ray/serve/_private/replica.py", line 182, in initialize_replica
(ServeController pid=11766)     await sync_to_async(_callable.__init__)(*init_args, **init_kwargs)
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/ray/serve/api.py", line 237, in __init__
(ServeController pid=11766)     cls.__init__(self, *args, **kwargs)
(ServeController pid=11766)   File "python_script_serving.py", line 28, in __init__
(ServeController pid=11766)     self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 623, in from_engine_args
(ServeController pid=11766)     engine = cls(parallel_config.worker_use_ray,
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
(ServeController pid=11766)     self.engine = self._init_engine(*args, **kwargs)
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/vllm/engine/async_llm_engine.py", line 364, in _init_engine
(ServeController pid=11766)     return engine_class(*args, **kwargs)
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 114, in __init__
(ServeController pid=11766)     self._init_cache()
(ServeController pid=11766)   File "/myenv/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 326, in _init_cache
(ServeController pid=11766)     raise ValueError("No available memory for the cache blocks. "
(ServeController pid=11766) ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
(ServeReplica:model2:MyModel pid=66303) INFO 04-21 15:12:24 llm_engine.py:322] # GPU blocks: 0, # CPU blocks: 11915
(ServeReplica:model2:MyModel pid=66303) sys:1: RuntimeWarning: coroutine 'ingress.<locals>.decorator.<locals>.ASGIIngressWrapper.__del__' was never awaited
hahmad2008 commented 6 months ago

This is the info printed while allocating cache blocks, when starting the two models sequentially on an A10 with 23G RAM, using gpu_memory_utilization=0.2 and @serve.deployment(ray_actor_options={"num_gpus": 0.2}).

I printed the values for starting model 1 (TinyLlama 1.1B):

(ServeReplica:model1:MyModel pid=654872) free_gpu_memory:  20509491200 total_gpu_memory:  23609475072
(ServeReplica:model1:MyModel pid=654872) peak_memory:  3099983872
(ServeReplica:model1:MyModel pid=654872) head_size:  64 num_heads:  4 num_layers:  22
(ServeReplica:model1:MyModel pid=654872) cache_block_size:  360448
(ServeReplica:model1:MyModel pid=654872) num_gpu_blocks:  4499
(ServeReplica:model1:MyModel pid=654872) total_gpu_memory: 23609475072, gpu_memory_utilization: 0.2, peak_memory: 3099983872, cache_block_size: 360448
(ServeReplica:model1:MyModel pid=654872) INFO 04-22 16:24:51 llm_engine.py:322] # GPU blocks: 4499, # CPU blocks: 11915

And these are the values for starting model 2:

(ServeReplica:model2:MyModel pid=658303) free_gpu_memory:  16070606848 total_gpu_memory:  23609475072
(ServeReplica:model2:MyModel pid=658303) peak_memory:  7538868224
(ServeReplica:model2:MyModel pid=658303) head_size:  64 num_heads:  4 num_layers:  22
(ServeReplica:model2:MyModel pid=658303) cache_block_size:  360448
(ServeReplica:model2:MyModel pid=658303) num_gpu_blocks:  -7816
(ServeReplica:model1:MyModel pid=654872) total_gpu_memory: 23609475072, gpu_memory_utilization: 0.2, peak_memory: 7538868224, cache_block_size: 360448
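
Putting these numbers into the block formula (this is roughly what vLLM 0.3.0 does when profiling, as far as I can tell) shows the issue: peak_memory = total_gpu_memory - free_gpu_memory comes from torch.cuda.mem_get_info(), which is device-wide, so for model 2 it also counts everything model 1 already allocated:

```python
# Rough sketch of the block computation during vLLM 0.3.0 profiling (my reading of it):
# peak_memory is derived from device-wide numbers, so the second replica "pays" for
# the memory the first model already holds.

def num_gpu_blocks(total_gpu_memory, free_gpu_memory, gpu_memory_utilization, cache_block_size):
    peak_memory = total_gpu_memory - free_gpu_memory
    return int((total_gpu_memory * gpu_memory_utilization - peak_memory) // cache_block_size)

# model 1: (23609475072 * 0.2 - 3099983872) // 360448  ->  4499 blocks
print(num_gpu_blocks(23609475072, 20509491200, 0.2, 360448))
# model 2: (23609475072 * 0.2 - 7538868224) // 360448  -> -7816 blocks, hence the ValueError
print(num_gpu_blocks(23609475072, 16070606848, 0.2, 360448))
```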
hahmad2008 commented 6 months ago

@rkooo567 @ywang96 Could you please check this issue?

rkooo567 commented 6 months ago

Hmm, not sure if sharing 1 GPU between 2 models is supported by vLLM. At least I don't know of any test related to it. @simon-mo @ywang96 do you guys know?

hahmad2008 commented 6 months ago

I found that vLLM uses free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info() to get the total memory.

This doesn't respect the share the replica was actually allocated (0.2 of the GPU memory); it still sees the entire GPU memory!
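
A quick way to see this (a small sketch, on a single-GPU node):

```python
import ray
import torch

ray.init()

@ray.remote(num_gpus=0.2)  # fractional GPU, like the Serve replica
def report_gpu_memory():
    # torch.cuda.mem_get_info() is a device-level query: it returns free/total
    # bytes for the whole physical GPU, not the 20% share Ray assigned here.
    return torch.cuda.mem_get_info()

free, total = ray.get(report_gpu_memory.remote())
print(free, total)  # total is the full card, e.g. ~23609475072 on the A10
```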

hahmad2008 commented 6 months ago

@rkooo567 I just modified the code after starting the first model: I replaced the values returned by free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info() with free_gpu_memory: 13923123200, total_gpu_memory: 16070606848 (the values I set for the second replica) instead of the generated free_gpu_memory: 16070606848, total_gpu_memory: 23609475072.

And it works fine! The problem is that total_gpu_memory returns the entire memory of the GPU, which is wrong; it should return the memory that this replica can see based on its allocation.
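
Concretely, the hack is just this, at the place where vLLM calls mem_get_info() during profiling (values hard-coded for my A10 with model 1 already running, so it's not a general fix):

```python
# original: device-wide numbers (total_gpu_memory is the whole 23.6 GB card)
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()

# hack for the second model: override with values that reflect what this replica
# can actually use (hard-coded for my A10 with model 1 already running; a proper
# fix would derive this from the replica's GPU share)
free_gpu_memory, total_gpu_memory = 13923123200, 16070606848
```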

oandreeva-nv commented 6 months ago

@hahmad2008, could you please clarify what you mean by "I just modified the code after starting the first model"?

hahmad2008 commented 6 months ago

@oandreeva-nv I started the first model without changing anything in the code; then, before starting the second model, I changed the values returned by free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info() as described above.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!