vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Unable to specify GPU usage in VLLM code #3012

Open humza-sami opened 6 months ago

humza-sami commented 6 months ago

I am facing difficulties in specifying GPU usage for different models in an LLM inference pipeline using vLLM. Specifically, I have 4 RTX 4090 GPUs available, and I aim to run an LLM with a size of 42GB on 2 RTX 4090 GPUs (~48GB) and a separate model with a size of 22GB on 1 RTX 4090 GPU (~24GB). This is my code for running the 42GB model on two GPUs:

from vllm import LLM
llm = LLM(model_name, max_model_len=50, tensor_parallel_size=2)
output = llm.generate(text)

However, I haven't found a straightforward method within the vLLM library to specify which GPUs should be used for each model.

simon-mo commented 6 months ago

You can specify the devices by using the CUDA_VISIBLE_DEVICES environment variable.
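
For example, a minimal sketch (the model name is just a placeholder) that pins a single LLM instance to GPUs 0 and 1:

import os

# Set the device mask before vLLM (and torch/ray) initialize CUDA in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

from vllm import LLM

# The two visible GPUs are re-indexed as cuda:0 and cuda:1 inside this process.
llm = LLM("facebook/opt-125m", tensor_parallel_size=2)
print(llm.generate("Hello")[0].outputs[0].text)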

humza-sami commented 6 months ago

You can specify the devices by using the CUDA_VISIBLE_DEVICES environment variable.

@simon-mo

from vllm import LLM
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
llm_1 = LLM(llm_1_name,max_model_len=50,gpu_memory_utilization=0.9, tensor_parallel_size=2)

os.environ["CUDA_VISIBLE_DEVICES"] = "3"
llm_2 = LLM(llm_2_name,max_model_len=50,gpu_memory_utilization=0.9, tensor_parallel_size=1)

This still loads the 2nd LLM on GPUs 1 and 2 and gives a memory error.

simon-mo commented 6 months ago

Try instantiating them in different scripts?
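
For instance, a rough sketch (script names are made up) that launches each model in its own process with its own device mask:

import os
import subprocess

# Each script builds its own LLM instance; the env var decides what it can see.
env_big = {**os.environ, "CUDA_VISIBLE_DEVICES": "0,1"}   # 42GB model, tensor_parallel_size=2
env_small = {**os.environ, "CUDA_VISIBLE_DEVICES": "2"}   # 22GB model, tensor_parallel_size=1

p_big = subprocess.Popen(["python", "serve_llm_1.py"], env=env_big)
p_small = subprocess.Popen(["python", "serve_llm_2.py"], env=env_small)
p_big.wait()
p_small.wait()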

humza-sami commented 6 months ago

@simon-mo Separately they work, but my goal is to run two different LLMs at the same time: one LLM on 2 GPUs and a second LLM on the 3rd GPU.

from vllm import LLM
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
llm_1 = LLM(llm_1_name,max_model_len=50,gpu_memory_utilization=0.9, tensor_parallel_size=2)

os.environ["CUDA_VISIBLE_DEVICES"] = ""

os.environ["CUDA_VISIBLE_DEVICES"] = "2"
llm_2 = LLM(llm_2_name,max_model_len=50,gpu_memory_utilization=0.9, tensor_parallel_size=1)

RuntimeError                              Traceback (most recent call last)
Cell In[11], line 3
      1 os.environ["CUDA_VISIBLE_DEVICES"] = "2"
----> 3 llm_2 = LLM("codellama/CodeLlama-7b-Instruct-hf", max_model_len=4000, gpu_memory_utilization=0.9, tensor_parallel_size=1)

File /usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py:109, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, disable_custom_all_reduce, **kwargs)
     90 kwargs["disable_log_stats"] = True
     91 engine_args = EngineArgs(
     92     model=model,
     93     tokenizer=tokenizer,
    (...)
    107     **kwargs,
    108 )
--> 109 self.llm_engine = LLMEngine.from_engine_args(engine_args)
    110 self.request_counter = Counter()

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:371, in LLMEngine.from_engine_args(cls, engine_args)
    369 placement_group = initialize_cluster(parallel_config)
    370 # Create the LLM engine.
--> 371 engine = cls(*engine_configs,
    372              placement_group,
    373              log_stats=not engine_args.disable_log_stats)
    374 return engine

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:120, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, lora_config, placement_group, log_stats)
    118     self._init_workers_ray(placement_group)
    119 else:
--> 120     self._init_workers()
    122 # Profile the memory usage and initialize the cache.
    123 self._init_cache()

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:163, in LLMEngine._init_workers(self)
    149 distributed_init_method = get_distributed_init_method(
    150     get_ip(), get_open_port())
    151 self.driver_worker = Worker(
    152     self.model_config,
    153     self.parallel_config,
    (...)
    161     is_driver_worker=True,
    162 )
--> 163 self._run_workers("init_model")
    164 self._run_workers("load_model")

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:1014, in LLMEngine._run_workers(self, method, driver_args, driver_kwargs, max_concurrent_workers, use_ray_compiled_dag, *args, **kwargs)
   1011     driver_kwargs = kwargs
   1013 # Start the driver worker after all the ray workers.
-> 1014 driver_worker_output = getattr(self.driver_worker,
   1015                                method)(*driver_args, **driver_kwargs)
   1017 # Get the results of the ray workers.
   1018 if self.workers:

File /usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py:94, in Worker.init_model(self, cupy_port)
     91     raise RuntimeError(
     92         f"Not support device type: {self.device_config.device}")
     93 # Initialize the distributed environment.
---> 94 init_distributed_environment(self.parallel_config, self.rank,
     95                              cupy_port, self.distributed_init_method)
     96 # Initialize the model.
     97 set_random_seed(self.model_config.seed)

File /usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py:247, in init_distributed_environment(parallel_config, rank, cupy_port, distributed_init_method)
    245 torch_world_size = torch.distributed.get_world_size()
    246 if torch_world_size != parallel_config.world_size:
--> 247     raise RuntimeError(
    248         "torch.distributed is already initialized but the torch world "
    249         "size does not match parallel_config.world_size "
    250         f"({torch_world_size} vs. {parallel_config.world_size}).")
    251 elif not distributed_init_method:
    252     raise ValueError(
    253         "distributed_init_method must be set if torch.distributed "
    254         "is not already initialized")

RuntimeError: torch.distributed is already initialized but the torch world size does not match parallel_config.world_size (2 vs. 1).

KatIsCoding commented 6 months ago

I've had your exact same scenario; my solution was to run on docker-compose, because there you can specify which GPU IDs to make available to each instance.

KatIsCoding commented 6 months ago

Then expose their APIs and consume them from another script. It would be faster if you run the OpenAI-compatible API; however, if you want to add something custom like lm-format-enforcer, you might need to do the implementation yourself.

humza-sami commented 5 months ago

@KatIsCoding Thanks for your suggestion. Yeah, I ended up with the same conclusion, that I have to implement the Ray clustering myself. What I have noticed is that when I initialize the 2nd LLM object, it recreates a GPU/CPU cluster. If I manually change CUDA_VISIBLE_DEVICES before creating the 2nd LLM object in the same Python script, Ray gets confused and throws an error because the new configuration clashes with the 1st LLM object's cluster. In a single process (script), you cannot create a 2nd LLM object just by changing CUDA_VISIBLE_DEVICES.
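
One workaround sketch (not a vLLM feature, just standard multiprocessing; model names are placeholders) is to give each LLM its own child process, so CUDA_VISIBLE_DEVICES is set before that process ever touches CUDA:

import os
import multiprocessing as mp

def run_model(model_name, devices, tp_size):
    # Set the mask before vLLM/torch initialize CUDA in this child process.
    os.environ["CUDA_VISIBLE_DEVICES"] = devices
    from vllm import LLM
    llm = LLM(model_name, tensor_parallel_size=tp_size)
    print(llm.generate("Hello")[0].outputs[0].text)

if __name__ == "__main__":
    mp.set_start_method("spawn")  # avoid inheriting any CUDA state from the parent
    p1 = mp.Process(target=run_model, args=("my-42gb-model", "0,1", 2))
    p2 = mp.Process(target=run_model, args=("my-22gb-model", "2", 1))
    p1.start(); p2.start()
    p1.join(); p2.join()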

humza-sami commented 5 months ago

@KatIsCoding can you share your Docker setup? I don't have much experience with Docker. Thanks

sAviOr287 commented 3 months ago

@humza-sami were you able to figure out how to do this? I am facing the same problem and have no idea how to fix it at the moment. There is a solution using Ray, but I'm not sure how to implement it. Do you have any news on this issue?

KatIsCoding commented 3 months ago

@KatIsCoding can you share your Docker setup? I don't have much experience with Docker. Thanks

@humza-sami were you able to figure out how to do this? I am facing the same problem and have no idea how to fix it at the moment. There is a solution using Ray, but I'm not sure how to implement it. Do you have any news on this issue?

I'm sorry for my late response on the topic. As @sAviOr287 mentioned, there is a Ray implementation out there; however, I could not find much information about it.

So far my approach to the problem has been to use Docker with a separate instance for each model, like so:

version: "3.8"

networks:
  load_balancing:
    name: load_balancing

services:
  sqlcoder:
    profiles: [ai]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: sqlcoder
    networks:
      - load_balancing
    environment:
      - MODEL_ID=defog/sqlcoder-7b-2
      - TP_SIZE=1
      - ACCEPT_EMPTY_IDS=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

  llama:
    profiles: [ai-exp]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: llama
    networks:
      - load_balancing
    environment:
      - AI_SERVICE_PORT=1337
      - MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
      - ACCEPT_EMPTY_IDS=1
      - TP_SIZE=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  nginx:
    image: nginx:1.15-alpine
    profiles: [ai]
    networks:
      - load_balancing
    depends_on:
      - sqlcoder
      - llama
    volumes:
      - ./nginx-conf:/etc/nginx/conf.d
    ports:
      - 6565:6565 #SQL Coder
      - 6566:6566 #Llama

It is a load-balancing approach; however, a different model gets hit depending on which port you use. My Dockerfile is pretty much just installing vLLM plus some other dependencies, but it could be completely replaced with something like the OpenAI-compatible server vLLM ships.
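
If the containers were switched to vLLM's OpenAI-compatible server, a client could pick the model purely by port (just a sketch; ports follow the nginx mapping above):

import requests

def complete(port, model, prompt):
    # vLLM's OpenAI-compatible server exposes /v1/completions.
    resp = requests.post(
        f"http://localhost:{port}/v1/completions",
        json={"model": model, "prompt": prompt, "max_tokens": 64},
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

print(complete(6565, "defog/sqlcoder-7b-2", "-- List all users\nSELECT"))
print(complete(6566, "meta-llama/Meta-Llama-3-8B-Instruct", "Hello!"))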

KatIsCoding commented 3 months ago

The most important thing about the configuration is the usage of

deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0"]
          capabilities: [gpu]

By specifying device_ids you are essentially telling Docker which GPUs to make available to each container.

sparsh35 commented 3 months ago

Has anyone found a solution? I am trying to use it with accelerate but am getting the same error.

Ash-Zheng commented 3 months ago

I found that specifying GPU IDs for the Ray executor can be achieved by modifying worker_node_and_gpu_ids in vllm/executor/ray_gpu_executor.py.

sAviOr287 commented 3 months ago

Thanks for the suggestion. Do you have any example code for this? I don't think I fully understand your solution. Best

Ash-Zheng commented 3 months ago

Thanks for the suggestion. Do you have any example code for this? I don't think I fully understand your solution.

Hi @sAviOr287, I added the following code in vllm/executor/ray_gpu_executor.py (the GPU IDs that I want to use are given in self.GPUs):

# update GPU IDs if specified.
if self.GPUs is not None:
    assert len(self.GPUs) == len(worker_node_and_gpu_ids), \
        "Number of GPUs specified does not match the number of workers."
    for i, (node_id, gpu_ids) in enumerate(worker_node_and_gpu_ids):
        worker_node_and_gpu_ids[i] = (node_id, [self.GPUs[i]])