vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Multi GPU ROCm6 issues, and workarounds #2794

Open BKitor opened 6 months ago

BKitor commented 6 months ago

I ran into a series of issues trying to get vLLM stood up on a system with multiple MI210s. I figured I'd document my issues and workarounds so that someone could pick up the baton, or at least save themselves some debugging time later.

  1. Ray will deadlock with multiple AMD GPUs. Ray doesn't officially support AMD GPUs in v2.9; I updated Ray to nightlies (v3.0).

    pip uninstall ray
    pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp310-cp310-manylinux2014_x86_64.whl"
  2. Something might have changed with how Ray exposes GPUs to workers. Only one GPU was exposed to each worker, so torch.cuda.set_device() with anything other than 0 would fail. I tweaked worker.py to always use device 0 (see the inspection sketch after the diff below), but I don't think this is a viable long-term fix.

    diff --git a/vllm/worker/worker.py b/vllm/worker/worker.py
    index c97e82a..a63fbd9 100644
    --- a/vllm/worker/worker.py
    +++ b/vllm/worker/worker.py
    @@ -68,6 +68,7 @@ class Worker:
             self.gpu_cache = None
     
         def init_model(self) -> None:
    +        print(f"***** local_rank {self.local_rank} hit init_model, is_driver: {self.is_driver_worker} *****")
             if self.device_config.device.type == "cuda":
                 # torch.distributed.all_reduce does not free the input tensor until
                 # the synchronization point. This causes the memory usage to grow
    @@ -80,7 +81,9 @@ class Worker:
                 # This env var set by Ray causes exceptions with graph building.
                 os.environ.pop("NCCL_ASYNC_ERROR_HANDLING", None)
                 self.device = torch.device(f"cuda:{self.local_rank}")
    -            torch.cuda.set_device(self.device)
    +            print(f"***** trying to set dev {self.device} of {torch.cuda.device_count()} is_driver: {self.is_driver_worker} *****")
    +            # torch.cuda.set_device(self.device)
    +            torch.cuda.set_device(0)
     
                 _check_if_gpu_supports_dtype(self.model_config.dtype)
             else:
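
For item 2, here is a minimal way to inspect what Ray actually exposes to each worker (a sketch, not part of the diff above; the ROCR_VISIBLE_DEVICES / HIP_VISIBLE_DEVICES names checked below are assumptions and vary by Ray and ROCm version). If torch.cuda.device_count() is 1 inside a worker while ray.get_gpu_ids() reports a non-zero global ID, that would explain why only index 0 is settable; ray.cluster_resources() is also a quick check that the nightly Ray registered the AMD GPUs at all.

    import os

    import ray
    import torch


    @ray.remote(num_gpus=1)
    def report_gpus():
        # Report what this worker sees: Ray's assigned GPU IDs, the device count
        # visible to PyTorch, and the visibility env vars Ray may have set.
        return {
            "ray_gpu_ids": ray.get_gpu_ids(),
            "torch_device_count": torch.cuda.device_count(),
            "ROCR_VISIBLE_DEVICES": os.environ.get("ROCR_VISIBLE_DEVICES"),
            "HIP_VISIBLE_DEVICES": os.environ.get("HIP_VISIBLE_DEVICES"),
            "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES"),
        }


    if __name__ == "__main__":
        ray.init()
        print("cluster GPUs:", ray.cluster_resources().get("GPU"))
        print(ray.get([report_gpus.remote() for _ in range(2)]))
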
SuperBruceJia commented 2 months ago

@BKitor Have you found any solution for distributed inference? Thank you very much in advance!

Best regards,

Shuyue June 9th, 2024

BKitor commented 2 months ago

Sorry, I haven't poked at this in a while (I lost access to the multi-node system). But for single-node multi-GPU inference, the 'mp' distributed_executor_backend has been fairly stable.

SuperBruceJia commented 2 months ago

@BKitor Benjamin, I am using a single node with multiple GPUs, but there is a problem in init_device (https://github.com/vllm-project/vllm/blob/main/vllm/worker/worker.py#L92-L118).

Do you have any idea how to solve it?

Thank you very much, and have a nice day!

2024-06-10 16:36:06,142 INFO worker.py:1568 -- Connecting to existing Ray cluster at address: 192.168.19.245:6379...
2024-06-10 16:36:06,142 INFO worker.py:1586 -- Calling ray.init() again after it has already been called.
INFO 06-10 16:36:06 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='./save_folder', speculative_config=None, tokenizer='meta-llama/Meta-Llama-3-8B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=./save_folder)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(pid=487121) /usr4/ec523/brucejia/.local/lib/python3.10/site-packages/transformers/utils/hub.py:124: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
(pid=487121)   warnings.warn(
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Traceback (most recent call last):
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 141, in execute_method
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 106, in init_device
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     torch.cuda.set_device(self.device)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]   File "/usr4/ec523/brucejia/.local/lib/python3.10/site-packages/torch/cuda/__init__.py", line 399, in set_device
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149]     torch._C._cuda_setDevice(device)
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
(RayWorkerWrapper pid=487232) ERROR 06-10 16:36:10 worker_base.py:149] 

Best regards,

Shuyue June 10th, 2024

BKitor commented 2 months ago

What I'm suggesting is to not use Ray. One of the arguments when instantiating a model is distributed_executor_backend, where the options include 'ray' or 'mp'. I'm not sure how you're launching your model; you might have to pass distributed_executor_backend="mp" where you create the LLM, i.e. from vllm import LLM; llm = LLM(<whatever your args already are>, distributed_executor_backend="mp"). Otherwise, some of the provided helper scripts let you specify --distributed-executor-backend on the command line, but this isn't universal, so YMMV.
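
Concretely, a minimal sketch of that (the model name and tensor_parallel_size below are placeholders; distributed_executor_backend is the argument name as of v0.4.3):

from vllm import LLM, SamplingParams

# Use the multiprocessing executor instead of Ray for single-node tensor parallelism.
llm = LLM(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # placeholder model
    tensor_parallel_size=2,                       # placeholder TP degree
    distributed_executor_backend="mp",
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)

# The OpenAI-compatible server exposes the same option as a CLI flag, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#       --model meta-llama/Meta-Llama-3-8B-Instruct \
#       --tensor-parallel-size 2 --distributed-executor-backend mp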

SuperBruceJia commented 2 months ago

@BKitor Benjamin, it seems that there is no distributed_executor_backend argument in LLM: https://github.com/vllm-project/vllm/blob/main/vllm/engine/llm_engine.py. However, there is one in the AsyncLLMEngine: https://github.com/vllm-project/vllm/blob/main/vllm/engine/async_llm_engine.py, which is for serving.

May I know which vLLM version you are using?

Thank you very much, and have a nice day!

Best regards,

Shuyue June 10th, 2024

BKitor commented 2 months ago

The file you're looking for is arg_utils.py, and the argument is present in 0.4.3:

https://github.com/vllm-project/vllm/blob/1197e02141df1a7442f21ff6922c98ec0bba153e/vllm/engine/arg_utils.py#L38
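
In other words, the argument lives on EngineArgs in arg_utils.py rather than on LLMEngine itself; the LLM entrypoint forwards its extra keyword arguments into EngineArgs, which is why passing distributed_executor_backend to LLM(...) works even though it doesn't appear in llm_engine.py. A quick check against the installed version (a sketch; the field name is assumed to match v0.4.3):

from dataclasses import fields

from vllm.engine.arg_utils import EngineArgs

# List the engine arguments this vLLM install actually accepts and confirm the
# executor-backend field is among them.
engine_fields = {f.name for f in fields(EngineArgs)}
print("distributed_executor_backend" in engine_fields)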


SuperBruceJia commented 2 months ago

Thank you very much, Benjamin! It really helps.

Now my multi-GPU inference runs smoothly. For other researchers' reference:

I use vLLM 0.4.3: https://github.com/vllm-project/vllm/releases/tag/v0.4.3

from vllm import LLM, SamplingParams

llm = LLM(
    model=save_dir,
    tokenizer=model_name,
    dtype='bfloat16',
    distributed_executor_backend="mp",
    tensor_parallel_size=num_gpus_vllm,
    gpu_memory_utilization=gpu_utilization_vllm,
    enable_lora=False,
)

sampling_params = SamplingParams(
    temperature=0,
    top_p=1,
    max_tokens=max_new_tokens,
    stop=stop_tokens,
)

completions = llm.generate(
    prompts,
    sampling_params,
)

@BKitor However, the GPU memory cannot be released on any GPU except the first-initialized one (cuda:0 in my case).

import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

# Delete the llm object and free the memory
destroy_model_parallel()
del llm.llm_engine.model_executor.driver_worker
del llm
gc.collect()
torch.cuda.empty_cache()
print("Successfully delete the llm pipeline and free the GPU memory.")

Do you have suggestions on releasing all the GPUs' memory?
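
(One variation that might be worth trying with the mp backend, a sketch only and not verified against v0.4.3: drop the whole model_executor rather than just the driver worker, so the worker processes holding the other GPUs can exit, and also destroy the torch.distributed process group. Whether this frees every GPU depends on the vLLM version.)

import contextlib
import gc

import torch
from vllm.distributed.parallel_state import destroy_model_parallel

# Tear down vLLM's parallel state, drop the executor (which owns the mp workers
# on the other GPUs), then destroy the process group before clearing caches.
destroy_model_parallel()
del llm.llm_engine.model_executor
del llm
with contextlib.suppress(Exception):
    torch.distributed.destroy_process_group()
gc.collect()
torch.cuda.empty_cache()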

Thank you very much, and have a nice day!

Best regards,

Shuyue June 11th, 2024