vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

Multi-GPU Support Failures with AMD MI210 #2942

Open tom-papatheodore opened 6 months ago

tom-papatheodore commented 6 months ago

Hello. Thanks for providing vLLM as a great open-source tool for inference and model serving! I was able to build vLLM on a cluster I maintain, but it only appears to work on a single MI210 GPU. Can someone please help me with this issue? The details of my attempt are as follows...

This is how I built vLLM on a node comprised of [2x] 64-core AMD EPYC CPUs and [4x] AMD MI210 GPUs with ROCm 5.7.1 installed:

## STEP 0: Create new Python virtual environment and activate it
python -m venv venv
source venv/bin/activate

## STEP 1: Install ROCm-enabled PyTorch
pip install --upgrade pip
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.7

## STEP 2: Build and install flash-attention package
git clone --recursive https://github.com/ROCmSoftwarePlatform/flash-attention.git
cd flash-attention
export GPU_ARCHS="gfx90a"
export PYTHON_SITE_PACKAGES=$(python -c 'import site; print(site.getsitepackages()[0])')
pip install packaging
pip install --upgrade setuptools wheel # Upgrading wheel was necessary to avoid build issues
pip install . # This takes multiple hours for some reason, but not your problem : )

## STEP 3: Install xformers
pip install xformers==0.0.23 --no-deps

## STEP 4: Build vLLM, but patch xformers first
cd ..
git clone https://github.com/vllm-project/vllm.git
cd vllm
bash patch_xformers.rocm.sh
pip install -U -r requirements-rocm.txt
python setup.py install

## NOTE: This patch was needed to build vLLM against ROCm 5.7.1.
## Otherwise, I ran into errors about multiple definitions.
## https://github.com/vllm-project/vllm/pull/2790/files

So everything is built at this point, time to test!
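
Before running vLLM itself, a quick sanity check that the ROCm PyTorch build actually sees all four MI210s (a minimal sketch; ROCm wheels expose the devices through the usual torch.cuda API):

# Sketch: confirm the ROCm PyTorch wheel sees all four MI210s.
import torch

print("GPUs visible:", torch.cuda.device_count())  # expect 4 on this node
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))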

## Single MI210 test
cat single-gpu-test.py
from vllm import LLM
llm = LLM(model="/work1/staff/papatheodore/model-serving/llama2/models/Llama-2-13b-chat-hf")
output = llm.generate("San Francisco is a")
print(output)

python single-gpu-test.py
INFO 02-20 08:40:29 llm_engine.py:79] Initializing an LLM engine with config: model='/work1/staff/papatheodore/model-serving/llama2/models/Llama-2-13b-chat-hf', tokenizer='/work1/staff/papatheodore/model-serving/llama2/models/Llama-2-13b-chat-hf', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.2.0+rocm5.7)
    Python  3.9.18 (you have 3.9.16)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
INFO 02-20 08:40:38 llm_engine.py:337] # GPU blocks: 2658, # CPU blocks: 327
INFO 02-20 08:40:38 model_runner.py:666] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 02-20 08:40:38 model_runner.py:670] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 02-20 08:40:42 model_runner.py:738] Graph capturing finished in 4 secs.
Processed prompts: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.06it/s]
[RequestOutput(request_id=0, prompt='San Francisco is a', prompt_token_ids=[1, 3087, 8970, 338, 263], prompt_logprobs=None, outputs=[CompletionOutput(index=0, text=" wonderful city, full of variety and culture. Whether you're into art,", token_ids=[20695, 4272, 29892, 2989, 310, 12875, 322, 9257, 29889, 26460, 366, 29915, 276, 964, 1616, 29892], cumulative_logprob=-26.27613702166127, logprobs=None, finish_reason=length)], finished=True, lora_request=None)]

## SUCCESS!! Woohoo!

## Multi-GPU Test
cat multi-gpu-test.py
from vllm import LLM
llm = LLM(model="/work1/staff/papatheodore/model-serving/llama2/models/Llama-2-13b-chat-hf", tensor_parallel_size=4)
output = llm.generate("San Francisco is a")
print(output)

python multi-gpu-test.py
INFO 02-20 08:39:22 config.py:400] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
2024-02-20 08:39:24,314 INFO worker.py:1724 -- Started a local Ray instance.
E0220 08:39:27.534037925 3102213 thd.cc:157]                           pthread_create failed: Resource temporarily unavailable
*** SIGTERM received at time=1708440012 on cpu 106 ***
PC: @     0x7f992b4fa43e  (unknown)  recv
    @     0x7f992b3ffd90  (unknown)  (unknown)
    @     0x7f971a55c502         80  boost::asio::detail::socket_ops::sync_recv1()
    @     0x7f9719e853a8        112  ray::ServerConnection::ReadBuffer()
    @     0x7f9719e869c2        496  ray::ServerConnection::ReadMessage()
    @     0x7f9719cbf304        112  ray::raylet::RayletConnection::AtomicRequestReply()
    @     0x7f9719cbfbbc        432  ray::raylet::RayletClient::RayletClient()
    @     0x7f9719c1f4ce       1248  ray::core::CoreWorker::CoreWorker()
    @     0x7f9719c2596f        624  ray::core::CoreWorkerProcessImpl::CoreWorkerProcessImpl()
    @     0x7f9719c26aaf         80  ray::core::CoreWorkerProcess::Initialize()
    @     0x7f9719ae6a1f       2016  __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
    @     0x7f9719ae7cd9         48  __pyx_tp_new_3ray_7_raylet_CoreWorker()
    @     0x7f992b6cc85b  (unknown)  (unknown)
    @     0x7f992b8ebee0  (unknown)  (unknown)
[2024-02-20 08:40:12,796 E 3102146 3102146] logging.cc:361: *** SIGTERM received at time=1708440012 on cpu 106 ***
[2024-02-20 08:40:12,796 E 3102146 3102146] logging.cc:361: PC: @     0x7f992b4fa43e  (unknown)  recv
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f992b3ffd90  (unknown)  (unknown)
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f971a55c502         80  boost::asio::detail::socket_ops::sync_recv1()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719e853a8        112  ray::ServerConnection::ReadBuffer()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719e869c2        496  ray::ServerConnection::ReadMessage()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719cbf304        112  ray::raylet::RayletConnection::AtomicRequestReply()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719cbfbbc        432  ray::raylet::RayletClient::RayletClient()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719c1f4ce       1248  ray::core::CoreWorker::CoreWorker()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719c2596f        624  ray::core::CoreWorkerProcessImpl::CoreWorkerProcessImpl()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719c26aaf         80  ray::core::CoreWorkerProcess::Initialize()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719ae6a1f       2016  __pyx_pw_3ray_7_raylet_10CoreWorker_1__cinit__()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f9719ae7cd9         48  __pyx_tp_new_3ray_7_raylet_CoreWorker()
[2024-02-20 08:40:12,797 E 3102146 3102146] logging.cc:361:     @     0x7f992b6cc85b  (unknown)  (unknown)
[2024-02-20 08:40:12,798 E 3102146 3102146] logging.cc:361:     @     0x7f992b8ebee0  (unknown)  (unknown)
Aborted (core dumped)
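
The "pthread_create failed: Resource temporarily unavailable" line near the top usually means a per-user process/thread limit is being hit while Ray spins up its workers. The limits in effect for the job can be checked from Python (a sketch using the standard resource module):

# Sketch: print the limits Ray's workers run under.
# pthread_create returning EAGAIN is commonly an exhausted RLIMIT_NPROC.
import resource

for name in ("RLIMIT_NPROC", "RLIMIT_NOFILE", "RLIMIT_AS"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(name, "soft:", soft, "hard:", hard)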


I would be grateful for any help you could offer on resolving this issue. Thank you :)

-Tom

hliuca commented 6 months ago

enforce eager?
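
For reference, forcing eager mode in the Python API looks something like this (a minimal sketch, reusing the model path from the test scripts above; equivalent to --enforce-eager on the CLI):

# Sketch: enforce_eager=True skips HIP/CUDA graph capture entirely.
from vllm import LLM

llm = LLM(
    model="/work1/staff/papatheodore/model-serving/llama2/models/Llama-2-13b-chat-hf",
    tensor_parallel_size=4,
    enforce_eager=True,
)
output = llm.generate("San Francisco is a")
print(output)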

tom-papatheodore commented 4 months ago

When attempting to serve a model with more than one GPU, the --enforce-eager flag only stops my attempt from hanging; it still crashes. I can serve the model from a single GPU just fine.

This is how I attempt to serve the model with multiple GPUs:

$ cat multi-gpu-vllm-server-start.sh
#!/bin/bash

python -m vllm.entrypoints.openai.api_server --model /work1/staff/papatheodore/model-serving/llama2-70b/models/Llama-2-13b-chat-hf --host 127.0.0.1 --port 8080 --tensor-parallel-size 4 --enforce-eager

And this is how it fails (it's unclear whether this is a vLLM or Ray issue):

$ ./multi-gpu-vllm-server-start.sh
INFO 04-12 21:48:39 api_server.py:229] args: Namespace(host='127.0.0.1', port=8080, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/work1/staff/papatheodore/model-serving/llama2-70b/models/Llama-2-13b-chat-hf', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 04-12 21:48:39 config.py:400] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
2024-04-12 21:48:41,785 INFO worker.py:1724 -- Started a local Ray instance.
[2024-04-12 21:48:43,965 E 205266 205266] core_worker.cc:215: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
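
Since the failure is the Ray worker being unable to register with the raylet, it may reproduce with Ray alone, without vLLM in the picture (a minimal sketch):

# Sketch: bring up a local Ray instance by itself; if worker registration
# fails here too, the problem is in Ray/raylet startup on this node rather
# than in vLLM.
import ray

ray.init(num_gpus=4)
print(ray.cluster_resources())  # expect "GPU": 4.0 in the output
ray.shutdown()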

And this is what's installed in my Python virtual environment (from the vLLM ROCm docs):

$ pip list
Package                   Version
------------------------- --------------
aioprometheus             23.12.0
aiosignal                 1.3.1
annotated-types           0.6.0
anyio                     4.3.0
argon2-cffi               23.1.0
argon2-cffi-bindings      21.2.0
arrow                     1.3.0
asttokens                 2.4.1
async-lru                 2.0.4
attrs                     23.2.0
Babel                     2.14.0
beautifulsoup4            4.12.3
bleach                    6.1.0
certifi                   2022.12.7
cffi                      1.16.0
charset-normalizer        2.1.1
click                     8.1.7
comm                      0.2.2
debugpy                   1.8.1
decorator                 5.1.1
defusedxml                0.7.1
distro                    1.9.0
einops                    0.7.0
exceptiongroup            1.2.0
executing                 2.0.1
fastapi                   0.109.2
fastjsonschema            2.19.1
filelock                  3.9.0
flash_attn                2.0.4
fqdn                      1.5.1
frozenlist                1.4.1
fsspec                    2024.2.0
h11                       0.14.0
httpcore                  1.0.4
httptools                 0.6.1
httpx                     0.27.0
huggingface-hub           0.20.3
idna                      3.4
importlib_metadata        7.0.2
ipykernel                 6.29.3
ipython                   8.18.1
ipywidgets                8.1.2
isoduration               20.11.0
jedi                      0.19.1
Jinja2                    3.1.2
json5                     0.9.23
jsonpointer               2.4
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
jupyter                   1.0.0
jupyter_client            8.6.1
jupyter-console           6.6.3
jupyter_core              5.7.2
jupyter-events            0.9.1
jupyter-lsp               2.2.4
jupyter_server            2.13.0
jupyter_server_terminals  0.5.3
jupyterlab                4.1.5
jupyterlab_pygments       0.3.0
jupyterlab_server         2.25.4
jupyterlab_widgets        3.0.10
MarkupSafe                2.1.3
matplotlib-inline         0.1.6
mistune                   3.0.2
mpmath                    1.3.0
msgpack                   1.0.7
nbclient                  0.10.0
nbconvert                 7.16.2
nbformat                  5.10.3
nest-asyncio              1.6.0
networkx                  3.2.1
ninja                     1.11.1.1
notebook                  7.1.2
notebook_shim             0.2.4
numpy                     1.26.4
openai                    1.14.0
orjson                    3.9.14
overrides                 7.7.0
packaging                 23.2
pandocfilters             1.5.1
parso                     0.8.3
pexpect                   4.9.0
pillow                    10.2.0
pip                       24.0
platformdirs              4.2.0
prometheus_client         0.20.0
prompt-toolkit            3.0.43
protobuf                  4.25.3
psutil                    5.9.8
ptyprocess                0.7.0
pure-eval                 0.2.2
pycparser                 2.21
pydantic                  2.6.1
pydantic_core             2.16.2
Pygments                  2.17.2
python-dateutil           2.9.0.post0
python-dotenv             1.0.1
python-json-logger        2.0.7
pytorch-triton-rocm       2.2.0
PyYAML                    6.0.1
pyzmq                     25.1.2
qtconsole                 5.5.1
QtPy                      2.4.1
quantile-python           1.1
ray                       2.9.2
referencing               0.33.0
regex                     2023.12.25
requests                  2.31.0
rfc3339-validator         0.1.4
rfc3986-validator         0.1.1
rpds-py                   0.18.0
safetensors               0.4.2
Send2Trash                1.8.2
sentencepiece             0.2.0
setuptools                69.1.0
six                       1.16.0
sniffio                   1.3.0
soupsieve                 2.5
stack-data                0.6.3
starlette                 0.36.3
sympy                     1.12
terminado                 0.18.1
tinycss2                  1.2.1
tokenizers                0.15.2
tomli                     2.0.1
torch                     2.2.0+rocm5.7
torchaudio                2.2.0+rocm5.7
torchvision               0.17.0+rocm5.7
tornado                   6.4
tqdm                      4.66.2
traitlets                 5.14.2
transformers              4.37.2
types-python-dateutil     2.9.0.20240316
typing_extensions         4.9.0
uri-template              1.3.0
urllib3                   1.26.13
uvicorn                   0.27.1
uvloop                    0.19.0
vllm                      0.3.1+rocm573
watchfiles                0.21.0
wcwidth                   0.2.13
webcolors                 1.13
webencodings              0.5.1
websocket-client          1.7.0
websockets                12.0
wheel                     0.42.0
widgetsnbextension        4.0.10
xformers                  0.0.23
zipp                      3.18.1

This is running on a compute node with [4x] MI210 GPUs. I've tried running it with ROCm 5.7.1 and 6.0.2 with the same results. Any guidance you can offer would be greatly appreciated. Thank you.
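
For completeness, the HIP/ROCm toolchain the installed PyTorch wheel was built against can be read from torch itself (a sketch; torch.version.hip is only set on ROCm builds):

# Sketch: confirm the wheel matches the ROCm stack installed on the node.
import torch

print("torch:", torch.__version__)  # 2.2.0+rocm5.7 according to pip list above
print("HIP:", torch.version.hip)    # None on CUDA-only builds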

tom-papatheodore commented 2 months ago

Hello. I wanted to ping this issue to see if there was any guidance on a solution or workaround. Please let me know. Thanks.

youkaichao commented 2 months ago

forwarded to amd folks.

we also have some tips for debugging this
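
a useful isolation step along those lines is a bare torch.distributed all-reduce across the four GPUs, with no vLLM or Ray involved (an illustrative sketch, assuming the ROCm build where the "nccl" backend maps to RCCL):

# Sketch: minimal multi-GPU all-reduce check, independent of vLLM and Ray.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # On ROCm builds the "nccl" backend is backed by RCCL.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    x = torch.ones(1, device=f"cuda:{rank}")
    dist.all_reduce(x)  # every rank should end up with world_size
    print(f"rank {rank}: {x.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4  # one process per MI210
    mp.spawn(worker, args=(world_size,), nprocs=world_size)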

tom-papatheodore commented 2 months ago

Thanks, @youkaichao. I'll take a look.

hliuca commented 2 months ago

flash-attention not on the right branch/commit?

https://github.com/vllm-project/vllm/blob/v0.4.0/Dockerfile.rocm
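
One quick way to compare what is actually installed against the pins in that Dockerfile (a sketch using the standard library):

# Sketch: print installed versions of the ROCm-sensitive packages so they can
# be compared against the branches/commits pinned in Dockerfile.rocm.
from importlib.metadata import PackageNotFoundError, version

for pkg in ("flash_attn", "xformers", "vllm", "torch", "ray"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed")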