enforce eager?
Open · tom-papatheodore opened 6 months ago
When attempting to serve a model with more than one GPU, the --enforce-eager
flag only seems to stop my attempt from hanging; the server still crashes. I can serve the model from a single GPU just fine.
This is how I attempt to serve the model with multiple GPUs:
$ cat multi-gpu-vllm-server-start.sh
#!/bin/bash
python -m vllm.entrypoints.openai.api_server --model /work1/staff/papatheodore/model-serving/llama2-70b/models/Llama-2-13b-chat-hf --host 127.0.0.1 --port 8080 --tensor-parallel-size 4 --enforce-eager
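(As an aside, one way to take the API-server layer out of the equation is the offline entrypoint. A minimal sketch, assuming the same model path and vLLM 0.3.x's `LLM`/`SamplingParams` interface; if this also hangs or crashes, the problem is in the engine/Ray layer rather than the OpenAI server:)

```python
# Sketch: offline tensor-parallel smoke test, bypassing the API server.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/work1/staff/papatheodore/model-serving/llama2-70b/models/Llama-2-13b-chat-hf",
    tensor_parallel_size=4,
    enforce_eager=True,
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```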
And this is how it fails (it's unclear to me whether this is a vLLM or Ray issue):
$ ./multi-gpu-vllm-server-start.sh
INFO 04-12 21:48:39 api_server.py:229] args: Namespace(host='127.0.0.1', port=8080, allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, served_model_name=None, lora_modules=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, root_path=None, middleware=[], model='/work1/staff/papatheodore/model-serving/llama2-70b/models/Llama-2-13b-chat-hf', tokenizer=None, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', dtype='auto', kv_cache_dtype='auto', max_model_len=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=4, max_parallel_loading_workers=None, block_size=16, seed=0, swap_space=4, gpu_memory_utilization=0.9, max_num_batched_tokens=None, max_num_seqs=256, max_paddings=256, disable_log_stats=False, quantization=None, enforce_eager=True, max_context_len_to_capture=8192, disable_custom_all_reduce=False, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', max_cpu_loras=None, device='cuda', engine_use_ray=False, disable_log_requests=False, max_log_len=None)
INFO 04-12 21:48:39 config.py:400] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
2024-04-12 21:48:41,785 INFO worker.py:1724 -- Started a local Ray instance.
[2024-04-12 21:48:43,965 E 205266 205266] core_worker.cc:215: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
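(Since the crash happens at Ray worker registration, it may also be worth checking that a bare local Ray instance works at all, independently of vLLM. A minimal sketch against the ray 2.9.2 listed below:)

```python
# Sketch: start a local Ray instance and run a trivial remote task. If worker
# registration fails here too, the problem is in Ray's raylet/socket setup on
# the compute node (e.g. /tmp permissions), not in vLLM itself.
import ray

ray.init()
print(ray.cluster_resources())  # resources the raylet registered

@ray.remote
def ping():
    return "ok"

print(ray.get(ping.remote()))
ray.shutdown()
```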
And this is what's installed in my Python virtual environment (built following the vLLM ROCm docs):
$ pip list
Package Version
------------------------- --------------
aioprometheus 23.12.0
aiosignal 1.3.1
annotated-types 0.6.0
anyio 4.3.0
argon2-cffi 23.1.0
argon2-cffi-bindings 21.2.0
arrow 1.3.0
asttokens 2.4.1
async-lru 2.0.4
attrs 23.2.0
Babel 2.14.0
beautifulsoup4 4.12.3
bleach 6.1.0
certifi 2022.12.7
cffi 1.16.0
charset-normalizer 2.1.1
click 8.1.7
comm 0.2.2
debugpy 1.8.1
decorator 5.1.1
defusedxml 0.7.1
distro 1.9.0
einops 0.7.0
exceptiongroup 1.2.0
executing 2.0.1
fastapi 0.109.2
fastjsonschema 2.19.1
filelock 3.9.0
flash_attn 2.0.4
fqdn 1.5.1
frozenlist 1.4.1
fsspec 2024.2.0
h11 0.14.0
httpcore 1.0.4
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.20.3
idna 3.4
importlib_metadata 7.0.2
ipykernel 6.29.3
ipython 8.18.1
ipywidgets 8.1.2
isoduration 20.11.0
jedi 0.19.1
Jinja2 3.1.2
json5 0.9.23
jsonpointer 2.4
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
jupyter 1.0.0
jupyter_client 8.6.1
jupyter-console 6.6.3
jupyter_core 5.7.2
jupyter-events 0.9.1
jupyter-lsp 2.2.4
jupyter_server 2.13.0
jupyter_server_terminals 0.5.3
jupyterlab 4.1.5
jupyterlab_pygments 0.3.0
jupyterlab_server 2.25.4
jupyterlab_widgets 3.0.10
MarkupSafe 2.1.3
matplotlib-inline 0.1.6
mistune 3.0.2
mpmath 1.3.0
msgpack 1.0.7
nbclient 0.10.0
nbconvert 7.16.2
nbformat 5.10.3
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.1
notebook 7.1.2
notebook_shim 0.2.4
numpy 1.26.4
openai 1.14.0
orjson 3.9.14
overrides 7.7.0
packaging 23.2
pandocfilters 1.5.1
parso 0.8.3
pexpect 4.9.0
pillow 10.2.0
pip 24.0
platformdirs 4.2.0
prometheus_client 0.20.0
prompt-toolkit 3.0.43
protobuf 4.25.3
psutil 5.9.8
ptyprocess 0.7.0
pure-eval 0.2.2
pycparser 2.21
pydantic 2.6.1
pydantic_core 2.16.2
Pygments 2.17.2
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-json-logger 2.0.7
pytorch-triton-rocm 2.2.0
PyYAML 6.0.1
pyzmq 25.1.2
qtconsole 5.5.1
QtPy 2.4.1
quantile-python 1.1
ray 2.9.2
referencing 0.33.0
regex 2023.12.25
requests 2.31.0
rfc3339-validator 0.1.4
rfc3986-validator 0.1.1
rpds-py 0.18.0
safetensors 0.4.2
Send2Trash 1.8.2
sentencepiece 0.2.0
setuptools 69.1.0
six 1.16.0
sniffio 1.3.0
soupsieve 2.5
stack-data 0.6.3
starlette 0.36.3
sympy 1.12
terminado 0.18.1
tinycss2 1.2.1
tokenizers 0.15.2
tomli 2.0.1
torch 2.2.0+rocm5.7
torchaudio 2.2.0+rocm5.7
torchvision 0.17.0+rocm5.7
tornado 6.4
tqdm 4.66.2
traitlets 5.14.2
transformers 4.37.2
types-python-dateutil 2.9.0.20240316
typing_extensions 4.9.0
uri-template 1.3.0
urllib3 1.26.13
uvicorn 0.27.1
uvloop 0.19.0
vllm 0.3.1+rocm573
watchfiles 0.21.0
wcwidth 0.2.13
webcolors 1.13
webencodings 0.5.1
websocket-client 1.7.0
websockets 12.0
wheel 0.42.0
widgetsnbextension 4.0.10
xformers 0.0.23
zipp 3.18.1
This is running on a compute node with [4x] MI210 GPUs. I've tried running it with ROCm 5.7.1 and 6.0.2 with the same results. Any guidance you can offer would be greatly appreciated. Thank you.
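(One quick cross-check, my suggestion rather than something from the thread: confirm that the torch wheel's HIP build matches the system ROCm and that all four devices are visible. A sketch; `torch.version.hip` is populated on ROCm wheels such as the 2.2.0+rocm5.7 listed above:)

```python
# Sketch: report the torch ROCm build and visible devices.
import torch

print(torch.__version__)          # expect something like 2.2.0+rocm5.7
print(torch.version.hip)          # HIP version the wheel was built against
print(torch.cuda.is_available())  # ROCm devices surface through the cuda API
print(torch.cuda.device_count())  # should report 4 on a 4x MI210 node
```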
Hello. I wanted to ping this issue to see if there was any guidance on a solution or workaround. Please let me know. Thanks.
Forwarded to the AMD folks.
We also have some tips for debugging this.
Thanks, @youkaichao. I'll take a look.
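(For reference, the multi-GPU sanity check those debugging tips suggest boils down to a torch.distributed all-reduce across the GPUs. A paraphrased sketch, not the exact script; on ROCm builds the "nccl" backend is provided by RCCL:)

```python
# Sketch of a multi-GPU communication test, launched with:
#   torchrun --nproc-per-node=4 test_allreduce.py
# If this hangs or fails, the problem is below vLLM, in the GPU/RCCL stack.
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")  # maps to RCCL on ROCm
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

data = torch.ones(1, device="cuda")
dist.all_reduce(data)  # sums the tensor across all ranks
assert data.item() == dist.get_world_size()
print(f"rank {dist.get_rank()}: all_reduce OK")
dist.destroy_process_group()
```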
Could flash attention be built from the wrong branch/commit? Compare against what the ROCm Dockerfile pins:
https://github.com/vllm-project/vllm/blob/v0.4.0/Dockerfile.rocm
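(A quick way to start that comparison, as a sketch; the pinned branch/commit itself should be read from the Dockerfile, so only the locally installed side is checked here:)

```python
# Sketch: report the flash-attention build importable in this venv (pip list
# above shows flash_attn 2.0.4), then compare by hand against the ROCm
# flash-attention ref that Dockerfile.rocm checks out.
import flash_attn
print(flash_attn.__version__)
```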
Hello. Thanks for providing vLLM as a great open-source tool for inference and model serving! I was able to build vLLM on a cluster I maintain, but it only appears to work on a single MI210 GPU. Can someone please help me with this issue? The details of my attempt are as follows...
This is how I built vLLM on a node with [2x] 64-core AMD EPYC CPUs and [4x] AMD MI210 GPUs, with ROCm 5.7.1 installed:
So everything is built at this point; time to test!
NOTES:
- I had to kill <pid> to exit the hanging process, which accounts for the *** SIGTERM received at time=1708440012 on cpu 106 *** line, followed by:
  [2024-02-20 08:34:52,678 E 3089178 3089178] core_worker.cc:215: Failed to register worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff to Raylet. IOError: [RayletClient] Unable to register worker with raylet. No such file or directory
- I also tried explicitly setting HIP_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES, and even CUDA_VISIBLE_DEVICES, but it didn't change the result (a quick device-visibility check is sketched after this post).

I would be grateful for any help you could offer on resolving this issue. Thank you :)
-Tom
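(Regarding the visibility variables in the notes above, here is a minimal sketch for checking what each one actually exposes to a process. They must be set before the HIP runtime initializes, i.e. before torch is imported; HIP_VISIBLE_DEVICES is the variable the ROCm runtime honors, and CUDA_VISIBLE_DEVICES is typically accepted as an alias:)

```python
# Sketch: set the visibility variable before importing torch, then count the
# devices the runtime actually exposes.
import os
os.environ["HIP_VISIBLE_DEVICES"] = "0,1,2,3"  # must precede torch import

import torch
print(torch.cuda.device_count())  # expect 4 on a fully visible MI210 node
```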