vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai

RuntimeError on ROCm #2580

Open rlrs opened 5 months ago

rlrs commented 5 months ago

Example command: python benchmark_throughput.py --model gpt2 --input-len 256 --output-len 256

Output:

INFO 01-24 14:52:52 llm_engine.py:72] Initializing an LLM engine with config: model='gpt2', tokenizer='gpt2', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=1024, download_dir=None, load_format=auto, tensor_parallel_size=1, quantization=None, enforce_eager=False, seed=0)
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.3.0.dev20240123+rocm5.7)
    Python  3.10.13 (you have 3.10.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
INFO 01-24 14:52:55 weight_utils.py:164] Using model weights format ['*.safetensors']
Traceback (most recent call last):
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/benchmark_throughput.py", line 318, in <module>
    main(args)
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/benchmark_throughput.py", line 205, in main
    elapsed_time = run_vllm(requests, args.model, args.tokenizer,
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/benchmark_throughput.py", line 76, in run_vllm
    llm = LLM(
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/.venv/lib/python3.10/site-packages/vllm-0.2.7+rocm573-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 106, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/.venv/lib/python3.10/site-packages/vllm-0.2.7+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 350, in from_engine_args
    engine = cls(*engine_configs,
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/.venv/lib/python3.10/site-packages/vllm-0.2.7+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/.venv/lib/python3.10/site-packages/vllm-0.2.7+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 303, in _init_cache
    num_blocks = self._run_workers(
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/.venv/lib/python3.10/site-packages/vllm-0.2.7+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 977, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/.venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/.venv/lib/python3.10/site-packages/vllm-0.2.7+rocm573-py3.10-linux-x86_64.egg/vllm/worker/worker.py", line 116, in profile_num_available_blocks
    free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
  File "/scratch/project_465000670/danish-foundation-models/scripts/lumi/eval/.venv/lib/python3.10/site-packages/torch/cuda/memory.py", line 655, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
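
The failing call can be isolated from vLLM; a minimal check along these lines (a sketch, assuming the same container and virtual environment) should fail the same way if the problem is in the PyTorch/ROCm setup rather than in vLLM itself:

import torch

# Same call that raises inside vllm/worker/worker.py (profile_num_available_blocks).
print("HIP runtime:", torch.version.hip)
print("Device:", torch.cuda.get_device_name(0))
free, total = torch.cuda.mem_get_info()
print(f"free={free / 2**30:.1f} GiB, total={total / 2**30:.1f} GiB")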

Installed packages:

accelerate                0.26.1
aiohttp                   3.9.1
aioprometheus             23.12.0
aiosignal                 1.3.1
annotated-types           0.6.0
anyio                     4.2.0
async-timeout             4.0.3
attrs                     23.2.0
bert-score                0.3.13
bitsandbytes              0.42.0
certifi                   2022.12.7
charset-normalizer        2.1.1
chex                      0.1.85
click                     8.1.7
cmake                     3.28.1
contourpy                 1.2.0
cycler                    0.12.1
datasets                  2.16.1
demjson3                  3.0.6
dill                      0.3.7
einops                    0.7.0
etils                     1.6.0
evaluate                  0.4.1
exceptiongroup            1.2.0
fastapi                   0.109.0
filelock                  3.9.0
flash-attn                2.0.4
flax                      0.8.0
fonttools                 4.47.2
frozenlist                1.4.1
fsspec                    2023.10.0
h11                       0.14.0
httptools                 0.6.1
huggingface-hub           0.20.3
idna                      3.4
importlib-resources       6.1.1
interegular               0.3.3
jax                       0.4.23
jaxlib                    0.4.23
Jinja2                    3.1.2
joblib                    1.3.2
jsonschema                4.21.1
jsonschema-specifications 2023.12.1
kiwisolver                1.4.5
Levenshtein               0.23.0
lm-format-enforcer        0.8.2
markdown-it-py            3.0.0
MarkupSafe                2.1.3
matplotlib                3.8.2
mdurl                     0.1.2
ml-dtypes                 0.3.2
mpmath                    1.2.1
msgpack                   1.0.7
multidict                 6.0.4
multiprocess              0.70.15
nest-asyncio              1.6.0
networkx                  3.0rc1
ninja                     1.11.1.1
nltk                      3.8.1
numpy                     1.26.3
openai                    0.28.1
opt-einsum                3.3.0
optax                     0.1.8
orbax-checkpoint          0.5.1
orjson                    3.9.12
packaging                 23.2
pandas                    1.5.3
Pillow                    9.3.0
pip                       23.3.2
protobuf                  3.20.3
psutil                    5.9.8
pyarrow                   14.0.2
pyarrow-hotfix            0.6
pydantic                  2.5.3
pydantic_core             2.14.6
Pygments                  2.17.2
pyinfer                   0.0.3
pyparsing                 3.1.1
python-dateutil           2.8.2
python-dotenv             0.21.1
pytorch-triton-rocm       2.2.0+dafe145982
pytz                      2023.3.post1
PyYAML                    6.0.1
quantile-python           1.1
rapidfuzz                 3.6.1
ray                       2.9.1
referencing               0.32.1
regex                     2023.12.25
requests                  2.31.0
responses                 0.18.0
rich                      13.7.0
rouge_score               0.1.2
rpds-py                   0.17.1
sacremoses                0.1.1
safetensors               0.4.1
scandeval                 9.2.0
scikit-learn              1.4.0
scipy                     1.12.0
sentencepiece             0.1.99
seqeval                   1.2.2
setuptools                65.5.0
six                       1.16.0
sniffio                   1.3.0
starlette                 0.35.1
sympy                     1.11.1
tabulate                  0.9.0
tensorstore               0.1.52
termcolor                 2.4.0
threadpoolctl             3.2.0
tiktoken                  0.5.2
tokenizers                0.15.1
toolz                     0.12.1
torch                     2.3.0.dev20240123+rocm5.7
torchaudio                2.2.0.dev20240123+rocm5.7
torchvision               0.18.0.dev20240123+rocm5.7
tqdm                      4.66.1
transformers              4.37.0
typing_extensions         4.9.0
urllib3                   1.26.13
uvicorn                   0.27.0
uvloop                    0.19.0
vllm                      0.2.7+rocm573
watchfiles                0.21.0
websockets                12.0
xformers                  0.0.23
xxhash                    3.4.1
yarl                      1.9.4
zipp                      3.17.0

This is running in the rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 container on a node with MI250X GPUs.

double-vin commented 4 months ago

I also encountered this problem; as a workaround I manually modified free_gpu_memory and total_gpu_memory.
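
For reference, a rough sketch of that kind of workaround for a single-process (tensor_parallel_size=1) run; the byte values below are placeholders, not measured numbers:

import torch

# Placeholder numbers: a single MI250X GCD has 64 GB of HBM; adjust to the actual GPU.
FREE_BYTES = 60 * 2**30
TOTAL_BYTES = 64 * 2**30

# Override the call that fails in vllm/worker/worker.py so the cache-profiling step can proceed.
torch.cuda.mem_get_info = lambda device=None: (FREE_BYTES, TOTAL_BYTES)

from vllm import LLM  # patch before constructing the LLM so the driver worker sees the override
llm = LLM(model="gpt2")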

saattrupdan commented 4 months ago

This seems similar-ish to this issue. Can you see if any of these, or combinations of them, work?

export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.1
export HIP_VISIBLE_DEVICES=0
export ROCM_PATH=/opt/rocm
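
If it is more convenient than exporting them in the job script, the runtime variables can also be set from Python before torch is imported (a sketch; as far as I can tell PYTORCH_ROCM_ARCH only matters when building kernels from source, and gfx1031 / 10.3.1 target RDNA2 cards rather than an MI250X):

import os

# Must be set before torch initializes the ROCm runtime.
os.environ["HSA_OVERRIDE_GFX_VERSION"] = "10.3.1"  # value from the suggestion above
os.environ["HIP_VISIBLE_DEVICES"] = "0"
os.environ["ROCM_PATH"] = "/opt/rocm"

import torch
print(torch.cuda.get_device_name(0))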

rlrs commented 4 months ago

Looking more into this, HSA_OVERRIDE_GFX_VERSION does impact what happens. Given that the MI250X is on the gfx90a architecture, I tried HSA_OVERRIDE_GFX_VERSION=9.0.0, which at least gives a different error:

HSA_OVERRIDE_GFX_VERSION=9.0.0 python benchmark_throughput.py --model EleutherAI/pythia-70m --input-len 256 --output-len 256 --num-prompts 100 --backend vllm
Namespace(backend='vllm', dataset=None, input_len=256, output_len=256, model='EleutherAI/pythia-70m', tokenizer='EleutherAI/pythia-70m', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 02-08 16:24:00 config.py:393] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 02-08 16:24:00 llm_engine.py:73] Initializing an LLM engine with config: model='EleutherAI/pythia-70m', tokenizer='EleutherAI/pythia-70m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.3.0.dev20240207+rocm5.7)
    Python  3.10.13 (you have 3.10.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
Memory access fault by GPU node-4 (Agent handle: 0x907d170) on address 0x15236fcae000. Reason: Unknown.
Aborted

Alternatively, with HSA_OVERRIDE_GFX_VERSION=9.0.2 I seem to get further:

HSA_OVERRIDE_GFX_VERSION=9.0.2 python benchmark_throughput.py --model EleutherAI/pythia-70m --input-len 256 --output-len 256 --num-prompts 100 --backend vllm
Namespace(backend='vllm', dataset=None, input_len=256, output_len=256, model='EleutherAI/pythia-70m', tokenizer='EleutherAI/pythia-70m', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 02-08 16:25:49 config.py:393] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 02-08 16:25:49 llm_engine.py:73] Initializing an LLM engine with config: model='EleutherAI/pythia-70m', tokenizer='EleutherAI/pythia-70m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2128, in all_reduce
[rank0]:     work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2006, unhandled cuda error, NCCL version 2.17.1
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid kernel file'

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/benchmark_throughput.py", line 318, in <module>
[rank0]:     main(args)
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/benchmark_throughput.py", line 205, in main
[rank0]:     elapsed_time = run_vllm(requests, args.model, args.tokenizer,
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/benchmark_throughput.py", line 76, in run_vllm
[rank0]:     llm = LLM(
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 109, in __init__
[rank0]:     self.llm_engine = LLMEngine.from_engine_args(engine_args)
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 361, in from_engine_args
[rank0]:     engine = cls(*engine_configs,
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 114, in __init__
[rank0]:     self._init_workers()
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 153, in _init_workers
[rank0]:     self._run_workers("init_model")
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 989, in _run_workers
[rank0]:     driver_worker_output = getattr(self.driver_worker,
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/worker/worker.py", line 90, in init_model
[rank0]:     init_distributed_environment(self.parallel_config, self.rank,
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/worker/worker.py", line 259, in init_distributed_environment
[rank0]:     torch.distributed.all_reduce(torch.zeros(1).cuda())
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 77, in wrapper
[rank0]:     msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 50, in _get_msg_dict
[rank0]:     "args": f"{args}, {kwargs}",
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 463, in __repr__
[rank0]:     return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 677, in _str
[rank0]:     return _str_intern(self, tensor_contents=tensor_contents)
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 597, in _str_intern
[rank0]:     tensor_str = _tensor_str(self, indent)
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
[rank0]:     formatter = _Formatter(get_summarized_data(self) if summarize else self)
[rank0]:   File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 138, in __init__
[rank0]:     tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
[rank0]: RuntimeError: HIP error: invalid device function
[rank0]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing AMD_SERIALIZE_KERNEL=3.
[rank0]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
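
It may also be worth checking what architecture the ROCm build of PyTorch actually reports for the device before overriding anything; a quick sketch (gcnArchName may not exist on every build, hence the getattr):

import torch

props = torch.cuda.get_device_properties(0)
print("Device:", props.name)
# Recent ROCm builds of PyTorch expose the GPU target here; on an MI250X it should
# report gfx90a. If the installed vLLM/PyTorch kernels were not built for that target,
# "invalid device function" errors like the one above are expected.
print("Arch:", getattr(props, "gcnArchName", "not reported by this build"))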

saattrupdan commented 4 months ago

Progress! What happens if you set AMD_SERIALIZE_KERNEL=3? Maybe we'll get a more informative error.
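
For example, prefixed onto the same command as above:

AMD_SERIALIZE_KERNEL=3 HSA_OVERRIDE_GFX_VERSION=9.0.2 python benchmark_throughput.py --model EleutherAI/pythia-70m --input-len 256 --output-len 256 --num-prompts 100 --backend vllm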

hmellor commented 2 months ago

@rlrs has this issue been resolved now?

linchen111 commented 20 hours ago

Has this issue been resolved now?