rlrs opened this issue 5 months ago
I also encountered this problem; as a workaround, I manually modified the free_gpu_memory and total_gpu_memory values.
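For reference, this is roughly the kind of local change I mean. The exact spot in vLLM's worker code and the hard-coded sizes are assumptions on my part, so treat it as a sketch rather than a proper fix:

```python
import torch

def get_gpu_memory_info(fallback_total_bytes: int = 64 * 1024**3):
    """Return (free, total) GPU memory in bytes.

    Falls back to a hard-coded value when torch.cuda.mem_get_info()
    misreports on this ROCm setup. The 64 GiB default is only a
    placeholder for a single MI250X GCD; adjust it to your hardware.
    """
    try:
        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
        if total_gpu_memory > 0:
            return free_gpu_memory, total_gpu_memory
    except RuntimeError:
        pass
    return fallback_total_bytes, fallback_total_bytes
```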
This seems similar-ish to this issue. Can you see if any of these, or combinations of them, work?
```
export PYTORCH_ROCM_ARCH="gfx1031"
export HSA_OVERRIDE_GFX_VERSION=10.3.1
export HIP_VISIBLE_DEVICES=0
export ROCM_PATH=/opt/rocm
```
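It may also be worth confirming what the ROCm build of PyTorch actually sees before and after setting these overrides. A quick sanity check along these lines (just a sketch) should print the HIP runtime version and the device it detects:

```python
# Quick sanity check of what this PyTorch build sees on the node.
import torch

print("torch:", torch.__version__)
print("HIP runtime:", torch.version.hip)  # None on a CUDA-only build
print("device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("device 0:", torch.cuda.get_device_properties(0))
```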
Looking more into this, HSA_OVERRIDE_GFX_VERSION does impact what happens. Given that the MI250X is on the gfx90a architecture, I tried HSA_OVERRIDE_GFX_VERSION=9.0.0, which at least gives another error:
HSA_OVERRIDE_GFX_VERSION=9.0.0 python benchmark_throughput.py --model EleutherAI/pythia-70m --input-len 256 --output-len 256 --num-prompts 100 --backend vllm
Namespace(backend='vllm', dataset=None, input_len=256, output_len=256, model='EleutherAI/pythia-70m', tokenizer='EleutherAI/pythia-70m', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 02-08 16:24:00 config.py:393] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 02-08 16:24:00 llm_engine.py:73] Initializing an LLM engine with config: model='EleutherAI/pythia-70m', tokenizer='EleutherAI/pythia-70m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.3.0.dev20240207+rocm5.7)
Python 3.10.13 (you have 3.10.13)
Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
Memory-efficient attention, SwiGLU, sparse and more won't be available.
Set XFORMERS_MORE_DETAILS=1 for more details
Memory access fault by GPU node-4 (Agent handle: 0x907d170) on address 0x15236fcae000. Reason: Unknown.
Aborted
Alternatively, with HSA_OVERRIDE_GFX_VERSION=9.0.2 I seem to get further:
HSA_OVERRIDE_GFX_VERSION=9.0.2 python benchmark_throughput.py --model EleutherAI/pythia-70m --input-len 256 --output-len 256 --num-prompts 100 --backend vllm
Namespace(backend='vllm', dataset=None, input_len=256, output_len=256, model='EleutherAI/pythia-70m', tokenizer='EleutherAI/pythia-70m', quantization=None, tensor_parallel_size=1, n=1, use_beam_search=False, num_prompts=100, seed=0, hf_max_batch_size=None, trust_remote_code=False, max_model_len=None, dtype='auto', enforce_eager=False)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 02-08 16:25:49 config.py:393] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs.
INFO 02-08 16:25:49 llm_engine.py:73] Initializing an LLM engine with config: model='EleutherAI/pythia-70m', tokenizer='EleutherAI/pythia-70m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
[rank0]: Traceback (most recent call last):
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 75, in wrapper
[rank0]: return func(*args, **kwargs)
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 2128, in all_reduce
[rank0]: work = group.allreduce([tensor], opts)
[rank0]: torch.distributed.DistBackendError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2006, unhandled cuda error, NCCL version 2.17.1
[rank0]: ncclUnhandledCudaError: Call to CUDA function failed.
[rank0]: Last error:
[rank0]: Cuda failure 'invalid kernel file'
[rank0]: During handling of the above exception, another exception occurred:
[rank0]: Traceback (most recent call last):
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/benchmark_throughput.py", line 318, in <module>
[rank0]: main(args)
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/benchmark_throughput.py", line 205, in main
[rank0]: elapsed_time = run_vllm(requests, args.model, args.tokenizer,
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/benchmark_throughput.py", line 76, in run_vllm
[rank0]: llm = LLM(
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 109, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(engine_args)
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 361, in from_engine_args
[rank0]: engine = cls(*engine_configs,
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 114, in __init__
[rank0]: self._init_workers()
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 153, in _init_workers
[rank0]: self._run_workers("init_model")
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 989, in _run_workers
[rank0]: driver_worker_output = getattr(self.driver_worker,
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/worker/worker.py", line 90, in init_model
[rank0]: init_distributed_environment(self.parallel_config, self.rank,
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/vllm-0.3.0+rocm573-py3.10-linux-x86_64.egg/vllm/worker/worker.py", line 259, in init_distributed_environment
[rank0]: torch.distributed.all_reduce(torch.zeros(1).cuda())
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 77, in wrapper
[rank0]: msg_dict = _get_msg_dict(func.__name__, *args, **kwargs)
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 50, in _get_msg_dict
[rank0]: "args": f"{args}, {kwargs}",
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor.py", line 463, in __repr__
[rank0]: return torch._tensor_str._str(self, tensor_contents=tensor_contents)
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 677, in _str
[rank0]: return _str_intern(self, tensor_contents=tensor_contents)
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 597, in _str_intern
[rank0]: tensor_str = _tensor_str(self, indent)
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 349, in _tensor_str
[rank0]: formatter = _Formatter(get_summarized_data(self) if summarize else self)
[rank0]: File "/scratch/project_465000670/danish-foundation-models/evaluation/.venv/lib/python3.10/site-packages/torch/_tensor_str.py", line 138, in __init__
[rank0]: tensor_view, torch.isfinite(tensor_view) & tensor_view.ne(0)
[rank0]: RuntimeError: HIP error: invalid device function
[rank0]: HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing AMD_SERIALIZE_KERNEL=3.
[rank0]: Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
Progress! What happens if you set AMD_SERIALIZE_KERNEL=3? Maybe we'll get a more informative error.
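It might also help to take vLLM out of the picture and run just the call that fails in your traceback (the torch.distributed.all_reduce in worker.py). Something like this minimal single-process sketch should reproduce it if the problem is in PyTorch/RCCL rather than in vLLM; the master address and port are placeholders:

```python
# Minimal, single-process reproduction of the all_reduce that fails in
# vllm/worker/worker.py's init_distributed_environment (see traceback above).
import os
import torch
import torch.distributed as dist

os.environ.setdefault("MASTER_ADDR", "127.0.0.1")  # placeholder values
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(backend="nccl", rank=0, world_size=1)  # "nccl" maps to RCCL on ROCm
x = torch.zeros(1).cuda()
dist.all_reduce(x)  # the call that errors with "invalid kernel file" in the log above
print("all_reduce succeeded:", x)
dist.destroy_process_group()
```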
@rlrs has this issue been resolved now?
Example of command:
python benchmark_throughput.py --model gpt2 --input-len 256 --output-len 256
Output:
Installed packages:
This is running in the rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 container on a node with MI250X GPUs.
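For what it's worth, the GPU architectures the installed PyTorch build ships kernels for can be listed with something like this (a sketch; on a gfx90a system the list should include gfx90a):

```python
# List the gfx targets this PyTorch build was compiled for.
import torch

print(torch.cuda.get_arch_list())  # expect an entry like 'gfx90a' for MI250X
```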