vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

RuntimeError while running any model with embeddedllminfo/vllm-rocm:vllm-v0.2.4 image and rocm5.7 (rhel 8.7) #3122

Open AjayKadoula opened 4 months ago

AjayKadoula commented 4 months ago

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="openlm-research/open_llama_13b")

config.json: 100%|██████████| 507/507 [00:00<00:00, 4.62MB/s]
INFO 02-20 07:45:55 llm_engine.py:73] Initializing an LLM engine with config: model='openlm-research/open_llama_13b', tokenizer='openlm-research/open_llama_13b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=pt, tensor_parallel_size=1, quantization=None, seed=0)
INFO 02-20 07:45:55 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
tokenizer_config.json: 100%|██████████| 593/593 [00:00<00:00, 5.34MB/s]
tokenizer.model: 100%|██████████| 534k/534k [00:01<00:00, 520kB/s]
special_tokens_map.json: 100%|██████████| 330/330 [00:00<00:00, 3.00MB/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.0.1+gita61a294)
    Python  3.10.13 (you have 3.10.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
MegaBlocks not found. Please install it by pip install megablocks.
STK not found: please see https://github.com/stanford-futuredata/stk
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
pytorch_model-00003-of-00003.bin: 100%|██████████| 6.18G/6.18G [13:52<00:00, 7.42MB/s]
pytorch_model-00002-of-00003.bin: 100%|██████████| 9.90G/9.90G [21:31<00:00, 7.67MB/s]
pytorch_model-00001-of-00003.bin: 100%|██████████| 9.95G/9.95G [22:26<00:00, 7.39MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 246, in from_engine_args
    engine = cls(*engine_configs,
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 208, in _init_cache
    num_blocks = self._run_workers(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/worker/worker.py", line 91, in profile_num_available_blocks
    free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 618, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

System config (from hostnamectl):
Operating System: Red Hat Enterprise Linux 8.7 (Ootpa)
Kernel: Linux 4.18.0-425.3.1.el8.x86_64
Architecture: x86-64

ROCm driver: 5.7.0
AMD driver: 5.7.0
vLLM container version: embeddedllminfo/vllm-rocm:vllm-v0.2.4
OS: RHEL 8.7
GPU: MI210

The same configuration works on RHEL 8.8; it fails only on RHEL 8.7.
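The traceback above ends in torch.cuda.mem_get_info(), so the failure can probably be reproduced without loading a model or importing vLLM at all. A minimal sketch of such a check follows; the function name probe_mem_get_info and the returned dictionary layout are illustrative choices, not part of vLLM or PyTorch.

```python
def probe_mem_get_info(device=0):
    """Call torch.cuda.mem_get_info() in isolation and report what happened.

    Hypothetical helper for narrowing down the bug; only the torch calls
    themselves are real PyTorch API.
    """
    try:
        import torch
    except ImportError:
        return {"status": "no_torch"}

    # On ROCm builds of PyTorch, torch.cuda is backed by HIP.
    if not torch.cuda.is_available():
        return {"status": "no_device"}

    try:
        free_bytes, total_bytes = torch.cuda.mem_get_info(device)
    except RuntimeError as exc:
        # This is where "HIP error: invalid argument" would surface.
        return {"status": "error", "message": str(exc)}

    return {"status": "ok", "free_bytes": free_bytes, "total_bytes": total_bytes}


if __name__ == "__main__":
    print(probe_mem_get_info())
```

If this reports an error inside the container on RHEL 8.7 but not on RHEL 8.8, that would point at the driver or runtime stack on the host rather than at vLLM itself.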

yunzhongOvO commented 3 months ago

same problem on same gpu... any progress?

AjayKadoula commented 2 months ago

I am facing the same issue on Ubuntu as well. Here is the output with AMD_LOG_LEVEL=3:

config.json: 100%|██████████| 651/651 [00:00<00:00, 252kB/s]
INFO 04-19 04:44:48 llm_engine.py:79] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
tokenizer_config.json: 100%|██████████| 685/685 [00:00<00:00, 287kB/s]
vocab.json: 100%|██████████| 899k/899k [00:00<00:00, 1.19MB/s]
merges.txt: 100%|██████████| 456k/456k [00:00<00:00, 20.5MB/s]
special_tokens_map.json: 100%|██████████| 441/441 [00:00<00:00, 646kB/s]
:3:rocdevice.cpp :445 : 3852326123 us: [pid:9 tid:0x7fcd5fa0c4c0] Initializing HSA stack.
:3:comgrctx.cpp :33 : 3852378915 us: [pid:9 tid:0x7fcd5fa0c4c0] Loading COMGR library.
:3:rocdevice.cpp :211 : 3852378983 us: [pid:9 tid:0x7fcd5fa0c4c0] Numa selects cpu agent[0]=0x859e1f0(fine=0x7c1f0a0,coarse=0x96cc5f0) for gpu agent=0x96cb260 CPU<->GPU XGMI=0
:3:rocdevice.cpp :1715: 3852379594 us: [pid:9 tid:0x7fcd5fa0c4c0] Gfx Major/Minor/Stepping: 9/0/10
:3:rocdevice.cpp :1717: 3852379601 us: [pid:9 tid:0x7fcd5fa0c4c0] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp :1719: 3852379605 us: [pid:9 tid:0x7fcd5fa0c4c0] Max SDMA Read Mask: 0x1e, Max SDMA Write Mask: 0x1f
:3:hip_context.cpp :48 : 3852380443 us: [pid:9 tid:0x7fcd5fa0c4c0] Direct Dispatch: 1
:3:hip_device_runtime.cpp :637 : 3852919412 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c6160 )
:3:hip_device_runtime.cpp :639 : 3852919436 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919489 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7fccaafe1f14 )
:3:hip_device_runtime.cpp :639 : 3852919494 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device.cpp :463 : 3852919500 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevicePropertiesR0600 ( 0x7ffc2e1c5bd8, 0 )
:3:hip_device.cpp :465 : 3852919507 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevicePropertiesR0600: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919622 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c6198 )
:3:hip_device_runtime.cpp :639 : 3852919626 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852919647 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5f04 )
:3:hip_device_runtime.cpp :630 : 3852919652 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919658 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c5c80 )
:3:hip_device_runtime.cpp :639 : 3852919662 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920392 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d18, 0x7ffc2e1c5d1c )
:3:hip_context.cpp :358 : 3852920400 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852920405 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5f64 )
:3:hip_device_runtime.cpp :630 : 3852920409 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920414 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d78, 0x7ffc2e1c5d7c )
:3:hip_context.cpp :358 : 3852920418 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852920425 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5ef4 )
:3:hip_device_runtime.cpp :630 : 3852920429 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_context.cpp :344 : 3852920432 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d08, 0x7ffc2e1c5d0c )
:3:hip_context.cpp :358 : 3852920436 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921568 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c6644 )
:3:hip_device_runtime.cpp :630 : 3852921575 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921698 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c64b4 )
:3:hip_device_runtime.cpp :630 : 3852921701 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921726 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c62c0 )
:3:hip_device_runtime.cpp :630 : 3852921730 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp :764 : 3852921741 us: [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo ( 0x7ffc2e1c6298, 0x7ffc2e1c62a0 )
:1:rocdevice.cpp :1824: 3852921762 us: [pid:9 tid:0x7fcd5fa0c4c0] HSA_AMD_AGENT_INFO_MEMORY_AVAIL query failed.
:3:hip_memory.cpp :777 : 3852921767 us: [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo: Returned hipErrorInvalidValue :
:3:hip_error.cpp :35 : 3852921769 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetLastError ( )
:3:hip_device_runtime.cpp :652 : 3852922327 us: [pid:9 tid:0x7fcd5fa0c4c0] hipSetDevice ( 0 )
:3:hip_device_runtime.cpp :656 : 3852922332 us: [pid:9 tid:0x7fcd5fa0c4c0] hipSetDevice: Returned hipSuccess :
Traceback (most recent call last):
  File "/app/model/vllm_example.py", line 11, in <module>
    llm = LLM(model="facebook/opt-125m")
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 371, in from_engine_args
    engine = cls(*engine_configs,
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 120, in __init__
    self._init_workers()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 163, in _init_workers
    self._run_workers("init_model")
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 1014, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/worker/worker.py", line 89, in init_model
    self.init_gpu_memory = torch.cuda.mem_get_info()[0]
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/cuda/memory.py", line 663, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

:1:hip_fatbin.cpp :83 : 3853425875 us: [pid:9 tid:0x7fcd5fa0c4c0] All Unique FDs are closed
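In a long AMD_LOG_LEVEL=3 trace like this one, the interesting entry is the single HIP call that does not return hipSuccess (here, hipMemGetInfo returning hipErrorInvalidValue right after the HSA_AMD_AGENT_INFO_MEMORY_AVAIL query fails). A small filter can pull such entries out of a saved log. This is only a sketch that assumes the "name: Returned status" layout seen in this log; the function name failed_hip_calls is made up for illustration.

```python
import re

# Matches entries such as:
#   [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo: Returned hipErrorInvalidValue :
# The layout is assumed from the log in this thread, not from a documented format.
RETURN_RE = re.compile(r"\]\s*(\w+): Returned (\w+)")


def failed_hip_calls(log_text):
    """Return (api_name, status) pairs for every HIP call that did not succeed."""
    return [
        (name, status)
        for name, status in RETURN_RE.findall(log_text)
        if status != "hipSuccess"
    ]


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py < saved_hip_log.txt
    for name, status in failed_hip_calls(sys.stdin.read()):
        print(f"{name} -> {status}")
```

Because the pattern is applied with findall over the whole text, it also works when the log has been pasted as one long line.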

gopikrishnan92 commented 3 weeks ago

is it solved?

linchen111 commented 3 days ago

is it solved?