vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

RuntimeError while running any model with embeddedllminfo/vllm-rocm:vllm-v0.2.4 image and rocm5.7 (rhel 8.7) #3122

Open AjayKadoula opened 8 months ago

AjayKadoula commented 8 months ago

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]

sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="openlm-research/open_llama_13b")

config.json: 100%|████████████████████████████████████| 507/507 [00:00<00:00, 4.62MB/s]
INFO 02-20 07:45:55 llm_engine.py:73] Initializing an LLM engine with config: model='openlm-research/open_llama_13b', tokenizer='openlm-research/open_llama_13b', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=pt, tensor_parallel_size=1, quantization=None, seed=0)
INFO 02-20 07:45:55 tokenizer.py:32] For some LLaMA V1 models, initializing the fast tokenizer may take a long time. To reduce the initialization time, consider using 'hf-internal-testing/llama-tokenizer' instead of the original tokenizer.
tokenizer_config.json: 100%|████████████████████████████████████| 593/593 [00:00<00:00, 5.34MB/s]
tokenizer.model: 100%|████████████████████████████████████| 534k/534k [00:01<00:00, 520kB/s]
special_tokens_map.json: 100%|████████████████████████████████████| 330/330 [00:00<00:00, 3.00MB/s]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:
    PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.0.1+gita61a294)
    Python 3.10.13 (you have 3.10.13)
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)
  Memory-efficient attention, SwiGLU, sparse and more won't be available.
  Set XFORMERS_MORE_DETAILS=1 for more details
MegaBlocks not found. Please install it by pip install megablocks.
STK not found: please see https://github.com/stanford-futuredata/stk
/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/__init__.py:546: UserWarning: Can't initialize NVML
  warnings.warn("Can't initialize NVML")
pytorch_model-00003-of-00003.bin: 100%|████████████████████████████████████| 6.18G/6.18G [13:52<00:00, 7.42MB/s]
pytorch_model-00002-of-00003.bin: 100%|████████████████████████████████████| 9.90G/9.90G [21:31<00:00, 7.67MB/s]
pytorch_model-00001-of-00003.bin: 100%|████████████████████████████████████| 9.95G/9.95G [22:26<00:00, 7.39MB/s]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 246, in from_engine_args
    engine = cls(*engine_configs,
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 112, in __init__
    self._init_cache()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 208, in _init_cache
    num_blocks = self._run_workers(
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 750, in _run_workers
    self._run_workers_in_batch(workers, method, *args, **kwargs))
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/engine/llm_engine.py", line 724, in _run_workers_in_batch
    output = executor(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/vllm-0.2.4+rocm573-py3.10-linux-x86_64.egg/vllm/worker/worker.py", line 91, in profile_num_available_blocks
    free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/cuda/memory.py", line 618, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.
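The crash happens in vLLM's GPU memory profiling step, before any model execution: profile_num_available_blocks calls torch.cuda.mem_get_info(), which on ROCm builds of PyTorch is backed by hipMemGetInfo. A minimal sketch (assuming the same container and PyTorch ROCm build) to check whether that single call already fails, independent of vLLM:

# Minimal check of the call that fails inside vllm/worker/worker.py.
# On ROCm builds of PyTorch, torch.cuda.mem_get_info() goes through hipMemGetInfo.
import torch

print("device count:", torch.cuda.device_count())
print("device name :", torch.cuda.get_device_name(0))
try:
    free, total = torch.cuda.mem_get_info()  # raises "HIP error: invalid argument" in the traceback above
    print(f"free={free / 2**30:.2f} GiB  total={total / 2**30:.2f} GiB")
except RuntimeError as exc:
    print("torch.cuda.mem_get_info() failed:", exc)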

System config (hostnamectl):
Operating System: Red Hat Enterprise Linux 8.7 (Ootpa)
Kernel: Linux 4.18.0-425.3.1.el8.x86_64
Architecture: x86-64

ROCm driver: 5.7.0
AMD driver: 5.7.0
vLLM container: embeddedllminfo/vllm-rocm:vllm-v0.2.4
OS: RHEL 8.7
GPU: MI210

The same configuration works with RHEL 8.8, but it does not work with RHEL 8.7.

yunzhongOvO commented 8 months ago

Same problem on the same GPU... any progress?

AjayKadoula commented 7 months ago

The same issue occurs on Ubuntu as well. Output with AMD_LOG_LEVEL=3:

config.json: 100%|████████████████████████████████████| 651/651 [00:00<00:00, 252kB/s]
INFO 04-19 04:44:48 llm_engine.py:79] Initializing an LLM engine with config: model='facebook/opt-125m', tokenizer='facebook/opt-125m', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
tokenizer_config.json: 100%|████████████████████████████████████| 685/685 [00:00<00:00, 287kB/s]
vocab.json: 100%|████████████████████████████████████| 899k/899k [00:00<00:00, 1.19MB/s]
merges.txt: 100%|████████████████████████████████████| 456k/456k [00:00<00:00, 20.5MB/s]
special_tokens_map.json: 100%|████████████████████████████████████| 441/441 [00:00<00:00, 646kB/s]
:3:rocdevice.cpp          :445 : 3852326123 us: [pid:9 tid:0x7fcd5fa0c4c0] Initializing HSA stack.
:3:comgrctx.cpp           :33  : 3852378915 us: [pid:9 tid:0x7fcd5fa0c4c0] Loading COMGR library.
:3:rocdevice.cpp          :211 : 3852378983 us: [pid:9 tid:0x7fcd5fa0c4c0] Numa selects cpu agent[0]=0x859e1f0(fine=0x7c1f0a0,coarse=0x96cc5f0) for gpu agent=0x96cb260 CPU<->GPU XGMI=0
:3:rocdevice.cpp          :1715: 3852379594 us: [pid:9 tid:0x7fcd5fa0c4c0] Gfx Major/Minor/Stepping: 9/0/10
:3:rocdevice.cpp          :1717: 3852379601 us: [pid:9 tid:0x7fcd5fa0c4c0] HMM support: 1, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp          :1719: 3852379605 us: [pid:9 tid:0x7fcd5fa0c4c0] Max SDMA Read Mask: 0x1e, Max SDMA Write Mask: 0x1f
:3:hip_context.cpp        :48  : 3852380443 us: [pid:9 tid:0x7fcd5fa0c4c0] Direct Dispatch: 1
:3:hip_device_runtime.cpp :637 : 3852919412 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c6160 )
:3:hip_device_runtime.cpp :639 : 3852919436 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919489 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7fccaafe1f14 )
:3:hip_device_runtime.cpp :639 : 3852919494 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device.cpp         :463 : 3852919500 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevicePropertiesR0600 ( 0x7ffc2e1c5bd8, 0 )
:3:hip_device.cpp         :465 : 3852919507 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevicePropertiesR0600: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919622 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c6198 )
:3:hip_device_runtime.cpp :639 : 3852919626 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852919647 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5f04 )
:3:hip_device_runtime.cpp :630 : 3852919652 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :637 : 3852919658 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount ( 0x7ffc2e1c5c80 )
:3:hip_device_runtime.cpp :639 : 3852919662 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDeviceCount: Returned hipSuccess :
:3:hip_context.cpp        :344 : 3852920392 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d18, 0x7ffc2e1c5d1c )
:3:hip_context.cpp        :358 : 3852920400 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852920405 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5f64 )
:3:hip_device_runtime.cpp :630 : 3852920409 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_context.cpp        :344 : 3852920414 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d78, 0x7ffc2e1c5d7c )
:3:hip_context.cpp        :358 : 3852920418 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852920425 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c5ef4 )
:3:hip_device_runtime.cpp :630 : 3852920429 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_context.cpp        :344 : 3852920432 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState ( 0, 0x7ffc2e1c5d08, 0x7ffc2e1c5d0c )
:3:hip_context.cpp        :358 : 3852920436 us: [pid:9 tid:0x7fcd5fa0c4c0] hipDevicePrimaryCtxGetState: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921568 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c6644 )
:3:hip_device_runtime.cpp :630 : 3852921575 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921698 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c64b4 )
:3:hip_device_runtime.cpp :630 : 3852921701 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_device_runtime.cpp :622 : 3852921726 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice ( 0x7ffc2e1c62c0 )
:3:hip_device_runtime.cpp :630 : 3852921730 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetDevice: Returned hipSuccess :
:3:hip_memory.cpp         :764 : 3852921741 us: [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo ( 0x7ffc2e1c6298, 0x7ffc2e1c62a0 )
:1:rocdevice.cpp          :1824: 3852921762 us: [pid:9 tid:0x7fcd5fa0c4c0] HSA_AMD_AGENT_INFO_MEMORY_AVAIL query failed.
:3:hip_memory.cpp         :777 : 3852921767 us: [pid:9 tid:0x7fcd5fa0c4c0] hipMemGetInfo: Returned hipErrorInvalidValue :
:3:hip_error.cpp          :35  : 3852921769 us: [pid:9 tid:0x7fcd5fa0c4c0] hipGetLastError ( )
:3:hip_device_runtime.cpp :652 : 3852922327 us: [pid:9 tid:0x7fcd5fa0c4c0] hipSetDevice ( 0 )
:3:hip_device_runtime.cpp :656 : 3852922332 us: [pid:9 tid:0x7fcd5fa0c4c0] hipSetDevice: Returned hipSuccess :
Traceback (most recent call last):
  File "/app/model/vllm_example.py", line 11, in <module>
    llm = LLM(model="facebook/opt-125m")
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/entrypoints/llm.py", line 109, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 371, in from_engine_args
    engine = cls(*engine_configs,
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 120, in __init__
    self._init_workers()
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 163, in _init_workers
    self._run_workers("init_model")
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 1014, in _run_workers
    driver_worker_output = getattr(self.driver_worker,
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.3.2+rocm603-py3.9-linux-x86_64.egg/vllm/worker/worker.py", line 89, in init_model
    self.init_gpu_memory = torch.cuda.mem_get_info()[0]
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/cuda/memory.py", line 663, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
RuntimeError: HIP error: invalid argument
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_HIP_DSA to enable device-side assertions.

:1:hip_fatbin.cpp :83 : 3853425875 us: [pid:9 tid:0x7fcd5fa0c4c0] All Unique FDs are closed
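In this trace, hipMemGetInfo returns hipErrorInvalidValue because the underlying HSA_AMD_AGENT_INFO_MEMORY_AVAIL query fails, and PyTorch then surfaces that as "HIP error: invalid argument". A sketch (assuming libamdhip64.so is reachable inside the container) that calls hipMemGetInfo directly via ctypes, to check whether the error originates in the HIP runtime / kernel driver combination rather than in PyTorch or vLLM:

# Call hipMemGetInfo through ctypes, bypassing PyTorch and vLLM entirely.
# A nonzero return code here (hipErrorInvalidValue in the trace above) would point
# at the HIP runtime / driver stack rather than the Python side.
import ctypes

hip = ctypes.CDLL("libamdhip64.so")    # HIP runtime library shipped with ROCm

ret = hip.hipSetDevice(0)              # select device 0 (implicitly initializes the runtime)
print("hipSetDevice returned:", ret)   # 0 means hipSuccess

free = ctypes.c_size_t()
total = ctypes.c_size_t()
ret = hip.hipMemGetInfo(ctypes.byref(free), ctypes.byref(total))
print("hipMemGetInfo returned:", ret)
if ret == 0:
    print(f"free={free.value / 2**30:.2f} GiB  total={total.value / 2**30:.2f} GiB")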

gopikrishnan92 commented 5 months ago

is it solved?

linchen111 commented 4 months ago

is it solved?

atallahwa2 commented 4 months ago

Same problem on the same GPU.

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!