vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Installed vllm successfully for AMD MI60 but inference is failing #9319

Open Said-Akbar opened 1 week ago

Said-Akbar commented 1 week ago

Your current environment

The output of `python collect_env.py` ```text python collect_env.py Collecting environment information... WARNING 10-12 21:26:08 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. WARNING 10-12 21:26:08 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") PyTorch version: 2.6.0.dev20241011+rocm6.2 Is debug build: False CUDA used to build PyTorch: N/A ROCM used to build PyTorch: 6.2.41133-dd7f95766 OS: Ubuntu 22.04.5 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.30.4 Libc version: glibc-2.35 Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime) Python platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 11.5.119 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: AMD Radeon Graphics (gfx906:sramecc+:xnack-) Nvidia driver version: 550.90.07 cuDNN version: Could not collect HIP runtime version: 6.2.41133 MIOpen runtime version: 3.2.0 Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 5950X 16-Core Processor CPU family: 25 Model: 33 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU max MHz: 5083.3979 CPU min MHz: 2200.0000 BogoMIPS: 6800.12 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap L1d cache: 512 KiB (16 instances) L1i cache: 512 KiB (16 instances) L2 cache: 8 MiB (16 instances) L3 cache: 64 MiB (2 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP 
always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] pynvml==11.5.3 [pip3] pytorch-triton-rocm==3.1.0+cf34004b8a [pip3] pyzmq==26.2.0 [pip3] torch==2.6.0.dev20241011+rocm6.2 [pip3] torchaudio==2.5.0.dev20241011+rocm6.2 [pip3] torchvision==0.20.0.dev20241011+rocm6.2 [pip3] transformers==4.45.2 [pip3] triton==2.1.0 [conda] Could not collect ROCM Version: 6.2.41134-65d174c3e Neuron SDK Version: N/A vLLM Version: 0.6.3.dev187+g250e26a6 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X 0-31 0 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Model Input Dumps

No response

🐛 Describe the bug

I have 2x AMD MI60 GPUs and 1x 3060 for video output. I installed all the dependencies for vLLM under ROCm successfully. However, when I try to deploy a model, I get an error. I ran this command in a terminal: `vllm serve /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct`

This is the error ```text WARNING 10-12 21:29:27 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-12 21:29:30 api_server.py:528] vLLM API server version 0.6.3.dev187+g250e26a6 INFO 10-12 21:29:30 api_server.py:529] args: Namespace(subparser='serve', model_tag='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=) INFO 10-12 
21:29:30 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/8629bc8f-4416-41bf-91d8-8f128b0e586c for IPC Path. INFO 10-12 21:29:30 api_server.py:179] Started engine process with PID 275069 WARNING 10-12 21:29:31 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-12 21:29:34 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-12 21:29:38 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-12 21:29:38 llm_engine.py:237] Initializing an LLM engine (v0.6.3.dev187+g250e26a6) with config: model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', speculative_config=None, tokenizer='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None) INFO 10-12 21:29:38 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-12 21:29:38 selector.py:120] Using ROCmFlashAttention backend. INFO 10-12 21:29:38 model_runner.py:1045] Starting to load model /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct... INFO 10-12 21:29:38 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-12 21:29:38 selector.py:120] Using ROCmFlashAttention backend. 
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00 sys.exit(load_entry_point('vllm==0.6.3.dev187+g250e26a6.rocm624', 'console_scripts', 'vllm')()) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/scripts.py", line 195, in main args.dispatch_function(args) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/scripts.py", line 41, in serve uvloop.run(run_server(args)) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 552, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start ```

It seems like the ROCm build is missing paged_attention. And yes, I tried a clean reinstall of vLLM and that did not work either. Let me know what is missing or how to fix this bug. Thanks!
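
As a quick sanity check (prompted by the `Failed to import from vllm._C` warning in the environment output above), something like the following should show whether the ROCm kernels were ever built; the exact commands are only illustrative and assume the same virtualenv as in the logs:

```bash
# If the ROCm kernels were compiled, importing vllm._C should succeed
# and torch should report a HIP (ROCm) build rather than CUDA.
python -c "import torch; print('HIP:', torch.version.hip, '| GPU:', torch.cuda.get_device_name(0))"
python -c "import vllm._C; print('vllm._C custom ops found')"
```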


amd-abhikulk commented 5 days ago

It looks like your vLLM wasn't installed with ROCm support. Try `VLLM_TARGET_DEVICE=rocm pip3 install .` to install vLLM.
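
For example, roughly something like this (the requirements file name and layout may differ in your checkout; run it in the same virtualenv that has the ROCm build of PyTorch):

```bash
# Build vLLM from source against ROCm instead of CUDA
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip3 install -r requirements-rocm.txt    # ROCm-specific Python deps, if present in your checkout
VLLM_TARGET_DEVICE=rocm pip3 install .   # compiles the HIP kernels (vllm._C)
```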

Said-Akbar commented 5 days ago

Thanks! I reinstalled vLLM with that command. Here are my collect_env details now:

The output of `python collect_env.py` ```text python collect_env.py Collecting environment information... WARNING 10-15 09:11:56 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. PyTorch version: 2.6.0.dev20241011+rocm6.2 Is debug build: False CUDA used to build PyTorch: N/A ROCM used to build PyTorch: 6.2.41133-dd7f95766 OS: Ubuntu 22.04.5 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.30.4 Libc version: glibc-2.35 Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime) Python platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 11.5.119 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: AMD Radeon Graphics (gfx906:sramecc+:xnack-) Nvidia driver version: 550.90.07 cuDNN version: Could not collect HIP runtime version: 6.2.41133 MIOpen runtime version: 3.2.0 Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 5950X 16-Core Processor CPU family: 25 Model: 33 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU max MHz: 5083.3979 CPU min MHz: 2200.0000 BogoMIPS: 6800.44 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap L1d cache: 512 KiB (16 instances) L1i cache: 512 KiB (16 instances) L2 cache: 8 MiB (16 instances) L3 cache: 64 MiB (2 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: 
Not affected Versions of relevant libraries: [pip3] mypy-protobuf==3.6.0 [pip3] numpy==1.26.4 [pip3] pynvml==11.5.3 [pip3] pytorch-triton-rocm==3.1.0+cf34004b8a [pip3] pyzmq==26.2.0 [pip3] torch==2.6.0.dev20241011+rocm6.2 [pip3] torchaudio==2.5.0.dev20241011+rocm6.2 [pip3] torchvision==0.20.0.dev20241011+rocm6.2 [pip3] transformers==4.45.2 [pip3] triton==2.1.0 [conda] Could not collect ROCM Version: 6.2.41134-65d174c3e Neuron SDK Version: N/A vLLM Version: 0.6.4.dev8+ge9d517f2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X 0-31 0 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Now, when I ran the same serve command, I got this error:

```text vllm serve /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct WARNING 10-15 08:52:42 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-15 08:52:45 api_server.py:528] vLLM API server version 0.6.4.dev8+ge9d517f2 INFO 10-15 08:52:45 api_server.py:529] args: Namespace(subparser='serve', model_tag='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, dispatch_function=) INFO 10-15 08:52:45 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/4d1778d4-a520-4715-9d97-75468731a52b for IPC Path. INFO 10-15 08:52:45 api_server.py:179] Started engine process with PID 15314 WARNING 10-15 08:52:46 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-15 08:52:49 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-15 08:52:52 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-15 08:52:52 llm_engine.py:237] Initializing an LLM engine (v0.6.4.dev8+ge9d517f2) with config: model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', speculative_config=None, tokenizer='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None) INFO 10-15 08:52:52 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-15 08:52:52 selector.py:120] Using ROCmFlashAttention backend. INFO 10-15 08:52:52 model_runner.py:1060] Starting to load model /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct... INFO 10-15 08:52:52 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-15 08:52:52 selector.py:120] Using ROCmFlashAttention backend. 
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00 sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')()) File "/home/saidp/Downloads/amd_llm/vllm/vllm/scripts.py", line 195, in main args.dispatch_function(args) File "/home/saidp/Downloads/amd_llm/vllm/vllm/scripts.py", line 41, in serve uvloop.run(run_server(args)) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start ```

I think vLLM is not using my compiled version of Triton. The vLLM engine defaults to the system-installed pytorch-triton-rocm (version 3.1.0+cf34004b8a). If I uninstall pytorch-triton-rocm, vLLM shows an error that pytorch-triton-rocm is missing, even though I already have a compiled Triton.
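
For reference, this is how I checked which Triton build Python actually resolves (just a diagnostic; the package names are from my pip list above):

```bash
# Show which Triton module Python imports and which triton packages pip sees
python -c "import triton; print(triton.__version__, triton.__file__)"
pip3 list | grep -i triton
```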

Then I tried with `export VLLM_USE_TRITON_FLASH_ATTN=0`.
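
In full, that attempt looked like this (same model path as before):

```bash
export VLLM_USE_TRITON_FLASH_ATTN=0
vllm serve /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct
```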

Here is the error I got.

```text vllm serve /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct WARNING 10-15 09:10:31 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-15 09:10:34 api_server.py:528] vLLM API server version 0.6.4.dev8+ge9d517f2 INFO 10-15 09:10:34 api_server.py:529] args: Namespace(subparser='serve', model_tag='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, dispatch_function=) INFO 10-15 09:10:34 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/e3e70524-e140-44bc-aedf-018d11ae4773 for IPC Path. INFO 10-15 09:10:34 api_server.py:179] Started engine process with PID 21105 WARNING 10-15 09:10:35 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-15 09:10:38 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-15 09:10:41 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-15 09:10:41 llm_engine.py:237] Initializing an LLM engine (v0.6.4.dev8+ge9d517f2) with config: model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', speculative_config=None, tokenizer='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None) INFO 10-15 09:10:41 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-15 09:10:41 selector.py:120] Using ROCmFlashAttention backend. INFO 10-15 09:10:41 model_runner.py:1060] Starting to load model /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct... INFO 10-15 09:10:41 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-15 09:10:41 selector.py:120] Using ROCmFlashAttention backend. 
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00 sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')()) File "/home/saidp/Downloads/amd_llm/vllm/vllm/scripts.py", line 195, in main args.dispatch_function(args) File "/home/saidp/Downloads/amd_llm/vllm/vllm/scripts.py", line 41, in serve uvloop.run(run_server(args)) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start ```

It is again a similar message saying that ROCm is missing paged_attention. Please let me know if you have a fix. Also, how were you able to run vLLM on your AMD GPUs? Thanks!

amd-abhikulk commented 4 days ago

Why don't you try AMD's fork of vLLM, https://github.com/ROCm/vllm, instead of this one? They might have that support. Also, this vLLM mostly supports GPUs with the CDNA architecture, while yours is Vega 20 (gfx906); that might also be a problem.
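
For example (untested on my end; `rocminfo` is only there to confirm the gfx906 / Vega 20 target from your environment output):

```bash
# Confirm the GPU architecture ROCm sees (MI60 should report gfx906)
rocminfo | grep -i gfx

# Try AMD's fork, built with the same ROCm target flag
git clone https://github.com/ROCm/vllm.git
cd vllm
VLLM_TARGET_DEVICE=rocm pip3 install .
```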