vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Installed vllm successfully for AMD MI60 but inference is failing #9319

Open Said-Akbar opened 1 week ago

Said-Akbar commented 1 week ago

Your current environment

The output of `python collect_env.py` ```text python collect_env.py Collecting environment information... WARNING 10-12 21:26:08 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. WARNING 10-12 21:26:08 _custom_ops.py:19] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'") PyTorch version: 2.6.0.dev20241011+rocm6.2 Is debug build: False CUDA used to build PyTorch: N/A ROCM used to build PyTorch: 6.2.41133-dd7f95766 OS: Ubuntu 22.04.5 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.30.4 Libc version: glibc-2.35 Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime) Python platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 11.5.119 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: AMD Radeon Graphics (gfx906:sramecc+:xnack-) Nvidia driver version: 550.90.07 cuDNN version: Could not collect HIP runtime version: 6.2.41133 MIOpen runtime version: 3.2.0 Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 5950X 16-Core Processor CPU family: 25 Model: 33 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU max MHz: 5083.3979 CPU min MHz: 2200.0000 BogoMIPS: 6800.12 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap L1d cache: 512 KiB (16 instances) L1i cache: 512 KiB (16 instances) L2 cache: 8 MiB (16 instances) L3 cache: 64 MiB (2 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP 
always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] numpy==1.26.4 [pip3] pynvml==11.5.3 [pip3] pytorch-triton-rocm==3.1.0+cf34004b8a [pip3] pyzmq==26.2.0 [pip3] torch==2.6.0.dev20241011+rocm6.2 [pip3] torchaudio==2.5.0.dev20241011+rocm6.2 [pip3] torchvision==0.20.0.dev20241011+rocm6.2 [pip3] transformers==4.45.2 [pip3] triton==2.1.0 [conda] Could not collect ROCM Version: 6.2.41134-65d174c3e Neuron SDK Version: N/A vLLM Version: 0.6.3.dev187+g250e26a6 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X 0-31 0 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Model Input Dumps

No response

🐛 Describe the bug

I have 2x AMD MI60 GPUs and 1x 3060 for video output. I installed all the dependencies for vLLM under ROCm successfully. However, when I try to deploy a model, I get an error. I ran this command in a terminal: `vllm serve /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct`

This is the error ```text WARNING 10-12 21:29:27 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-12 21:29:30 api_server.py:528] vLLM API server version 0.6.3.dev187+g250e26a6 INFO 10-12 21:29:30 api_server.py:529] args: Namespace(subparser='serve', model_tag='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=) INFO 10-12 
21:29:30 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/8629bc8f-4416-41bf-91d8-8f128b0e586c for IPC Path. INFO 10-12 21:29:30 api_server.py:179] Started engine process with PID 275069 WARNING 10-12 21:29:31 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-12 21:29:34 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-12 21:29:38 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-12 21:29:38 llm_engine.py:237] Initializing an LLM engine (v0.6.3.dev187+g250e26a6) with config: model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', speculative_config=None, tokenizer='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None) INFO 10-12 21:29:38 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-12 21:29:38 selector.py:120] Using ROCmFlashAttention backend. INFO 10-12 21:29:38 model_runner.py:1045] Starting to load model /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct... INFO 10-12 21:29:38 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-12 21:29:38 selector.py:120] Using ROCmFlashAttention backend. 
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00 sys.exit(load_entry_point('vllm==0.6.3.dev187+g250e26a6.rocm624', 'console_scripts', 'vllm')()) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/scripts.py", line 195, in main args.dispatch_function(args) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/scripts.py", line 41, in serve uvloop.run(run_server(args)) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 552, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/vllm-0.6.3.dev187+g250e26a6.rocm624-py3.10-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start ```

It seems like the ROCm build is missing paged_attention. And yes, I tried a clean reinstall of vLLM and that did not work either. Let me know what is missing or how to fix this bug. Thanks!
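
As a quick sanity check (prompted by the `Failed to import from vllm._C` warning in the environment output above), something like the following should show whether the ROCm kernels were ever built; the exact commands are only illustrative and assume the same virtualenv as in the logs:

```bash
# If the ROCm kernels were compiled, importing vllm._C should succeed
# and torch should report a HIP (ROCm) build rather than CUDA.
python -c "import torch; print('HIP:', torch.version.hip, '| GPU:', torch.cuda.get_device_name(0))"
python -c "import vllm._C; print('vllm._C custom ops found')"
```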


amd-abhikulk commented 5 days ago

It looks like your vLLM wasn't installed with ROCm support. Try `VLLM_TARGET_DEVICE=rocm pip3 install .` to install vLLM.
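
For example, roughly something like this (the requirements file name and layout may differ in your checkout; run it in the same virtualenv that has the ROCm build of PyTorch):

```bash
# Build vLLM from source against ROCm instead of CUDA
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip3 install -r requirements-rocm.txt    # ROCm-specific Python deps, if present in your checkout
VLLM_TARGET_DEVICE=rocm pip3 install .   # compiles the HIP kernels (vllm._C)
```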

Said-Akbar commented 5 days ago

Thanks! I reinstalled vLLM with that command. Here are my collect_env details now:

The output of `python collect_env.py` ```text python collect_env.py Collecting environment information... WARNING 10-15 09:11:56 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. PyTorch version: 2.6.0.dev20241011+rocm6.2 Is debug build: False CUDA used to build PyTorch: N/A ROCM used to build PyTorch: 6.2.41133-dd7f95766 OS: Ubuntu 22.04.5 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.30.4 Libc version: glibc-2.35 Python version: 3.10.12 (main, Sep 11 2024, 15:47:36) [GCC 11.4.0] (64-bit runtime) Python platform: Linux-6.8.0-45-generic-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 11.5.119 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: AMD Radeon Graphics (gfx906:sramecc+:xnack-) Nvidia driver version: 550.90.07 cuDNN version: Could not collect HIP runtime version: 6.2.41133 MIOpen runtime version: 3.2.0 Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD Ryzen 9 5950X 16-Core Processor CPU family: 25 Model: 33 Thread(s) per core: 2 Core(s) per socket: 16 Socket(s): 1 Stepping: 0 Frequency boost: enabled CPU max MHz: 5083.3979 CPU min MHz: 2200.0000 BogoMIPS: 6800.44 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local user_shstk clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm debug_swap L1d cache: 512 KiB (16 instances) L1i cache: 512 KiB (16 instances) L2 cache: 8 MiB (16 instances) L3 cache: 64 MiB (2 instances) NUMA node(s): 1 NUMA node0 CPU(s): 0-31 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: 
Not affected Versions of relevant libraries: [pip3] mypy-protobuf==3.6.0 [pip3] numpy==1.26.4 [pip3] pynvml==11.5.3 [pip3] pytorch-triton-rocm==3.1.0+cf34004b8a [pip3] pyzmq==26.2.0 [pip3] torch==2.6.0.dev20241011+rocm6.2 [pip3] torchaudio==2.5.0.dev20241011+rocm6.2 [pip3] torchvision==0.20.0.dev20241011+rocm6.2 [pip3] transformers==4.45.2 [pip3] triton==2.1.0 [conda] Could not collect ROCM Version: 6.2.41134-65d174c3e Neuron SDK Version: N/A vLLM Version: 0.6.4.dev8+ge9d517f2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X 0-31 0 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Now, when I ran the same serve command, I got this error:

```text vllm serve /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct WARNING 10-15 08:52:42 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-15 08:52:45 api_server.py:528] vLLM API server version 0.6.4.dev8+ge9d517f2 INFO 10-15 08:52:45 api_server.py:529] args: Namespace(subparser='serve', model_tag='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, dispatch_function=) INFO 10-15 08:52:45 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/4d1778d4-a520-4715-9d97-75468731a52b for IPC Path. INFO 10-15 08:52:45 api_server.py:179] Started engine process with PID 15314 WARNING 10-15 08:52:46 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-15 08:52:49 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-15 08:52:52 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-15 08:52:52 llm_engine.py:237] Initializing an LLM engine (v0.6.4.dev8+ge9d517f2) with config: model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', speculative_config=None, tokenizer='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None) INFO 10-15 08:52:52 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-15 08:52:52 selector.py:120] Using ROCmFlashAttention backend. INFO 10-15 08:52:52 model_runner.py:1060] Starting to load model /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct... INFO 10-15 08:52:52 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-15 08:52:52 selector.py:120] Using ROCmFlashAttention backend. 
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00 sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')()) File "/home/saidp/Downloads/amd_llm/vllm/vllm/scripts.py", line 195, in main args.dispatch_function(args) File "/home/saidp/Downloads/amd_llm/vllm/vllm/scripts.py", line 41, in serve uvloop.run(run_server(args)) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start ```

I think vLLM is not using my compiled version of Triton. The vLLM engine defaults to the system-installed pytorch-triton-rocm (version 3.1.0+cf34004b8a). If I uninstall pytorch-triton-rocm, vLLM shows an error that pytorch-triton-rocm is missing, even though I already have a compiled Triton.
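
For reference, this is how I checked which Triton build Python actually resolves (just a diagnostic; the package names are from my pip list above):

```bash
# Show which Triton module Python imports and which triton packages pip sees
python -c "import triton; print(triton.__version__, triton.__file__)"
pip3 list | grep -i triton
```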

Then I tried with `export VLLM_USE_TRITON_FLASH_ATTN=0`.
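
In full, that attempt looked like this (same model path as before):

```bash
export VLLM_USE_TRITON_FLASH_ATTN=0
vllm serve /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct
```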

Here is the error I got.

```text vllm serve /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct WARNING 10-15 09:10:31 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-15 09:10:34 api_server.py:528] vLLM API server version 0.6.4.dev8+ge9d517f2 INFO 10-15 09:10:34 api_server.py:529] args: Namespace(subparser='serve', model_tag='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format='auto', dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=None, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, 
max_log_len=None, disable_fastapi_docs=False, dispatch_function=) INFO 10-15 09:10:34 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/e3e70524-e140-44bc-aedf-018d11ae4773 for IPC Path. INFO 10-15 09:10:34 api_server.py:179] Started engine process with PID 21105 WARNING 10-15 09:10:35 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead, and make sure to uninstall `pynvml`. When both of them are installed, `pynvml` will take precedence and cause errors. See https://pypi.org/project/pynvml for more information. INFO 10-15 09:10:38 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-15 09:10:41 config.py:916] Disabled the custom all-reduce kernel because it is not supported on AMD GPUs. INFO 10-15 09:10:41 llm_engine.py:237] Initializing an LLM engine (v0.6.4.dev8+ge9d517f2) with config: model='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', speculative_config=None, tokenizer='/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None) INFO 10-15 09:10:41 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-15 09:10:41 selector.py:120] Using ROCmFlashAttention backend. INFO 10-15 09:10:41 model_runner.py:1060] Starting to load model /media/saidp/datasets/text_generation/models/unsloth_llama-3-8b-Instruct... INFO 10-15 09:10:41 selector.py:215] flash_attn is not supported on NAVI GPUs. INFO 10-15 09:10:41 selector.py:120] Using ROCmFlashAttention backend. 
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00 sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')()) File "/home/saidp/Downloads/amd_llm/vllm/vllm/scripts.py", line 195, in main args.dispatch_function(args) File "/home/saidp/Downloads/amd_llm/vllm/vllm/scripts.py", line 41, in serve uvloop.run(run_server(args)) File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run return loop.run_until_complete(wrapper()) File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete File "/home/saidp/Downloads/amd_llm/myenv/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper return await main File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 552, in run_server async with build_async_engine_client(args) as engine_client: File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client async with build_async_engine_client_from_engine_args( File "/usr/lib/python3.10/contextlib.py", line 199, in __aenter__ return await anext(self.gen) File "/home/saidp/Downloads/amd_llm/vllm/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args raise RuntimeError( RuntimeError: Engine process failed to start ```

It is again a similar message saying that ROCm is missing paged_attention. Please let me know if you have a fix. Also, how were you able to run vLLM on your AMD GPUs? Thanks!

amd-abhikulk commented 4 days ago

Why don't you try AMD's fork of vLLM, https://github.com/ROCm/vllm, instead of this one? They might have that support. Also, this vLLM mostly supports GPUs with the CDNA architecture, while yours is Vega 20 (gfx906); that might also be a problem.
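
For example (untested on my end; `rocminfo` is only there to confirm the gfx906 / Vega 20 target from your environment output):

```bash
# Confirm the GPU architecture ROCm sees (MI60 should report gfx906)
rocminfo | grep -i gfx

# Try AMD's fork, built with the same ROCm target flag
git clone https://github.com/ROCm/vllm.git
cd vllm
VLLM_TARGET_DEVICE=rocm pip3 install .
```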