[Bug]: vllm 0.4.0.post1 crashed when loading dbrx-instruct on AMD MI250x

Your current environment

vllm (commit db2a6a41e206abecf4128aba25117fcaf7bebe12) + ROCm 6.0 Docker image built with the fix of Dockerfile.rocm
4x AMD MI250x GPUs (each MI250x has 2 GPU dies, total 512GB GPU memory)
model: databricks/dbrx-instruct
🐛 Describe the bug

Ran vllm Docker image with docker run --network=host --group-add=video --ipc=host --cap-add=SYS_PTRACE --security-opt seccomp=unconfined --shm-size 32G --device /dev/kfd --device /dev/dri -v $model_dir:/app/model vllm-rocm:v0.4.0.post1 python -m vllm.entrypoints.openai.api_server --port 7860 --model /app/model/models--databricks--dbrx-instruct/snapshots/17365204e9cf13e2296ee984c1ab48071e861efa --trust-remote-code --tensor-parallel-size 8
The vllm server crashed soon after loading the model.
INFO 04-05 23:49:30 llm_engine.py:81] Initializing an LLM engine (v0.4.0.post1) with config: model='/app/model/models--databricks--dbrx-instruct/snapshots/17365204e9cf13e2296ee
984c1ab48071e861efa', speculative_config=None, tokenizer='/app/model/models--databricks--dbrx-instruct/snapshots/17365204e9cf13e2296ee984c1ab48071e861efa', tokenizer_mode=auto,
 revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=8, disable_c
ustom_all_reduce=True, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, seed=0)                                   
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.                                                           
WARNING 04-05 23:49:31 tokenizer.py:104] Using a slow tokenizer. This might cause a significant slowdown. Consider using a fast tokenizer instead.                              
INFO 04-05 23:49:46 pynccl.py:58] Loading nccl from library librccl.so.1                                                                                                        
INFO 04-05 23:49:46 selector.py:34] Cannot use FlashAttention backend for AMD GPUs.                                                                                             
INFO 04-05 23:49:46 selector.py:25] Using XFormers backend.                                                                                                                     
WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:                                                                                             
    PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.1.1+git011de5c)                                                                                                              
    Python  3.9.18 (you have 3.9.18)                                                                                                                                            
  Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)                                                                              
  Memory-efficient attention, SwiGLU, sparse and more won't be available.                                                                                                       
  Set XFORMERS_MORE_DETAILS=1 for more details                                                                                                                                  
(RayWorkerVllm pid=5498) WARNING[XFORMERS]: xFormers can't load C++/CUDA extensions. xFormers was built for:                                                                    
(RayWorkerVllm pid=5498)     PyTorch 2.1.1+cu121 with CUDA 1201 (you have 2.1.1+git011de5c)                                                                                     
(RayWorkerVllm pid=5498)     Python  3.9.18 (you have 3.9.18)                                                                                                                   
(RayWorkerVllm pid=5498)   Please reinstall xformers (see https://github.com/facebookresearch/xformers#installing-xformers)                            
(RayWorkerVllm pid=5498)   Memory-efficient attention, SwiGLU, sparse and more won't be available.                                                                              
(RayWorkerVllm pid=5498)   Set XFORMERS_MORE_DETAILS=1 for more details                                                                                                         
/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML                                                               
  warnings.warn("Can't initialize NVML")                                                                                                                                        
(RayWorkerVllm pid=5498) /opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/cuda/__init__.py:611: UserWarning: Can't initialize NVML                                      
(RayWorkerVllm pid=5498)   warnings.warn("Can't initialize NVML")                                                                                                               
(RayWorkerVllm pid=5498) INFO 04-05 23:49:47 pynccl.py:58] Loading nccl from library librccl.so.1                                                                           
(RayWorkerVllm pid=5498) INFO 04-05 23:49:47 selector.py:34] Cannot use FlashAttention backend for AMD GPUs.                                                                    
(RayWorkerVllm pid=5498) INFO 04-05 23:49:47 selector.py:25] Using XFormers backend.                                                                                            
(RayWorkerVllm pid=5498) INFO 04-05 23:49:48 pynccl_utils.py:45] vLLM is using nccl==2.18.3                                                                                     
INFO 04-05 23:49:49 pynccl_utils.py:45] vLLM is using nccl==2.18.3                                                                                                              
INFO 04-05 23:50:13 model_runner.py:104] Loading model weights took 30.6567 GB                                                                                                  
error: LLVM Translation failed for operation: builtin.unrealized_conversion_cast                                                                                                
Failed to emit LLVM IR                                                                                                                                                          
Translate to LLVM IR failedLLVM ERROR: Failed to translate TritonGPU to LLVM IR.                                                                                                
*** SIGABRT received at time=1712361058 on cpu 4 ***                                                                                                                            
PC: @     0x7f47c6d6b00b  (unknown)  raise                                                                                                                                      
    @     0x7f47c7088420  (unknown)  (unknown)                     
    @        0x100000000  (unknown)  (unknown)                                                                                                                                  
    @ ... and at least 1 more frames                                                                                                                                            
[2024-04-05 23:50:58,762 E 1 1] logging.cc:361: *** SIGABRT received at time=1712361058 on cpu 4 ***                                                                            
[2024-04-05 23:50:58,762 E 1 1] logging.cc:361: PC: @     0x7f47c6d6b00b  (unknown)  raise                                                                                      
[2024-04-05 23:50:58,763 E 1 1] logging.cc:361:     @     0x7f47c7088420  (unknown)  (unknown)                                                                                  
[2024-04-05 23:50:58,765 E 1 1] logging.cc:361:     @        0x100000000  (unknown)  (unknown)                                                                                  
[2024-04-05 23:50:58,765 E 1 1] logging.cc:361:     @ ... and at least 1 more frames                                                                                            
Fatal Python error: Aborted                                                                                                                                                     

Stack (most recent call first):                                                                                                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 114 in ttgir_to_llir                                                              
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 417 in <lambda>                                                                   
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 509 in compile                                                                    
  File "<string>", line 63 in fused_moe_kernel                                                                                                                                  
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/layers/fused_moe/fused_moe.py", line 222 in invok
e_fused_moe_kernel                                                                                                                                                              
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/layers/fused_moe/fused_moe.py", line 397 in fused
_moe                                                                                                                                                                            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/models/dbrx.py", line 148 in forward             
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl                                
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/models/dbrx.py", line 302 in forward             
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl                                                         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/models/dbrx.py", line 338 in forward             
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl                                                         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/models/dbrx.py", line 377 in forward         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl                                                         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/worker/model_runner.py", line 683 in execute_model              
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context                                                            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/worker/model_runner.py", line 762 in profile_run                
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context                                                            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/worker/worker.py", line 131 in profile_num_available_blocks     
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context                                                            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 328 in _run_workers         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 224 in _init_cache          
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 69 in __init__    
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 119 in __init__                     
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 421 in _init_engine           
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 311 in __init__               
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 347 in from_engine_args       
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 157 in <module>         
  File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 87 in _run_code                                                                                                    
  File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 197 in _run_module_as_main                                                                                         
[failure_signal_handler.cc : 332] RAW: Signal 11 raised at PC=0x7f47c6d4a941 while already in AbslFailureSignalHandler()                                                        
*** SIGSEGV received at time=1712361058 on cpu 4 ***                                                                                                                            
PC: @     0x7f47c6d4a941  (unknown)  abort                                                                                                                                      
    @     0x7f47c7088420  (unknown)  (unknown)                                                                                                                                  
    @        0x100000000  (unknown)  (unknown)                                                                                                                                  
    @ ... and at least 1 more frames                                                                                                                                            
[2024-04-05 23:50:58,768 E 1 1] logging.cc:361: *** SIGSEGV received at time=1712361058 on cpu 4 ***                                                                            
[2024-04-05 23:50:58,768 E 1 1] logging.cc:361: PC: @     0x7f47c6d4a941  (unknown)  abort                                                                                      
[2024-04-05 23:50:58,769 E 1 1] logging.cc:361:     @     0x7f47c7088420  (unknown)  (unknown)                                                                                  
[2024-04-05 23:50:58,770 E 1 1] logging.cc:361:     @        0x100000000  (unknown)  (unknown)                                                                                  
[2024-04-05 23:50:58,770 E 1 1] logging.cc:361:     @ ... and at least 1 more frames                                                                                            
Fatal Python error: Segmentation fault                                                                                                                                          

Stack (most recent call first):                                                                                                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 114 in ttgir_to_llir                                                              
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 417 in <lambda>                                                                   
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/triton/compiler/compiler.py", line 509 in compile                                                                    
  File "<string>", line 63 in fused_moe_kernel                                                                                                                                  
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/layers/fused_moe/fused_moe.py", line 222 in invok
e_fused_moe_kernel                                                                                                                                                              
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/layers/fused_moe/fused_moe.py", line 397 in fused
_moe                                                                                                                                                                            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/models/dbrx.py", line 148 in forward             
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl                                                         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/models/dbrx.py", line 302 in forward             
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl                                                         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/models/dbrx.py", line 338 in forward             
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl                                                         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/model_executor/models/dbrx.py", line 377 in forward            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1527 in _call_impl                                                                 
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1518 in _wrapped_call_impl                                                         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/worker/model_runner.py", line 683 in execute_model              
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context                                                            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/worker/model_runner.py", line 762 in profile_run                
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context                                                            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/worker/worker.py", line 131 in profile_num_available_blocks     
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115 in decorate_context                                                            
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 328 in _run_workers         
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 224 in _init_cache          
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/executor/ray_gpu_executor.py", line 69 in __init__              
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/engine/llm_engine.py", line 119 in __init__                     
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 421 in _init_engine           
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 311 in __init__               
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/engine/async_llm_engine.py", line 347 in from_engine_args       
  File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/vllm-0.4.0.post1+rocm603-py3.9-linux-x86_64.egg/vllm/entrypoints/openai/api_server.py", line 157 in <module>         
  File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 87 in _run_code                                                                                                    
  File "/opt/conda/envs/py_3.9/lib/python3.9/runpy.py", line 197 in _run_module_as_main
vllm-project / vllm

[Bug]: vllm 0.4.0.post1 crashed when loading dbrx-instruct on AMD MI250x #3878

Your current environment

🐛 Describe the bug