[Bug]: Docker image version 05.0 and 0.4.3 dont work with 4090's

Your current environment

Collecting environment information...
WARNING 06-13 12:05:09 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.5
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-35-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.2.140
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090

Nvidia driver version: 535.171.04
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             32
On-line CPU(s) list:                0-31
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen Threadripper PRO 5955WX 16-Cores
CPU family:                         25
Model:                              8
Thread(s) per core:                 2
Core(s) per socket:                 16
Socket(s):                          1
Stepping:                           2
Frequency boost:                    enabled
CPU max MHz:                        7031.2500
CPU min MHz:                        1800.0000
BogoMIPS:                           7984.91
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin brs arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip pku ospke vaes vpclmulqdq rdpid overflow_recov succor smca fsrm
Virtualization:                     AMD-V
L1d cache:                          512 KiB (16 instances)
L1i cache:                          512 KiB (16 instances)
L2 cache:                           8 MiB (16 instances)
L3 cache:                           64 MiB (2 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-31
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Vulnerable: Safe RET, no microcode
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines; IBPB conditional; IBRS_FW; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] torchvision==0.18.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.0
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     0-31    0               N/A
GPU1    SYS      X      0-31    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

Docker Image, 0.5.0, and 0.4.3 do not work.

I can revert back to 0.4.2 and it works as expected.

nvidia-smi

Thu Jun  13 08:59:09 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        On  | 00000000:41:00.0 Off |                  Off |
|  0%   33C    P8              30W / 480W |    191MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        On  | 00000000:61:00.0 Off |                  Off |
|  0%   32C    P8              34W / 480W |     11MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                          167MiB |
|    0   N/A  N/A      1842      G   /usr/bin/gnome-shell                         14MiB |
|    1   N/A  N/A      1764      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

docker run --gpus all \
    -v ./data:/Data \
    -p 8000:8000 \
    --ipc=host \
    --env "HUGGING_FACE_HUB_TOKEN=hf_token" \
    vllm/vllm-openai:v0.4.3 \
    --model /Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ \
    --max-model-len 32000 \
    --tensor-parallel-size 2 \
    --gpu-memory-utilization .9 \
    --api-key sk-1234567890abcdef \
    --served-model-name model_name \
    --quantization awq \
    --dtype half

WARNING 06-06 14:09:52 config.py:213] awq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
2024-06-06 14:09:54,151 INFO worker.py:1749 -- Started a local Ray instance.
INFO 06-06 14:09:54 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ', speculative_config=None, tokenizer='/Data/Meta-Llama-3-8B-Instruct-32k-instruct-quantized-4bit-128qg-AWQ', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32000, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, disable_custom_all_reduce=False, quantization=awq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=model_name)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
ERROR 06-06 14:09:57 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
ERROR 06-06 14:09:57 worker_base.py:148] Traceback (most recent call last):
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-06 14:09:57 worker_base.py:148]     return executor(*args, **kwargs)
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
ERROR 06-06 14:09:57 worker_base.py:148]     self.worker = worker_class(*args, **kwargs)
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
ERROR 06-06 14:09:57 worker_base.py:148]     self.model_runner = ModelRunnerClass(
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
ERROR 06-06 14:09:57 worker_base.py:148]     self.attn_backend = get_attn_backend(
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
ERROR 06-06 14:09:57 worker_base.py:148]     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
ERROR 06-06 14:09:57 worker_base.py:148]     if torch.cuda.get_device_capability()[0] < 8:
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
ERROR 06-06 14:09:57 worker_base.py:148]     prop = get_device_properties(device)
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
ERROR 06-06 14:09:57 worker_base.py:148]     _lazy_init()  # will define _get_device_properties
ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
ERROR 06-06 14:09:57 worker_base.py:148]     torch._C._cuda_init()
ERROR 06-06 14:09:57 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
    engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 222, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 41, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
    self._init_workers_ray(placement_group)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 169, in _init_workers_ray
    self._run_workers("init_worker", all_kwargs=init_worker_all_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
    driver_worker_output = self.driver_worker.execute_method(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 149, in execute_method
    raise e
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
    return executor(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
    self.worker = worker_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
    self.model_runner = ModelRunnerClass(
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
    self.attn_backend = get_attn_backend(
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
    backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
  File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
    if torch.cuda.get_device_capability()[0] < 8:
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
    prop = get_device_properties(device)
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
    _lazy_init()  # will define _get_device_properties
  File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] Error executing method init_worker. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/ray/util/tracing/tracing_helper.py", line 467, in _resume_span
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     return method(self, *_args, **_kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker_base.py", line 134, in init_worker
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     self.worker = worker_class(*args, **kwargs)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 74, in __init__
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     self.model_runner = ModelRunnerClass(
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/worker/model_runner.py", line 115, in __init__
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     self.attn_backend = get_attn_backend(
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 43, in get_attn_backend
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/vllm/attention/selector.py", line 118, in which_attn_to_use
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     if torch.cuda.get_device_capability()[0] < 8:
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     prop = get_device_properties(device)
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     _lazy_init()  # will define _get_device_properties
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 293, in _lazy_init
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148]     torch._C._cuda_init()
(RayWorkerWrapper pid=2148) ERROR 06-06 14:09:57 worker_base.py:148] RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW

vllm-project / vllm

[Bug]: Docker image version 05.0 and 0.4.3 dont work with 4090's #5510

Your current environment

🐛 Describe the bug