vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: CUDA device detection issue with KubeRay distributed inference for quantized models #8402

Open jradikk opened 1 month ago

jradikk commented 1 month ago

Your current environment

The output of `python collect_env.py` ```text Collecting environment information... PyTorch version: 2.3.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A OS: Ubuntu 22.04.4 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: Could not collect CMake version: version 3.22.1 Libc version: glibc-2.35 Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35 Is CUDA available: True CUDA runtime version: 11.8.89 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A40 Nvidia driver version: 550.90.07 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6 /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6 /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6 /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6 /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6 /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 43 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 20 On-line CPU(s) list: 0-19 Vendor ID: GenuineIntel Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz CPU family: 6 Model: 106 Thread(s) per core: 1 Core(s) per socket: 1 Socket(s): 20 Stepping: 6 BogoMIPS: 5786.40 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities Hypervisor vendor: VMware Virtualization type: full L1d cache: 960 KiB (20 instances) L1i cache: 640 KiB (20 instances) L2 cache: 25 MiB (20 instances) L3 cache: 480 MiB (20 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-9 NUMA node1 CPU(s): 10-19 Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Not affected Vulnerability Spec rstack overflow: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Versions of relevant libraries: [pip3] botorch==0.8.5 [pip3] gpytorch==1.10 [pip3] msgpack-numpy==0.4.8 [pip3] numpy==1.26.4 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] 
nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu12==8.9.2.26 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvjitlink-cu12==12.5.82 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] onnx==1.15.0 [pip3] onnxruntime==1.18.0 [pip3] pynvml==11.5.0 [pip3] pytorch-lightning==1.8.6 [pip3] pytorch-ranger==0.1.1 [pip3] pyzmq==26.0.3 [pip3] tf2onnx==1.15.1 [pip3] torch==2.3.0+cu121 [pip3] torch_cluster==1.6.3+pt23cu121 [pip3] torch_geometric==2.5.3 [pip3] torch-optimizer==0.3.0 [pip3] torch_scatter==2.1.2+pt23cu121 [pip3] torch_sparse==0.6.18+pt23cu121 [pip3] torch_spline_conv==1.2.2+pt23cu121 [pip3] torchmetrics==0.10.3 [pip3] torchtext==0.18.0+cpu [pip3] torchvision==0.18.0+cu121 [pip3] transformers==4.36.2 [pip3] triton==2.3.0 [conda] botorch 0.8.5 pypi_0 pypi [conda] gpytorch 1.10 pypi_0 pypi [conda] msgpack-numpy 0.4.8 pypi_0 pypi [conda] numpy 1.26.4 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.5.82 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pynvml 11.5.0 pypi_0 pypi [conda] pytorch-lightning 1.8.6 pypi_0 pypi [conda] pytorch-ranger 0.1.1 pypi_0 pypi [conda] pyzmq 26.0.3 pypi_0 pypi [conda] torch 2.3.0+cu121 pypi_0 pypi [conda] torch-cluster 1.6.3+pt23cu121 pypi_0 pypi [conda] torch-geometric 2.5.3 pypi_0 pypi [conda] torch-optimizer 0.3.0 pypi_0 pypi [conda] torch-scatter 2.1.2+pt23cu121 pypi_0 pypi [conda] torch-sparse 0.6.18+pt23cu121 pypi_0 pypi [conda] torch-spline-conv 1.2.2+pt23cu121 pypi_0 pypi [conda] torchmetrics 0.10.3 pypi_0 pypi [conda] torchtext 0.18.0+cpu pypi_0 pypi [conda] torchvision 0.18.0+cu121 pypi_0 pypi [conda] transformers 4.36.2 pypi_0 pypi [conda] triton 2.3.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: N/A vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X 0-19 0-1 N/A Legend: X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks ```

Model Input Dumps

No response

🐛 Describe the bug

I have 2 nodes with 1 GPU each in a Kubernetes environment. Neither GPU is large enough to load the model (hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4) on its own, so I want to run 2 vLLM workers over Ray and split the weights between the two nodes using pipeline parallelism (the `--pipeline-parallel-size` setting).

I was able to do that by following this guide. This is the command I used to launch the quantized llama3.1:70b: `python3 -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name llama3.1:70b --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --max-model-len 4096 --tokenizer hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --dtype half -q marlin --tensor-parallel-size 1 --pipeline-parallel-size 2`.

It successfully runs the model across 3 pods (Ray head + 2 Ray workers), where each worker has access to its own GPU (46 GB).

However, I'm unable to do the same thing with KubeRay, launching the model via a RayService. Whenever I try to launch a quantized model I get the following error: `RuntimeError: CUDA_VISIBLE_DEVICES is set to empty string, which means GPU support is disabled.` I am, however, able to launch a non-quantized model this way. Together with the KubeRay maintainer, I successfully launched llama3.1:8B and llama3.1:70B without quantization, but we were unable to launch the same models with Marlin and GPTQ quantization.

I have previously opened an issue in KubeRay.


youkaichao commented 1 month ago

> It successfully runs the model across 3 pods (Ray head + 2 Ray workers)

I'm not sure what's happening here, but I think it has something to do with the pod assignment. We require the process that starts the vLLM instance to have access to GPUs.

If you create 3 pods while you only have 2 GPUs, it might cause problems.
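As a quick sanity check (just a sketch, not part of vLLM), you can print what the launching process actually sees before vLLM starts:

```python
import os

import torch

# Run this in the pod/process that launches vLLM: if CUDA_VISIBLE_DEVICES is an
# empty string or no devices are visible, vLLM cannot use a GPU from this process.
print("CUDA_VISIBLE_DEVICES =", repr(os.environ.get("CUDA_VISIBLE_DEVICES")))
print("torch.cuda.is_available() =", torch.cuda.is_available())
print("torch.cuda.device_count() =", torch.cuda.device_count())
```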

jradikk commented 1 month ago

@youkaichao

  1. I'm able to successfully launch a non-quantized model with this setup
  2. I did test a setup where the Ray head also has access to a GPU, but it didn't change the situation.

jradikk commented 1 month ago

@youkaichao Using a suggestion from this issue, I hardcoded CUDA_VISIBLE_DEVICES and was able to make it work. I can only assume that either vLLM or Ray overrides this value somewhere along the way. Given that it only happens with quantized models, I suspect it is something vLLM does, since Ray does not care about quantization. I'd appreciate any help debugging this, since going to production with such a hardcoded value is not an option.
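For reference, the workaround is essentially the following (a minimal sketch; the device index `"0"` is an assumption for a single-GPU worker pod), placed before vLLM is imported in the deployment code:

```python
import os

# Hardcoded workaround (not suitable for production): force the variable before
# vLLM inspects it. "0" assumes each worker pod sees exactly one GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# ... vLLM / Ray Serve imports and deployment code follow ...
```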

youkaichao commented 1 month ago

try to insert these lines at the top of your code:

import os
import sys
import traceback

import torch

found = False

def _trace_calls(frame, event, arg=None):
    if event in ['call', 'return']:
        # for every function call or return
        try:
            global found
            # Temporarily disable the trace function
            sys.settrace(None)
            # check condition here
            if not found and "CUDA_VISIBLE_DEVICES" in os.environ and os.environ["CUDA_VISIBLE_DEVICES"] == "":
                found = True
                traceback.print_stack()
            # Re-enable the trace function
            sys.settrace(_trace_calls)
        except NameError:
            # modules are deleted during shutdown
            pass
    return _trace_calls
sys.settrace(_trace_calls)

and see which function sets the CUDA_VISIBLE_DEVICES variable to an empty string.

alexdauenhauer commented 1 month ago

I have the same issue. It appears to have been introduced in 0.5.5, although the error message is different there; my exact same code runs fine on 0.5.4 with no other changes. Here's the traceback using 0.5.5:

 File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 771, in create_engine_config
    model_config = ModelConfig(
                   ^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/config.py", line 227, in __init__
    self._verify_quantization()
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/config.py", line 285, in _verify_quantization
    quantization_override = method.override_quantization_method(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 82, in override_quantization_method
    can_convert = cls.is_awq_marlin_compatible(hf_quant_cfg)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 127, in is_awq_marlin_compatible
    return check_marlin_supported(quant_type=cls.TYPE_MAP[num_bits],
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 78, in check_marlin_supported
    cond, _ = _check_marlin_supported(quant_type, group_size, has_zp,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 55, in _check_marlin_supported
    major, minor = current_platform.get_device_capability()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 96, in get_device_capability
    physical_device_id = device_id_to_physical_device_id(device_id)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 86, in device_id_to_physical_device_id
    return int(physical_device_id)
           ^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''

And here's the traceback using 0.6.0:

  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 874, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 811, in create_model_config
    return ModelConfig(
           ^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/config.py", line 223, in __init__
    self._verify_quantization()
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/config.py", line 286, in _verify_quantization
    quantization_override = method.override_quantization_method(
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 83, in override_quantization_method
    can_convert = cls.is_awq_marlin_compatible(hf_quant_cfg)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 128, in is_awq_marlin_compatible
    return check_marlin_supported(quant_type=cls.TYPE_MAP[num_bits],
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 80, in check_marlin_supported
    cond, _ = _check_marlin_supported(quant_type, group_size, has_zp,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 56, in _check_marlin_supported
    capability_tuple = current_platform.get_device_capability()
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 101, in get_device_capability
    physical_device_id = device_id_to_physical_device_id(device_id)
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 88, in device_id_to_physical_device_id
    raise RuntimeError("CUDA_VISIBLE_DEVICES is set to empty string,"
RuntimeError: CUDA_VISIBLE_DEVICES is set to empty string, which means GPU support is disabled.
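Based on these tracebacks, the failing helper in vllm/platforms/cuda.py appears to do roughly the following (a paraphrased sketch, not the verbatim source). This would also explain why only quantized models hit it: the quantization compatibility check queries the device capability in the launching process, before any GPU worker is involved.

```python
import os

def device_id_to_physical_device_id(device_id: int) -> int:
    # Paraphrased: when CUDA_VISIBLE_DEVICES is set, the logical device id is
    # mapped through it. An empty string yields no usable entry, so int("")
    # fails in 0.5.5, while 0.6.0 raises the explicit RuntimeError instead.
    if "CUDA_VISIBLE_DEVICES" in os.environ:
        device_ids = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
        physical_device_id = device_ids[device_id]
        return int(physical_device_id)
    return device_id
```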

alexdauenhauer commented 2 weeks ago

@youkaichao I wanted to check if there was any update on this?

youkaichao commented 2 weeks ago

@alexdauenhauer please follow the discussion above (e.g., the tracing suggestion) and provide enough information for debugging.

mcd01 commented 1 day ago

We observed the same problem: with vLLM 0.5.4 we have no problem deploying quantized models (e.g., this one); however, the moment we select any version greater than 0.5.4 (tested up to 0.6.2) we get the same error as described above.

The hardcoded solution proposed here works, but we wanted something less invasive, so we came up with the following workaround; it might help someone:

# ...
# All your other imports come before; the ones used below are shown explicitly
import logging
import os

from fastapi import FastAPI
from ray import serve

import vllm.platforms.cuda

logger = logging.getLogger("ray.serve")

app = FastAPI()

# vLLM has some issues in certain versions, which is why we introduce some additional logic
# https://github.com/vllm-project/vllm/issues/7890
# https://github.com/vllm-project/vllm/issues/8402
# Goal: as non-invasive and non-blocking as possible
# Save a reference to the original function
original_function = vllm.platforms.cuda.device_id_to_physical_device_id


def device_id_to_physical_device_id_wrapper(*args, **kwargs):
    logger.info(f"Hook: Executing code before calling "
                f"'device_id_to_physical_device_id' (with args={args}, kwargs={kwargs}).")
    # Use .get() so a missing variable is handled the same way as an empty string
    if not os.environ.get("CUDA_VISIBLE_DEVICES"):
        try:
            import nvsmi
            gpu_count: int = len(list(nvsmi.get_gpus()))
            new_env_value: str = ",".join(str(n) for n in range(gpu_count))
            os.environ["CUDA_VISIBLE_DEVICES"] = new_env_value
            logger.info(f"New value for environment variable 'CUDA_VISIBLE_DEVICES': {new_env_value}")
        except BaseException as e:
            logger.error(f"Could not derive gpu_count using the 'nvsmi' library. Error: {e}")
    func_response = original_function(*args, **kwargs)
    logger.info(f"Function 'device_id_to_physical_device_id' response: {func_response}")
    return func_response


# Replace the original function with the wrapped version
vllm.platforms.cuda.device_id_to_physical_device_id = device_id_to_physical_device_id_wrapper


@serve.deployment(name="VLLMDeployment")
@serve.ingress(app)
class VLLMDeployment:
    # ...

This is meant to be added to your vLLM deployment code. We add some logging for transparency and try to derive the correct value of the environment variable automatically using the nvsmi library (which then also needs to be a dependency of your deployment, installed, for instance, via pip); however, you can also provide the required information in another way.
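If you'd rather avoid the extra dependency, a similar sketch using pynvml (which appears in the environment output above; verify it is available in your own deployment) could replace the nvsmi call inside the wrapper:

```python
import os

import pynvml  # NVML bindings; queries the driver directly, ignoring CUDA_VISIBLE_DEVICES

def _visible_devices_from_nvml() -> str:
    # Build "0,1,...,N-1" for all physical GPUs on the node.
    pynvml.nvmlInit()
    try:
        gpu_count = pynvml.nvmlDeviceGetCount()
    finally:
        pynvml.nvmlShutdown()
    return ",".join(str(i) for i in range(gpu_count))

if not os.environ.get("CUDA_VISIBLE_DEVICES"):
    os.environ["CUDA_VISIBLE_DEVICES"] = _visible_devices_from_nvml()
```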

youkaichao commented 1 day ago

@mcd01 I still don't understand the problem: who sets CUDA_VISIBLE_DEVICES to an empty string?

mcd01 commented 23 hours ago

@youkaichao Unfortunately I have not yet had the time to investigate it further.

alexdauenhauer commented 15 hours ago

@youkaichao For me it happens when launching vLLM in a Ray cluster where the head node does not have a GPU attached, only the worker nodes do. I'll try some of the workarounds listed here.

youkaichao commented 11 hours ago

@alexdauenhauer

> launching vLLM in a Ray cluster where the head node does not have a GPU attached

This is not supported in vLLM. Even if you work around this error, it will fail later. You need to launch the vLLM process on a node with GPUs.
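One way to check where the engine actually runs (just a sketch using Ray's public API, assuming it executes inside a Ray actor such as a Serve replica):

```python
import ray

# Run inside the Ray actor/replica that will construct the vLLM engine.
# An empty list means this replica was scheduled on a node (e.g., a CPU-only
# head pod) with no GPU assigned, which vLLM does not support.
print("GPU ids assigned to this worker:", ray.get_gpu_ids())
```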