jradikk opened this issue 1 month ago
> It successfully runs the model over 3 pods (ray head + 2 ray workers)

I'm not sure what's happening here, but I think it has something to do with the pod assignment. We require the process that starts the vLLM instance to have access to GPUs. If you create 3 pods while you only have 2 GPUs, it might cause problems.
@youkaichao Using a suggestion from this issue, I hardcoded CUDA_VISIBLE_DEVICES and was able to make it work. I can only assume that either vLLM or Ray overrides this value somewhere along the way. Since it only happens with quantized models, I suspect it is something vLLM does, because Ray does not care about quantization. I'd appreciate any help debugging this, since going to production with such a hardcoded value is not an option.
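For context, a minimal sketch of what such a hardcoded workaround could look like (the device index "0" is an assumption and must match the GPU actually attached to the worker pod):

```python
import os

# Assumed workaround: restore a non-empty CUDA_VISIBLE_DEVICES before the
# vLLM engine is constructed. "0" is a placeholder GPU index.
if os.environ.get("CUDA_VISIBLE_DEVICES", None) == "":
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
```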
Try inserting these lines at the top of your code:

```python
import torch
import os
import sys
import traceback

found = False


def _trace_calls(frame, event, arg=None):
    if event in ['call', 'return']:
        # for every function call or return
        try:
            global found
            # Temporarily disable the trace function
            sys.settrace(None)
            # check condition here
            if not found and "CUDA_VISIBLE_DEVICES" in os.environ \
                    and os.environ["CUDA_VISIBLE_DEVICES"] == "":
                found = True
                traceback.print_stack()
            # Re-enable the trace function
            sys.settrace(_trace_calls)
        except NameError:
            # modules are deleted during shutdown
            pass
    return _trace_calls


sys.settrace(_trace_calls)
```

and see which function changes the CUDA_VISIBLE_DEVICES variable.
I have the same issue. It appears to have been introduced in 0.5.5, although the error message is different there; my exact same code runs fine on 0.5.4 with no other changes. Here's the traceback using 0.5.5:

```
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 771, in create_engine_config
model_config = ModelConfig(
^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/config.py", line 227, in __init__
self._verify_quantization()
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/config.py", line 285, in _verify_quantization
quantization_override = method.override_quantization_method(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 82, in override_quantization_method
can_convert = cls.is_awq_marlin_compatible(hf_quant_cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 127, in is_awq_marlin_compatible
return check_marlin_supported(quant_type=cls.TYPE_MAP[num_bits],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 78, in check_marlin_supported
cond, _ = _check_marlin_supported(quant_type, group_size, has_zp,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 55, in _check_marlin_supported
major, minor = current_platform.get_device_capability()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 96, in get_device_capability
physical_device_id = device_id_to_physical_device_id(device_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/75ba1c81b701b8c38cfbb47599fdb4870c7c7def/virtualenv/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 86, in device_id_to_physical_device_id
return int(physical_device_id)
^^^^^^^^^^^^^^^^^^^^^^^
ValueError: invalid literal for int() with base 10: ''
```
And here's the traceback using 0.6.0:

```
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 874, in create_engine_config
model_config = self.create_model_config()
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/engine/arg_utils.py", line 811, in create_model_config
return ModelConfig(
^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/config.py", line 223, in __init__
self._verify_quantization()
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/config.py", line 286, in _verify_quantization
quantization_override = method.override_quantization_method(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 83, in override_quantization_method
can_convert = cls.is_awq_marlin_compatible(hf_quant_cfg)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/awq_marlin.py", line 128, in is_awq_marlin_compatible
return check_marlin_supported(quant_type=cls.TYPE_MAP[num_bits],
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 80, in check_marlin_supported
cond, _ = _check_marlin_supported(quant_type, group_size, has_zp,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/model_executor/layers/quantization/utils/marlin_utils.py", line 56, in _check_marlin_supported
capability_tuple = current_platform.get_device_capability()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 101, in get_device_capability
physical_device_id = device_id_to_physical_device_id(device_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2024-10-02_05-28-34_844817_12/runtime_resources/pip/0bcd9ff70c46fb6cd68b24dda744d492798de5e3/virtualenv/lib/python3.11/site-packages/vllm/platforms/cuda.py", line 88, in device_id_to_physical_device_id
raise RuntimeError("CUDA_VISIBLE_DEVICES is set to empty string,"
RuntimeError: CUDA_VISIBLE_DEVICES is set to empty string, which means GPU support is disabled.
```
@youkaichao wanted to check if there was any update on this?
@alexdauenhauer please follow the discussion above and provide enough information.
We observed the same problem: with vLLM 0.5.4 we have no trouble deploying quantized models (e.g., this one); however, the moment we select any version greater than 0.5.4 (tested up to 0.6.2) we get the same problem as described above.
Using the hardcoded solution proposed here works, but we wanted something less invasive, so we came up with the following solution; it might help someone:
```python
# ...
# All your other imports come before; the ones used below would typically be:
import logging
import os

from fastapi import FastAPI
from ray import serve

import vllm.platforms.cuda

logger = logging.getLogger("ray.serve")
app = FastAPI()

# vLLM has some issues in certain versions, which is why we introduce some additional logic
# https://github.com/vllm-project/vllm/issues/7890
# https://github.com/vllm-project/vllm/issues/8402
# Goal: if possible, non-invasive and non-blocking

# Save a reference to the original function
original_function = vllm.platforms.cuda.device_id_to_physical_device_id


def device_id_to_physical_device_id_wrapper(*args, **kwargs):
    logger.info(f"Hook: Executing code before calling "
                f"'device_id_to_physical_device_id' (with args={args}, kwargs={kwargs}).")
    if not len(os.environ["CUDA_VISIBLE_DEVICES"]):
        try:
            import nvsmi
            gpu_count: int = len(list(nvsmi.get_gpus()))
            new_env_value: str = ",".join([str(n) for n in range(gpu_count)])
            os.environ["CUDA_VISIBLE_DEVICES"] = new_env_value
            logger.info(f"New value for environment variable 'CUDA_VISIBLE_DEVICES': {new_env_value}")
        except BaseException as e:
            logger.error(f"Could not derive gpu_count using 'nvsmi' library. Error: {e}")
    func_response = original_function(*args, **kwargs)
    logger.info(f"Function 'device_id_to_physical_device_id' response: {func_response}")
    return func_response


# Replace the original function with the wrapped version
vllm.platforms.cuda.device_id_to_physical_device_id = device_id_to_physical_device_id_wrapper


@serve.deployment(name="VLLMDeployment")
@serve.ingress(app)
class VLLMDeployment:
    # ...
```
So this is meant to be inserted into your vLLM deployment code. We add some logging for transparency and try to derive the correct value of the environment variable automatically using the nvsmi library (which then of course also needs to be a dependency of your deployment, installed for instance via pip); however, you can also provide the required information in another way.
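If you would rather not add the nvsmi dependency, a similar count can likely be obtained with pynvml (which appears in the environment output further below); this is only a sketch of that alternative, not part of the original workaround:

```python
import os

import pynvml  # NVML bindings; `pip install pynvml` if not already present


def ensure_cuda_visible_devices() -> None:
    """If CUDA_VISIBLE_DEVICES was reset to an empty string, repopulate it.

    NVML enumerates physical GPUs and is not affected by CUDA_VISIBLE_DEVICES.
    """
    if os.environ.get("CUDA_VISIBLE_DEVICES") == "":
        pynvml.nvmlInit()
        try:
            gpu_count = pynvml.nvmlDeviceGetCount()
        finally:
            pynvml.nvmlShutdown()
        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in range(gpu_count))
```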
@mcd01 I still don't understand the problem. Who sets CUDA_VISIBLE_DEVICES to an empty string?
@youkaichao Unfortunately, I have not yet had the time to investigate it further.
@youkaichao For me it is due to launching vLLM in a Ray cluster where the head node does not have a GPU attached, only the worker nodes do. I'll try some of the workarounds listed here.
@alexdauenhauer

> launching vllm in a ray cluster where the head node does not have GPU attached

This is not supported in vLLM. Even if you solve this error, it will fail later. You need to launch the vLLM process on a node with GPUs.
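If the root cause is that the Serve deployment (and therefore the process constructing the vLLM engine) is being scheduled on the GPU-less head node, one option is to request a GPU for the deployment itself so Ray places it on a worker node. A minimal sketch, assuming the Ray Serve deployment pattern from the snippet above:

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()


# ray_actor_options={"num_gpus": 1} asks Ray to schedule this deployment's
# replica on a node that actually has a GPU, so the process that constructs
# the vLLM engine is never placed on a GPU-less head node.
@serve.deployment(name="VLLMDeployment", ray_actor_options={"num_gpus": 1})
@serve.ingress(app)
class VLLMDeployment:
    def __init__(self):
        # Construct the vLLM engine here, as in the earlier deployment code.
        ...
```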
Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-118-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A40
Nvidia driver version: 550.90.07
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.6
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.6
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 43 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 20
On-line CPU(s) list: 0-19
Vendor ID: GenuineIntel
Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
CPU family: 6
Model: 106
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 20
Stepping: 6
BogoMIPS: 5786.40
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 invpcid avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xsaves arat pku ospke md_clear flush_l1d arch_capabilities
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 960 KiB (20 instances)
L1i cache: 640 KiB (20 instances)
L2 cache: 25 MiB (20 instances)
L3 cache: 480 MiB (20 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-9
NUMA node1 CPU(s): 10-19
Vulnerability Gather data sampling: Unknown: Dependent on hypervisor status
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] botorch==0.8.5
[pip3] gpytorch==1.10
[pip3] msgpack-numpy==0.4.8
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==8.9.2.26
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.5.82
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] onnx==1.15.0
[pip3] onnxruntime==1.18.0
[pip3] pynvml==11.5.0
[pip3] pytorch-lightning==1.8.6
[pip3] pytorch-ranger==0.1.1
[pip3] pyzmq==26.0.3
[pip3] tf2onnx==1.15.1
[pip3] torch==2.3.0+cu121
[pip3] torch_cluster==1.6.3+pt23cu121
[pip3] torch_geometric==2.5.3
[pip3] torch-optimizer==0.3.0
[pip3] torch_scatter==2.1.2+pt23cu121
[pip3] torch_sparse==0.6.18+pt23cu121
[pip3] torch_spline_conv==1.2.2+pt23cu121
[pip3] torchmetrics==0.10.3
[pip3] torchtext==0.18.0+cpu
[pip3] torchvision==0.18.0+cu121
[pip3] transformers==4.36.2
[pip3] triton==2.3.0
[conda] botorch 0.8.5 pypi_0 pypi
[conda] gpytorch 1.10 pypi_0 pypi
[conda] msgpack-numpy 0.4.8 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 8.9.2.26 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.5.82 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pynvml 11.5.0 pypi_0 pypi
[conda] pytorch-lightning 1.8.6 pypi_0 pypi
[conda] pytorch-ranger 0.1.1 pypi_0 pypi
[conda] pyzmq 26.0.3 pypi_0 pypi
[conda] torch 2.3.0+cu121 pypi_0 pypi
[conda] torch-cluster 1.6.3+pt23cu121 pypi_0 pypi
[conda] torch-geometric 2.5.3 pypi_0 pypi
[conda] torch-optimizer 0.3.0 pypi_0 pypi
[conda] torch-scatter 2.1.2+pt23cu121 pypi_0 pypi
[conda] torch-sparse 0.6.18+pt23cu121 pypi_0 pypi
[conda] torch-spline-conv 1.2.2+pt23cu121 pypi_0 pypi
[conda] torchmetrics 0.10.3 pypi_0 pypi
[conda] torchtext 0.18.0+cpu pypi_0 pypi
[conda] torchvision 0.18.0+cu121 pypi_0 pypi
[conda] transformers 4.36.2 pypi_0 pypi
[conda] triton 2.3.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-19            0-1             N/A

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

Model Input Dumps
No response
🐛 Describe the bug
I have 2 nodes with 1 GPU each in a Kubernetes environment. Each of those GPUs is too small to load the model (hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4) by itself, hence what I want is to run 2 replicas of vLLM over Ray to split the weights between the two (PIPELINE_PARALLEL_SIZE config).
I was able to do that using this guide. This is the command I used to launch the quantized llama3.1:70b:

```
python3 -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name llama3.1:70b --model hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --max-model-len 4096 --tokenizer hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 --dtype half -q marlin --tensor-parallel-size 1 --pipeline-parallel-size 2
```
It successfully runs the model over 3 pods (Ray head + 2 Ray workers), where each worker has access to its own GPU (46 GB).
However, I'm unable to do the same thing with KubeRay, launching the model via a RayService. Whenever I try to launch a quantized model, I get the following error:

```
RuntimeError: CUDA_VISIBLE_DEVICES is set to empty string, which means GPU support is disabled.
```

However, I'm able to do the same thing using a non-quantized model. Together with KubeRay's maintainer, we successfully launched llama3.1:8B and llama3.1:70B, but we were unable to launch the same models with marlin and gptq quantization. I have previously opened an issue in KubeRay.