vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Installation]: #8255

Closed ndao600 closed 1 week ago

ndao600 commented 1 week ago

Your current environment

The output of `python collect_env.py`

Collecting environment information...
/home/miniconda3/envs/vllm/lib/python3.12/site-packages/torch/cuda/__init__.py:128: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.12.4 | packaged by Anaconda, Inc. | (main, Jun 18 2024, 15:12:24) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.153.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: False
CUDA runtime version: 12.5.82
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration:
GPU 0: NVIDIA RTX 4000 Ada Generation
GPU 1: NVIDIA RTX 4000 Ada Generation
GPU 2: NVIDIA RTX 4000 Ada Generation
GPU 3: NVIDIA RTX 4000 Ada Generation
GPU 4: NVIDIA RTX 4000 Ada Generation
GPU 5: NVIDIA RTX 4000 Ada Generation
GPU 6: NVIDIA RTX 4000 Ada Generation
GPU 7: NVIDIA RTX 4000 Ada Generation

Nvidia driver version: 555.99
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 48
On-line CPU(s) list: 0-47
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 7960X 24-Cores
CPU family: 25
Model: 24
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 1
Stepping: 1
BogoMIPS: 8387.54
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx512_bf16 clzero xsaveerptr arat npt nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold v_vmsave_vmload avx512vbmi umip avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid fsrm
Virtualization: AMD-V
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 768 KiB (24 instances)
L1i cache: 768 KiB (24 instances)
L2 cache: 24 MiB (24 instances)
L3 cache: 32 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.6.68
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi
[conda] nvidia-ml-py 12.560.30 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.6.68 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi
[conda] pyzmq 26.2.0 pypi_0 pypi
[conda] torch 2.4.0 pypi_0 pypi
[conda] torchvision 0.19.0 pypi_0 pypi
[conda] transformers 4.44.2 pypi_0 pypi
[conda] triton 3.0.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: N/A
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0   X    SYS   SYS   SYS   SYS   SYS   SYS   SYS                                N/A
GPU1  SYS    X    SYS   SYS   SYS   SYS   SYS   SYS                                N/A
GPU2  SYS   SYS    X    SYS   SYS   SYS   SYS   SYS                                N/A
GPU3  SYS   SYS   SYS    X    SYS   SYS   SYS   SYS                                N/A
GPU4  SYS   SYS   SYS   SYS    X    SYS   SYS   SYS                                N/A
GPU5  SYS   SYS   SYS   SYS   SYS    X    SYS   SYS                                N/A
GPU6  SYS   SYS   SYS   SYS   SYS   SYS    X    SYS                                N/A
GPU7  SYS   SYS   SYS   SYS   SYS   SYS   SYS    X                                 N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

How you are installing vllm

pip install -vvv vllm

pip install vllm

I was using vllm 0.5.4, which was running fine without any issues. I just upgraded to 0.6.0 and got the message above. I have tried deleting the conda env and re-installing, setting CUDA visible devices, and setting the CUDA arch, all without success. Here is the error message when I try to run vllm serve:

Traceback (most recent call last):
  File "/home/miniconda3/envs/vllm/bin/vllm", line 5, in <module>
    from vllm.scripts import main
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/__init__.py", line 3, in <module>
    from vllm.engine.arg_utils import AsyncEngineArgs, EngineArgs
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 11, in <module>
    from vllm.config import (CacheConfig, DecodingConfig, DeviceConfig,
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/config.py", line 12, in <module>
    from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/__init__.py", line 3, in <module>
    from vllm.model_executor.layers.quantization.aqlm import AQLMConfig
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/model_executor/layers/quantization/aqlm.py", line 11, in <module>
    from vllm import _custom_ops as ops
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/_custom_ops.py", line 10, in <module>
    from vllm.platforms import current_platform
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/platforms/__init__.py", line 51, in <module>
    from .cuda import CudaPlatform
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 79, in <module>
    warn_if_different_devices()
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 66, in warn_if_different_devices
    device_names = [get_physical_device_name(i) for i in range(device_ids)]
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 66, in <listcomp>
    device_names = [get_physical_device_name(i) for i in range(device_ids)]
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/vllm/platforms/cuda.py", line 59, in get_physical_device_name
    return pynvml.nvmlDeviceGetName(handle)
  File "/home/miniconda3/envs/vllm/lib/python3.10/site-packages/pynvml.py", line 2182, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte

Interestingly, I have intermittently gotten this warning when running other programs and have never been able to solve it:

UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 2: out of memory (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
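
For reference, here is a minimal check (my own sketch, assuming torch and nvidia-ml-py/pynvml are importable in the same environment) that separates the two symptoms above: torch failing CUDA initialization, and NVML returning a device name that is not valid UTF-8.

# Sketch only: isolate the two symptoms outside of vllm.
import torch
import pynvml

# Symptom 1: does torch see the GPUs at all? (reports False here on WSL2 with driver 555.x)
print("torch.cuda.is_available():", torch.cuda.is_available())

# Symptom 2: can NVML report the device names without a UnicodeDecodeError?
pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        try:
            print(f"GPU {i}:", pynvml.nvmlDeviceGetName(handle))
        except UnicodeDecodeError as e:
            print(f"GPU {i}: NVML returned a non-UTF-8 name ({e})")
finally:
    pynvml.nvmlShutdown()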

I would really appreciate the help to troubleshoot this.


axel7083 commented 1 week ago

I am facing the same issue while running vllm/vllm-openai:latest inside podman.

Here are the logs:

$: podman run --entrypoint=bash -it --device nvidia.com/gpu=all -p 8000:8000 --ipc=host vllm/vllm-openai:latest
/vllm-workspace# python3 -m vllm.entrypoints.openai.api_server --model mistralai/Mistral-7B-v0.1
Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 236, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
  File "/usr/local/lib/python3.10/dist-packages/vllm/entrypoints/openai/rpc/server.py", line 34, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 735, in from_engine_args
    engine = cls(
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 615, in __init__
    self.engine = self._init_engine(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 835, in _init_engine
    return engine_class(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/async_llm_engine.py", line 262, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/engine/llm_engine.py", line 305, in __init__
    self.model_executor = executor_class(
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/executor_base.py", line 47, in __init__
    self._init_executor()
  File "/usr/local/lib/python3.10/dist-packages/vllm/executor/gpu_executor.py", line 39, in _init_executor
    self.driver_worker.init_device()
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 168, in init_device
    _check_if_gpu_supports_dtype(self.model_config.dtype)
  File "/usr/local/lib/python3.10/dist-packages/vllm/worker/worker.py", line 459, in _check_if_gpu_supports_dtype
    gpu_name = current_platform.get_device_name()
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 107, in get_device_name
    return get_physical_device_name(physical_device_id)
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/vllm/platforms/cuda.py", line 59, in get_physical_device_name
    return pynvml.nvmlDeviceGetName(handle)
  File "/usr/local/lib/python3.10/dist-packages/pynvml.py", line 2182, in wrapper
    return res.decode()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
ERROR 09-09 06:00:53 api_server.py:186] RPCServer process died before responding to readiness probe

Debugging the container

Inside the container I am able to reproduce the issue:

/vllm-workspace# python3
>>> import pynvml
>>> pynvml.nvmlInit()
>>> device_ids: int = pynvml.nvmlDeviceGetCount()
>>> assert device_ids > 0
>>> handle = pynvml.nvmlDeviceGetHandleByIndex(0)
>>> pynvml.nvmlDeviceGetName(handle)

This corresponds to the error raised by the server when it reaches the following line:

https://github.com/vllm-project/vllm/blob/4ef41b84766670c1bd8079f58d35bf32b5bcb3ab/vllm/platforms/cuda.py#L59
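
For what it's worth, that line ends up doing a strict UTF-8 decode of whatever bytes NVML hands back, and with the buggy 555.x driver the first byte is 0xf8, which can never start a valid UTF-8 sequence. A tiny standalone illustration (not a pynvml or vllm patch, just to show why the decode raises):

# Illustration only: why res.decode() fails on a name starting with byte 0xf8.
bad_name = b"\xf8"  # 0xf8 is not a valid UTF-8 start byte
try:
    bad_name.decode()  # strict UTF-8 decode, same as pynvml's res.decode()
except UnicodeDecodeError as e:
    print(e)  # 'utf-8' codec can't decode byte 0xf8 in position 0: invalid start byte
print(bad_name.decode(errors="replace"))  # a lenient decode yields '\ufffd' instead of raising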

Related issues

https://github.com/wandb/wandb/issues/7770

The problem seems to be linked to driver version 555.x (which I am also currently running).

Solution

Updating to the latest driver (560.x) fixed the problem.
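
For anyone else checking, one way to confirm which driver version the environment actually sees (a sketch assuming nvidia-ml-py is installed; nvidia-smi reports the same field) is to query NVML directly:

# Print the driver version NVML reports; it should read 560.x or newer after the update.
import pynvml

pynvml.nvmlInit()
print(pynvml.nvmlSystemGetDriverVersion())
pynvml.nvmlShutdown()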

ndao600 commented 1 week ago

Yes. I just updated the driver and it is now fixed. Thank you.