vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: _pickle.UnpicklingError: invalid load key, 'W' when initializing distributed environment with vllm 0.5.5 #7846

Closed. Mr-KenLee closed this issue 2 weeks ago.

Mr-KenLee commented 2 weeks ago

Your current environment

The output of `python collect_env.py`:

```text
Collecting environment information...
WARNING 08-25 16:33:04 cuda.py:22] You are using a deprecated `pynvml` package. Please install `nvidia-ml-py` instead. See https://pypi.org/project/pynvml for more information.
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.30.2
Libc version: glibc-2.35

Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.14.105-1-tlinux3-0013-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: A100-SXM4-40GB
GPU 1: A100-SXM4-40GB

Nvidia driver version: 450.80.02
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.4
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Address sizes:                   43 bits physical, 48 bits virtual
Byte Order:                      Little Endian
CPU(s):                          192
On-line CPU(s) list:             0-191
Vendor ID:                       AuthenticAMD
Model name:                      AMD EPYC 7K62 48-Core Processor
CPU family:                      23
Model:                           49
Thread(s) per core:              2
Core(s) per socket:              48
Socket(s):                       2
Stepping:                        0
Frequency boost:                 enabled
CPU max MHz:                     2600.0000
CPU min MHz:                     1500.0000
BogoMIPS:                        5189.76
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate sme ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip rdpid overflow_recov succor smca
Virtualization:                  AMD-V
L1d cache:                       3 MiB (96 instances)
L1i cache:                       3 MiB (96 instances)
L2 cache:                        48 MiB (96 instances)
L3 cache:                        384 MiB (24 instances)
NUMA node(s):                    2
NUMA node0 CPU(s):               0-47,96-143
NUMA node1 CPU(s):               48-95,144-191
Vulnerability L1tf:              Not affected
Vulnerability Meltdown:          Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:        Mitigation; __user pointer sanitization
Vulnerability Spectre v2:        Vulnerable, IBPB: disabled, STIBP: disabled

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] nvidia-cublas-cu12==12.1.3.1
[pip3] nvidia-cuda-cupti-cu12==12.1.105
[pip3] nvidia-cuda-nvrtc-cu12==12.1.105
[pip3] nvidia-cuda-runtime-cu12==12.1.105
[pip3] nvidia-cudnn-cu12==9.1.0.70
[pip3] nvidia-cufft-cu12==11.0.2.54
[pip3] nvidia-curand-cu12==10.3.2.106
[pip3] nvidia-cusolver-cu12==11.4.5.107
[pip3] nvidia-cusparse-cu12==12.1.0.106
[pip3] nvidia-ml-py==12.560.30
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] nvidia-nvjitlink-cu12==12.1.105
[pip3] nvidia-nvtx-cu12==12.1.105
[pip3] pynvml==11.5.0
[pip3] pyzmq==26.2.0
[pip3] torch==2.4.0
[pip3] torchaudio==2.3.0+cu121
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.2
[pip3] triton==3.0.0
[conda] numpy                    1.26.3       pypi_0 pypi
[conda] nvidia-cublas-cu12       12.1.3.1     pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12   12.1.105     pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12   12.1.105     pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.1.105     pypi_0 pypi
[conda] nvidia-cudnn-cu12        9.1.0.70     pypi_0 pypi
[conda] nvidia-cufft-cu12        11.0.2.54    pypi_0 pypi
[conda] nvidia-curand-cu12       10.3.2.106   pypi_0 pypi
[conda] nvidia-cusolver-cu12     11.4.5.107   pypi_0 pypi
[conda] nvidia-cusparse-cu12     12.1.0.106   pypi_0 pypi
[conda] nvidia-ml-py             12.560.30    pypi_0 pypi
[conda] nvidia-nccl-cu12         2.20.5       pypi_0 pypi
[conda] nvidia-nvjitlink-cu12    12.1.105     pypi_0 pypi
[conda] nvidia-nvtx-cu12         12.1.105     pypi_0 pypi
[conda] pynvml                   11.5.0       pypi_0 pypi
[conda] pyzmq                    26.2.0       pypi_0 pypi
[conda] torch                    2.4.0        pypi_0 pypi
[conda] torchaudio               2.3.0+cu121  pypi_0 pypi
[conda] torchvision              0.19.0       pypi_0 pypi
[conda] transformers             4.44.2       pypi_0 pypi
[conda] triton                   3.0.0        pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.5@09c7792610ada9f88bbf87d32b472dd44bf23cc2
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0 GPU1 mlx5_0 mlx5_1 mlx5_2 mlx5_3 mlx5_4 mlx5_5 mlx5_6 mlx5_7 mlx5_8 mlx5_9 mlx5_10 mlx5_11 mlx5_12 mlx5_13 mlx5_14 mlx5_15 mlx5_16 mlx5_17 mlx5_18 mlx5_19 mlx5_20 mlx5_21 mlx5_22 mlx5_23 mlx5_24 mlx5_25 CPU Affinity NUMA Affinity
GPU0    X NV12 SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-95,144-191 1
GPU1    NV12 X SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS SYS 48-95,144-191 1
mlx5_0  SYS SYS X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_1  SYS SYS PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_2  SYS SYS PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_3  SYS SYS PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_4  SYS SYS PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_5  SYS SYS PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_6  SYS SYS PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_7  SYS SYS PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_8  SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_9  SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_10 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_11 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_12 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_13 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_14 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_15 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_16 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_17 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX PIX
mlx5_18 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX PIX
mlx5_19 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX PIX
mlx5_20 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX PIX
mlx5_21 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX PIX
mlx5_22 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX PIX
mlx5_23 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX PIX
mlx5_24 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X PIX
mlx5_25 SYS SYS PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX PIX X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```

🐛 Describe the bug

I attempted to use vllm==0.5.5 and ran the following script:

python -m vllm.entrypoints.openai.api_server \
    --model $MODEL_PATH \
    --tensor-parallel-size 2 \
    --seed 0 \
    --gpu-memory-utilization 0.7
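
For reference, roughly the same configuration can also be exercised through the offline Python API, which goes through the same engine and executor initialization. A minimal sketch (the model path below is just a placeholder standing in for `$MODEL_PATH`):

```python
from vllm import LLM, SamplingParams

# Minimal offline sketch using the same parallelism settings as the server
# command above; "/path/to/model" is a placeholder for $MODEL_PATH.
llm = LLM(
    model="/path/to/model",
    tensor_parallel_size=2,
    seed=0,
    gpu_memory_utilization=0.7,
)

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```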

However, I encountered the following error while loading the model:

ERROR 08-25 16:17:39 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 1049 died, exit code: -15
INFO 08-25 16:17:39 multiproc_worker_utils.py:123] Killing local vLLM worker processes
Traceback (most recent call last):
  File "/root/miniconda3/lib/python3.11/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/root/miniconda3/lib/python3.11/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 230, in run_rpc_server
    server = AsyncEngineRPCServer(async_engine_args, usage_context, rpc_path)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/entrypoints/openai/rpc/server.py", line 31, in __init__
    self.engine = AsyncLLMEngine.from_engine_args(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 740, in from_engine_args
    engine = cls(
             ^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 636, in __init__
    self.engine = self._init_engine(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 840, in _init_engine
    return engine_class(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 272, in __init__
    super().__init__(*args, **kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/engine/llm_engine.py", line 270, in __init__
    self.model_executor = executor_class(
                          ^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 215, in __init__
    super().__init__(*args, **kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
    super().__init__(*args, **kwargs)
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/executor_base.py", line 46, in __init__
    self._init_executor()
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 137, in _init_executor
    self._run_workers("init_device")
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/executor/multiproc_gpu_executor.py", line 192, in _run_workers
    driver_worker_output = driver_worker_method(*args, **kwargs)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 175, in init_device
    init_worker_distributed_environment(self.parallel_config, self.rank,
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/worker/worker.py", line 450, in init_worker_distributed_environment
    ensure_model_parallel_initialized(parallel_config.tensor_parallel_size,
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 965, in ensure_model_parallel_initialized
    initialize_model_parallel(tensor_model_parallel_size,
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 931, in initialize_model_parallel
    _TP = init_model_parallel_group(group_ranks,
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 773, in init_model_parallel_group
    return GroupCoordinator(
           ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/distributed/parallel_state.py", line 164, in __init__
    self.ca_comm = CustomAllreduce(
                   ^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 130, in __init__
    if not _can_p2p(rank, world_size):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce.py", line 31, in _can_p2p
    if not gpu_p2p_access_check(rank, i):
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.11/site-packages/vllm/distributed/device_communicators/custom_all_reduce_utils.py", line 227, in gpu_p2p_access_check
    result = pickle.loads(returned.stdout)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
_pickle.UnpicklingError: invalid load key, 'W'.
ERROR 08-25 16:17:42 api_server.py:171] RPCServer process died before responding to readiness probe
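
The failing frame is `pickle.loads(returned.stdout)` inside `gpu_p2p_access_check`, which unpickles the stdout of a helper subprocess. A load key of `'W'` suggests that plain text (for example a `WARNING ...` line such as the deprecated-`pynvml` warning shown above) was written to that stdout before the pickled payload. A hypothetical minimal illustration of this failure mode:

```python
import pickle

# Hypothetical illustration: the intended pickled result, preceded on stdout
# by a stray warning line. pickle.loads then sees 'W' as the load key.
payload = pickle.dumps({"0->1": True, "1->0": True})
polluted_stdout = b"WARNING 08-25 16:33:04 cuda.py:22] deprecated pynvml\n" + payload

try:
    pickle.loads(polluted_stdout)
except pickle.UnpicklingError as exc:
    print(exc)  # invalid load key, 'W'.
```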

Interestingly, when I switched to vllm==0.5.4, it loaded successfully. After that, when I switched back to vllm==0.5.5, it also loaded successfully. Could you please explain why this happens?


youkaichao commented 2 weeks ago

You can see this warning:

WARNING 08-25 16:33:04 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead. See https://pypi.org/project/pynvml for more information.

I think you should uninstall `pynvml`.

The bug you mention happens while trying to create a P2P cache file. When you switched to 0.5.4, the file was generated successfully, and when you switched back to 0.5.5, that part of the code no longer executes.
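
If it helps to rule out the hardware side, peer-to-peer access between the two GPUs can also be queried directly with PyTorch, independently of vLLM's cached result. A small sketch using the standard `torch.cuda.can_device_access_peer` API (note that vLLM's own check does a more thorough verification, so this is only a first-pass sanity check):

```python
import torch

# Query CUDA P2P capability between every pair of visible GPUs. This is a
# coarse version of the capability that vLLM's _can_p2p check is probing,
# without any cache file involved.
n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i != j:
            ok = torch.cuda.can_device_access_peer(i, j)
            print(f"GPU{i} -> GPU{j}: peer access {'supported' if ok else 'not supported'}")
```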

youkaichao commented 2 weeks ago

I improved the warning message in https://github.com/vllm-project/vllm/pull/7852; please take a look, @Mr-KenLee.

youkaichao commented 2 weeks ago

And https://github.com/vllm-project/vllm/pull/7853 should fix this problem. @Mr-KenLee, please give it a try.

Mr-KenLee commented 2 weeks ago

@youkaichao Thank you very much! I will give it a try immediately.