[Bug] Cannot run vLLM with tensor parallelism

KevinWu2017 commented 1 month ago

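Prerequisite

[X] I have searched the issues and discussions but did not get the expected help.
[X] The bug has not been fixed in the latest version.

Type

I am evaluating with officially supported tasks/models/datasets.

Environment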

{'CUDA available': True,
 'CUDA_HOME': '/usr/local/cuda',
 'GCC': 'gcc (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0',
 'GPU 0,1,2,3,4,5,6,7': 'NVIDIA GeForce RTX 4090',
 'MMEngine': '0.10.4',
 'MUSA available': False,
 'NVCC': 'Cuda compilation tools, release 12.4, V12.4.131',
 'OpenCV': '4.10.0',
 'PyTorch': '2.3.1+cu121',
 'PyTorch compiling details': 'PyTorch built with:\n'
                              '  - GCC 9.3\n'
                              '  - C++ Version: 201703\n'
                              '  - Intel(R) oneAPI Math Kernel Library Version '
                              '2022.2-Product Build 20220804 for Intel(R) 64 '
                              'architecture applications\n'
                              '  - Intel(R) MKL-DNN v3.3.6 (Git Hash '
                              '86e6af5974177e513fd3fee58425e1063e7f1361)\n'
                              '  - OpenMP 201511 (a.k.a. OpenMP 4.5)\n'
                              '  - LAPACK is enabled (usually provided by '
                              'MKL)\n'
                              '  - NNPACK is enabled\n'
                              '  - CPU capability usage: AVX512\n'
                              '  - CUDA Runtime 12.1\n'
                              '  - NVCC architecture flags: '
                              '-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n'
                              '  - CuDNN 8.9.2\n'
                              '  - Magma 2.6.1\n'
                              '  - Build settings: BLAS_INFO=mkl, '
                              'BUILD_TYPE=Release, CUDA_VERSION=12.1, '
                              'CUDNN_VERSION=8.9.2, '
                              'CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, '
                              'CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 '
                              '-fabi-version=11 -fvisibility-inlines-hidden '
                              '-DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO '
                              '-DLIBKINETO_NOROCTRACER -DUSE_FBGEMM '
                              '-DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK '
                              '-DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE '
                              '-O2 -fPIC -Wall -Wextra -Werror=return-type '
                              '-Werror=non-virtual-dtor -Werror=bool-operation '
                              '-Wnarrowing -Wno-missing-field-initializers '
                              '-Wno-type-limits -Wno-array-bounds '
                              '-Wno-unknown-pragmas -Wno-unused-parameter '
                              '-Wno-unused-function -Wno-unused-result '
                              '-Wno-strict-overflow -Wno-strict-aliasing '
                              '-Wno-stringop-overflow -Wsuggest-override '
                              '-Wno-psabi -Wno-error=pedantic '
                              '-Wno-error=old-style-cast -Wno-missing-braces '
                              '-fdiagnostics-color=always -faligned-new '
                              '-Wno-unused-but-set-variable '
                              '-Wno-maybe-uninitialized -fno-math-errno '
                              '-fno-trapping-math -Werror=format '
                              '-Wno-stringop-overflow, LAPACK_INFO=mkl, '
                              'PERF_WITH_AVX=1, PERF_WITH_AVX2=1, '
                              'PERF_WITH_AVX512=1, TORCH_VERSION=2.3.1, '
                              'USE_CUDA=ON, USE_CUDNN=ON, USE_CUSPARSELT=1, '
                              'USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, '
                              'USE_GLOG=OFF, USE_GLOO=ON, USE_MKL=ON, '
                              'USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, '
                              'USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, '
                              'USE_ROCM_KERNEL_ASSERT=OFF, \n',
 'Python': '3.10.14 (main, May  6 2024, 19:42:50) [GCC 11.2.0]',
 'TorchVision': '0.18.1+cu121',
 'numpy_random_seed': 2147483648,
 'opencompass': '0.2.6+c5074c0',
 'sys.platform': 'linux'}

Reproduces the problem - code/configuration sample

Just the built-in run.py file.

Reproduces the problem - command or script

CUDA_VISIBLE_DEVICES=4,5 python run.py --models vllm_mixtral_8x7b_v0_1 --datasets mmlu_gen -m infer --max-num-workers 1 --debug

Reproduces the problem - error message

/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/transformers/utils/hub.py:127: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
07/23 14:32:38 - OpenCompass - INFO - Loading mmlu_gen: configs/datasets/mmlu/mmlu_gen.py
07/23 14:32:38 - OpenCompass - INFO - Loading vllm_mixtral_8x7b_v0_1: configs/models/mistral/vllm_mixtral_8x7b_v0_1.py
07/23 14:32:38 - OpenCompass - INFO - Loading example: configs/summarizers/example.py
07/23 14:32:38 - OpenCompass - INFO - Current exp folder: outputs/default/20240723_143238
07/23 14:32:38 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
07/23 14:32:38 - OpenCompass - DEBUG - Modules of opencompass's partitioner registry have been automatically imported from opencompass.partitioners
07/23 14:32:38 - OpenCompass - DEBUG - Get class `NumWorkerPartitioner` from "partitioner" registry in "opencompass"
07/23 14:32:38 - OpenCompass - DEBUG - An `NumWorkerPartitioner` instance is built from registry, and its implementation can be found in opencompass.partitioners.num_worker
07/23 14:32:38 - OpenCompass - DEBUG - Key eval.runner.task.judge_cfg not found in config, ignored.
07/23 14:32:38 - OpenCompass - DEBUG - Key eval.runner.task.dump_details not found in config, ignored.
07/23 14:32:38 - OpenCompass - DEBUG - Key eval.given_pred not found in config, ignored.
07/23 14:32:38 - OpenCompass - DEBUG - Additional config: {}
07/23 14:32:38 - OpenCompass - INFO - Partitioned into 1 tasks.
07/23 14:32:38 - OpenCompass - DEBUG - Task 0: [mixtral-8x7b-v0.1-vllm/lukaemon_mmlu_college_biology,mixtral-8x7b-v0.1-vllm/lukaemon_mmlu_college_chemistry]
07/23 14:32:38 - OpenCompass - DEBUG - Modules of opencompass's runner registry have been automatically imported from opencompass.runners
07/23 14:32:38 - OpenCompass - DEBUG - Get class `LocalRunner` from "runner" registry in "opencompass"
07/23 14:32:38 - OpenCompass - DEBUG - An `LocalRunner` instance is built from registry, and its implementation can be found in opencompass.runners.local
07/23 14:32:38 - OpenCompass - DEBUG - Modules of opencompass's task registry have been automatically imported from opencompass.tasks
07/23 14:32:38 - OpenCompass - DEBUG - Get class `OpenICLInferTask` from "task" registry in "opencompass"
07/23 14:32:38 - OpenCompass - DEBUG - An `OpenICLInferTask` instance is built from registry, and its implementation can be found in opencompass.tasks.openicl_infer
07/23 14:32:39 - OpenCompass - INFO - Task [mixtral-8x7b-v0.1-vllm/lukaemon_mmlu_college_biology,mixtral-8x7b-v0.1-vllm/lukaemon_mmlu_college_chemistry]
07/23 14:32:39 - OpenCompass - DEBUG - Modules of opencompass's model registry have been automatically imported from opencompass.models
07/23 14:32:39 - OpenCompass - DEBUG - Get class `VLLM` from "model" registry in "opencompass"
INFO 07-23 14:32:39 config.py:695] Defaulting to use mp for distributed inference
INFO 07-23 14:32:39 llm_engine.py:174] Initializing an LLM engine (v0.5.2) with config: model='mistralai/Mixtral-8x7B-v0.1', speculative_config=None, tokenizer='mistralai/Mixtral-8x7B-v0.1', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=2, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=mistralai/Mixtral-8x7B-v0.1, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 07-23 14:32:40 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
(VllmWorkerProcess pid=2577027) Process VllmWorkerProcess:
(VllmWorkerProcess pid=2577027) Traceback (most recent call last):
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(VllmWorkerProcess pid=2577027)     self.run()
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=2577027)     self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/vllm/executor/multiproc_worker_utils.py", line 210, in _run_worker_process
(VllmWorkerProcess pid=2577027)     worker = worker_factory()
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 70, in _create_worker
(VllmWorkerProcess pid=2577027)     wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 326, in init_worker
(VllmWorkerProcess pid=2577027)     self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/vllm/worker/worker.py", line 90, in __init__
(VllmWorkerProcess pid=2577027)     self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 229, in __init__
(VllmWorkerProcess pid=2577027)     self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/vllm/attention/selector.py", line 45, in get_attn_backend
(VllmWorkerProcess pid=2577027)     backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/vllm/attention/selector.py", line 148, in which_attn_to_use
(VllmWorkerProcess pid=2577027)     if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/cuda/__init__.py", line 430, in get_device_capability
(VllmWorkerProcess pid=2577027)     prop = get_device_properties(device)
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/cuda/__init__.py", line 444, in get_device_properties
(VllmWorkerProcess pid=2577027)     _lazy_init()  # will define _get_device_properties
(VllmWorkerProcess pid=2577027)   File "/home/cpwu/miniconda3/envs/opencompass/lib/python3.10/site-packages/torch/cuda/__init__.py", line 279, in _lazy_init
(VllmWorkerProcess pid=2577027)     raise RuntimeError(
(VllmWorkerProcess pid=2577027) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
ERROR 07-23 14:32:41 multiproc_worker_utils.py:120] Worker VllmWorkerProcess pid 2577027 died, exit code: 1
INFO 07-23 14:32:41 multiproc_worker_utils.py:123] Killing local vLLM worker processes

Other information

No response

Mor-Li commented 1 month ago

Hi, the issue appears to be due to vLLM's inability to run the Mixtral model internally, rather than an issue with OpenCompass. I suggest trying to create a minimal reproducible script that excludes OpenCompass components. Instead, write a simple Python file to run this model using vLLM and see if it can be loaded successfully.

KevinWu2017 commented 1 month ago

Thank you for your reply. I created a minimal reproducible script, vllm_mixtral.py:

from vllm import LLM, SamplingParams

# Sample prompts.
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# Create a sampling params object.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Create an LLM.
llm = LLM(model="mistralai/Mixtral-8x7B-v0.1", tensor_parallel_size=8, download_dir="/home/data/huggingface", gpu_memory_utilization=0.9)

# Generate texts from the prompts. The output is a list of RequestOutput objects
# that contain the prompt, generated text, and other information.
outputs = llm.generate(prompts, sampling_params)
# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

I ran it with the command HF_HUB_OFFLINE=1 python vllm_mixtral.py, and it executed the model successfully.

KevinWu2017 commented 1 month ago

After trying more models, this issue appears to be related to tensor parallelism. When I edited the configuration file configs/models/qwen/vllm_qwen1_5_moe_a2_7b.py to set tensor_parallel_size=2 and num_gpus=2, the same issue occurred.

Mor-Li commented 1 month ago

Thank you for reporting the issue. To resolve this, try changing the tensor parallel parameter in the configuration file configs/models/mistral/vllm_mixtral_8x7b_v0_1.py to tensor_parallel_size=8. This change may enable the model to run correctly.

KevinWu2017 commented 1 month ago

After modifying configs/models/mistral/vllm_mixtral_8x7b_v0_1.py to set tensor_parallel_size=8 and num_gpus=8, I ran python run.py --models vllm_mixtral_8x7b_v0_1 --datasets mmlu_gen -m infer --max-num-workers 1. The log still shows the same problem.

Is there a specific environment version that can successfully run with tensor parallel? Are there any vllm, torch or opencompass version requirements?
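
For comparing environments when answering a version question like this, a small sketch along these lines can report what is installed. It assumes the PyPI distribution names are vllm, torch and opencompass; the helper name installed_versions is hypothetical. For reference, the logs above show vLLM v0.5.2, PyTorch 2.3.1+cu121 and opencompass 0.2.6+c5074c0.

```python
from importlib import metadata


def installed_versions(packages=("vllm", "torch", "opencompass")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Running it inside the same conda environment as run.py gives the versions that matter for the failing setup.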

KevinWu2017 commented 1 month ago

After some searching, this appears to be caused by a behavior change in vLLM since vllm-0.5.1, as mentioned here: https://github.com/vllm-project/vllm/pull/5669#issuecomment-2181625739. An easy workaround is to prefix the python run.py command with VLLM_WORKER_MULTIPROC_METHOD=spawn.
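
The same workaround can be sketched as a small Python launcher, for cases where prefixing the shell command is inconvenient. This is only an illustration: the helper names spawn_env and run_opencompass are hypothetical, and it assumes run.py is in the current working directory.

```python
import os
import subprocess
import sys


def spawn_env(base_env=None):
    """Return a copy of the environment with the vLLM worker start method set to spawn."""
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"
    return env


def run_opencompass(extra_args):
    """Launch run.py (assumed to be in the current directory) with the modified environment."""
    cmd = [sys.executable, "run.py", *extra_args]
    return subprocess.run(cmd, env=spawn_env()).returncode


# Example call, mirroring the command from the report:
# run_opencompass(["--models", "vllm_mixtral_8x7b_v0_1", "--datasets", "mmlu_gen",
#                  "-m", "infer", "--max-num-workers", "1"])
```

The variable is set on a copy of the environment, so the parent process is left unchanged and only the launched run.py sees it.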
