vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight' #3900

Open · guangweiShaw opened this issue 3 months ago

guangweiShaw commented 3 months ago

Your current environment

3MIO:~/vllm$ python collect_env.py
Collecting environment information...
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.2 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.0
Libc version: glibc-2.35

Python version: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.146.1-microsoft-standard-WSL2-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 2080 Ti
Nvidia driver version: 551.23
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 5 5500
CPU family: 25
Model: 80
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
Stepping: 0
BogoMIPS: 7186.23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl tsc_reliable nonstop_tsc cpuid extd_apicid pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr arat umip vaes vpclmulqdq rdpid fsrm
Hypervisor vendor: Microsoft
Virtualization type: full
L1d cache: 64 KiB (2 instances)
L1i cache: 64 KiB (2 instances)
L2 cache: 1 MiB (2 instances)
L3 cache: 16 MiB (1 instance)
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] torch 2.1.2 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
      GPU0  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     N/A

Legend:

X    = Self
SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX  = Connection traversing at most a single PCIe bridge
NV#  = Connection traversing a bonded set of # NVLinks

GPU: RTX 2080 Ti 22G. Model: Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4

🐛 Describe the bug

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")

outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

3MIO:~$ python test.py
WARNING 04-08 00:16:41 config.py:211] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 04-08 00:16:41 llm_engine.py:74] Initializing an LLM engine (v0.4.0.post1) with config: model='Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4', tokenizer='Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, seed=0)
WARNING 04-08 00:16:41 config.py:406] Possibly too large swap space. 4.00 GiB out of the 9.72 GiB total CPU memory is allocated for the swap space.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
WARNING 04-08 00:16:42 utils.py:357] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 04-08 00:16:42 selector.py:40] Cannot use FlashAttention backend for Volta and Turing GPUs.
INFO 04-08 00:16:42 selector.py:25] Using XFormers backend.
Traceback (most recent call last):
  File "/home/xyq346708/test.py", line 46, in <module>
    llm = LLM(model="Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4")
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/entrypoints/llm.py", line 112, in __init__
    self.llm_engine = LLMEngine.from_engine_args(
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 196, in from_engine_args
    engine = cls(
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/engine/llm_engine.py", line 110, in __init__
    self.model_executor = executor_class(model_config, cache_config,
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 37, in __init__
    self._init_worker()
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/executor/gpu_executor.py", line 66, in _init_worker
    self.driver_worker.load_model()
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/worker/worker.py", line 107, in load_model
    self.model_runner.load_model()
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/worker/model_runner.py", line 95, in load_model
    self.model = get_model(
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/model_executor/model_loader.py", line 91, in get_model
    model = model_class(model_config.hf_config, linear_method)
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_moe.py", line 378, in __init__
    self.model = Qwen2MoeModel(config, linear_method)
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_moe.py", line 342, in __init__
    self.layers = nn.ModuleList([
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_moe.py", line 343, in <listcomp>
    Qwen2MoeDecoderLayer(config,
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_moe.py", line 284, in __init__
    self.mlp = Qwen2MoeSparseMoeBlock(config=config,
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
    self.pack_params()
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
    w1.append(expert.gate_up_proj.weight)
  File "/home/xyq346708/miniconda3/envs/moe6/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1695, in __getattr__
    raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
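
For context on where this comes from: vLLM's GPTQ linear layers do not register a plain weight tensor at all; they register packed parameters such as qweight, qzeros, and scales, so any code path that reaches for .weight on one of them fails exactly like this. A minimal standalone sketch of the mechanism (QuantLinear below is a hypothetical stand-in, not vLLM's actual class):

import torch
import torch.nn as nn

class QuantLinear(nn.Module):
    """Hypothetical stand-in for a GPTQ linear layer: it holds packed
    int32 weights as `qweight` and has no float `weight` attribute."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # 4-bit GPTQ packs 8 weights per int32 along the input dimension.
        self.qweight = nn.Parameter(
            torch.zeros(in_features // 8, out_features, dtype=torch.int32),
            requires_grad=False)

layer = QuantLinear(64, 128)
print(layer.qweight.shape)  # torch.Size([8, 128])
try:
    print(layer.weight)
except AttributeError as e:
    print(e)  # 'QuantLinear' object has no attribute 'weight'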

Jotakak-yu commented 3 months ago

same issue

chu-tianxiang commented 3 months ago

Quantized MoE models other than Mixtral (e.g., DeepSeek, Qwen-MoE, DBRX) are not supported by vLLM at the moment.
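
To make that concrete: the pack_params path in the tracebacks above gathers each expert's dense weight tensors and stacks them into fused [num_experts, ...] tensors for the fused-MoE kernel, which presumes plain unquantized linears. A simplified paraphrase of that code path (not a verbatim copy of vLLM's implementation):

import torch

def pack_params(experts):
    """Roughly what Qwen2MoeSparseMoeBlock.pack_params does: stack every
    expert's dense weights for the fused-MoE kernel."""
    w1, w2 = [], []
    for expert in experts:
        # A GPTQ expert exposes qweight/qzeros/scales but no .weight,
        # so this attribute access raises the AttributeError above.
        w1.append(expert.gate_up_proj.weight)
        w2.append(expert.down_proj.weight)
    # There is no equivalent packing for GPTQ's packed-int layout,
    # hence quantized MoE checkpoints cannot take this path.
    return torch.stack(w1, dim=0), torch.stack(w2, dim=0)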

wellcasa commented 3 months ago

When will vLLM support Qwen2-MoE?

wellcasa commented 3 months ago

qwen2moe-gptq-int4

BarryRun commented 1 month ago

Now that the new model Qwen2-57B-A14B-Instruct-GPTQ-Int4 has been released, when will vLLM support these quantized MoE models?

gree2 commented 1 month ago

command line

(vllm043) ailearn@gpts:/data/sdb/models$ cd /data/sdb/models/ ; CUDA_VISIBLE_DEVICES=0,1,2,3 python -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --port 8000 --max-model-len 21232 --model Qwen2-57B-A14B-Instruct-GPTQ-Int4 --served-model-name qwen --quantization gptq --dtype half --max-num-seqs 16 --enforce-eager --kv-cache-dtype fp8 --tensor-parallel-size 4

output

INFO 06-08 13:05:21 config.py:390] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
2024-06-08 13:05:24,440 INFO worker.py:1753 -- Started a local Ray instance.
INFO 06-08 13:05:25 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='Qwen2-57B-A14B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='Qwen2-57B-A14B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=21232, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, disable_custom_all_reduce=False, quantization=gptq_marlin, enforce_eager=True, kv_cache_dtype=fp8, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=qwen)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-08 13:05:35 selector.py:130] Cannot use FlashAttention-2 backend for FP8 KV cache.
INFO 06-08 13:05:35 selector.py:51] Using XFormers backend.
(RayWorkerWrapper pid=179233) INFO 06-08 13:05:35 selector.py:130] Cannot use FlashAttention-2 backend for FP8 KV cache.
(RayWorkerWrapper pid=179233) INFO 06-08 13:05:35 selector.py:51] Using XFormers backend.
INFO 06-08 13:05:38 utils.py:618] Found nccl from library libnccl.so.2
INFO 06-08 13:05:38 pynccl.py:65] vLLM is using nccl==2.20.5
(RayWorkerWrapper pid=178808) INFO 06-08 13:05:38 utils.py:618] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=178808) INFO 06-08 13:05:38 pynccl.py:65] vLLM is using nccl==2.20.5
WARNING 06-08 13:05:38 custom_all_reduce.py:158] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(RayWorkerWrapper pid=178808) WARNING 06-08 13:05:38 custom_all_reduce.py:158] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 06-08 13:05:39 selector.py:130] Cannot use FlashAttention-2 backend for FP8 KV cache.
INFO 06-08 13:05:39 selector.py:51] Using XFormers backend.
ERROR 06-08 13:05:39 worker_base.py:148] Error executing method load_model. This might cause deadlock in distributed execution.
ERROR 06-08 13:05:39 worker_base.py:148] Traceback (most recent call last):
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
ERROR 06-08 13:05:39 worker_base.py:148]     return executor(*args, **kwargs)
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
ERROR 06-08 13:05:39 worker_base.py:148]     self.model_runner.load_model()
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
ERROR 06-08 13:05:39 worker_base.py:148]     self.model = get_model(
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
ERROR 06-08 13:05:39 worker_base.py:148]     return loader.load_model(model_config=model_config,
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
ERROR 06-08 13:05:39 worker_base.py:148]     model = _initialize_model(model_config, self.load_config,
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
ERROR 06-08 13:05:39 worker_base.py:148]     return model_class(config=model_config.hf_config,
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 389, in __init__
ERROR 06-08 13:05:39 worker_base.py:148]     self.model = Qwen2MoeModel(config, cache_config, quant_config)
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 349, in __init__
ERROR 06-08 13:05:39 worker_base.py:148]     self.layers = nn.ModuleList([
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp>
ERROR 06-08 13:05:39 worker_base.py:148]     Qwen2MoeDecoderLayer(config,
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 290, in __init__
ERROR 06-08 13:05:39 worker_base.py:148]     self.mlp = Qwen2MoeSparseMoeBlock(config=config,
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
ERROR 06-08 13:05:39 worker_base.py:148]     self.pack_params()
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
ERROR 06-08 13:05:39 worker_base.py:148]     w1.append(expert.gate_up_proj.weight)
ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
ERROR 06-08 13:05:39 worker_base.py:148]     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
ERROR 06-08 13:05:39 worker_base.py:148] AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148] Error executing method load_model. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148] Traceback (most recent call last):
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     return executor(*args, **kwargs)
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     self.model_runner.load_model()
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     self.model = get_model(
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     return loader.load_model(model_config=model_config,
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     model = _initialize_model(model_config, self.load_config,
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     return model_class(config=model_config.hf_config,
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 389, in __init__
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     self.model = Qwen2MoeModel(config, cache_config, quant_config)
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 349, in __init__
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     self.layers = nn.ModuleList([
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp>
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     Qwen2MoeDecoderLayer(config,
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 290, in __init__
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     self.mlp = Qwen2MoeSparseMoeBlock(config=config,
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     self.pack_params()
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     w1.append(expert.gate_up_proj.weight)
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148]     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
(RayWorkerWrapper pid=178808) ERROR 06-08 13:05:39 worker_base.py:148] AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 317, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]:     super().__init__(*args, **kwargs)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
[rank0]:     self._init_workers_ray(placement_group)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 172, in _init_workers_ray
[rank0]:     self._run_workers("load_model",
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]:     driver_worker_output = self.driver_worker.execute_method(
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]:     raise e
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method
[rank0]:     return executor(*args, **kwargs)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 389, in __init__
[rank0]:     self.model = Qwen2MoeModel(config, cache_config, quant_config)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 349, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp>
[rank0]:     Qwen2MoeDecoderLayer(config,
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 290, in __init__
[rank0]:     self.mlp = Qwen2MoeSparseMoeBlock(config=config,
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
[rank0]:     self.pack_params()
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
[rank0]:     w1.append(expert.gate_up_proj.weight)
[rank0]:   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'?
(RayWorkerWrapper pid=179233) INFO 06-08 13:05:39 selector.py:130] Cannot use FlashAttention-2 backend for FP8 KV cache. [repeated 5x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(RayWorkerWrapper pid=179233) INFO 06-08 13:05:39 selector.py:51] Using XFormers backend. [repeated 5x across cluster]
(RayWorkerWrapper pid=179233) INFO 06-08 13:05:38 utils.py:618] Found nccl from library libnccl.so.2 [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) INFO 06-08 13:05:38 pynccl.py:65] vLLM is using nccl==2.20.5 [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) WARNING 06-08 13:05:38 custom_all_reduce.py:158] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly. [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148] Error executing method load_model. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 140, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     return executor(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model [repeated 6x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     self.model_runner.load_model() [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     self.model = get_model( [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     return loader.load_model(model_config=model_config, [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     model = _initialize_model(model_config, self.load_config, [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     return model_class(config=model_config.hf_config, [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__ [repeated 8x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     self.model = Qwen2MoeModel(config, cache_config, quant_config) [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     self.layers = nn.ModuleList([ [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp> [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     w1.append(expert.gate_up_proj.weight) [repeated 4x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     self.mlp = Qwen2MoeSparseMoeBlock(config=config, [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     self.pack_params() [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]   File "/home/ailearn/.conda/envs/vllm043/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__ [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148]     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'") [repeated 2x across cluster]
(RayWorkerWrapper pid=179233) ERROR 06-08 13:05:39 worker_base.py:148] AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight' [repeated 2x across cluster]
(vllm043) ailearn@gpts:/data/sdb/models$
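
Same failure on v0.4.3, and the interpreter's hint in the final frame ("Did you mean: 'qweight'?") points at the cause: the GPTQ checkpoint ships only packed tensors for the expert projections. That is easy to confirm from the checkpoint itself (a sketch using the safetensors API; the shard file name is illustrative):

from safetensors import safe_open

# List the tensors stored for the first expert in one checkpoint shard
# (the shard file name below is illustrative).
with safe_open("model-00001-of-00011.safetensors", framework="pt") as f:
    expert_keys = [k for k in f.keys() if ".experts.0." in k]
    print("\n".join(sorted(expert_keys)))
# Expect names like ...experts.0.gate_proj.qweight / .qzeros / .scales,
# with no plain ".weight" for the quantized projections.
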
whk6688 commented 1 month ago

+1

zifeiyu-tan commented 1 month ago

+1

LSC527 commented 1 week ago

Quantized MoE models other than Mixtral (e.g., DeepSeek, Qwen-MoE, DBRX) are not supported by vLLM at the moment.

@chu-tianxiang any plans on supporting this? Thanks.