vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Qwen2 MoE: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'? #5343

Closed: geekwish closed this issue 4 months ago

geekwish commented 4 months ago

Your current environment

Collecting environment information...
PyTorch version: 2.3.0+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: 14.0.0-1ubuntu1.1
CMake version: version 3.29.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-6.5.0-15-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA GeForce RTX 2080 Ti
GPU 1: NVIDIA GeForce RTX 2080 Ti
GPU 2: NVIDIA GeForce RTX 2080 Ti
GPU 3: NVIDIA GeForce RTX 4090

Nvidia driver version: 535.154.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      43 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             64
On-line CPU(s) list:                0-63
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen Threadripper 2990WX 32-Core Processor
CPU family:                         23
Model:                              8
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          1
Stepping:                           2
Frequency boost:                    enabled
CPU max MHz:                        3000.0000
CPU min MHz:                        2200.0000
BogoMIPS:                           5999.32
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization:                     AMD-V
L1d cache:                          1 MiB (32 instances)
L1i cache:                          2 MiB (32 instances)
L2 cache:                           16 MiB (32 instances)
L3 cache:                           64 MiB (8 instances)
NUMA node(s):                       4
NUMA node0 CPU(s):                  0-7,32-39
NUMA node1 CPU(s):                  16-23,48-55
NUMA node2 CPU(s):                  8-15,40-47
NUMA node3 CPU(s):                  24-31,56-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Mitigation; untrained return thunk; SMT vulnerable
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.0
[pip3] transformers==4.41.2
[pip3] triton==2.3.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.3
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    GPU2    GPU3    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     SYS     SYS     0-7,32-39       0               N/A
GPU1    PHB      X      SYS     SYS     0-7,32-39       0               N/A
GPU2    SYS     SYS      X      PHB     8-15,40-47      2               N/A
GPU3    SYS     SYS     PHB      X      8-15,40-47      2               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I get an error when running Qwen2-57B-A14B-Instruct-GPTQ-Int4 with vLLM. This appears to be a bug in the code.

$ CUDA_VISIBLE_DEVICES=1,2 python -m vllm.entrypoints.openai.api_server \
    --served-model-name Qwen2-57B-A14B-Instruct-GPTQ-Int4 \
    --model /AIGC/Qwen/hf/Qwen2-57B-A14B-Instruct-GPTQ-Int4

Output:

WARNING 06-07 16:41:21 config.py:213] gptq quantization is not fully optimized yet. The speed can be slower than non-quantized models.
INFO 06-07 16:41:21 llm_engine.py:161] Initializing an LLM engine (v0.4.3) with config: model='/AIGC/Qwen/hf/Qwen2-57B-A14B-Instruct-GPTQ-Int4', speculative_config=None, tokenizer='/AIGC/Qwen/hf/Qwen2-57B-A14B-Instruct-GPTQ-Int4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=32768, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=gptq, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0, served_model_name=Qwen2-57B-A14B-Instruct-GPTQ-Int4)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 06-07 16:41:22 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-07 16:41:22 selector.py:51] Using XFormers backend.
INFO 06-07 16:41:23 selector.py:120] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 06-07 16:41:23 selector.py:51] Using XFormers backend.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
[rank0]:     return _run_code(code, main_globals, None,
[rank0]:   File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
[rank0]:     exec(code, run_globals)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 186, in <module>
[rank0]:     engine = AsyncLLMEngine.from_engine_args(
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 386, in from_engine_args
[rank0]:     engine = cls(
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 340, in __init__
[rank0]:     self.engine = self._init_engine(*args, **kwargs)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 462, in _init_engine
[rank0]:     return engine_class(*args, **kwargs)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 222, in __init__
[rank0]:     self.model_executor = executor_class(
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]:     self._init_executor()
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/executor/gpu_executor.py", line 24, in _init_executor
[rank0]:     self.driver_worker.load_model()
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/worker/worker.py", line 121, in load_model
[rank0]:     self.model_runner.load_model()
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/worker/model_runner.py", line 134, in load_model
[rank0]:     self.model = get_model(
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/__init__.py", line 21, in get_model
[rank0]:     return loader.load_model(model_config=model_config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 240, in load_model
[rank0]:     model = _initialize_model(model_config, self.load_config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/model_loader/loader.py", line 91, in _initialize_model
[rank0]:     return model_class(config=model_config.hf_config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 389, in __init__
[rank0]:     self.model = Qwen2MoeModel(config, cache_config, quant_config)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 349, in __init__
[rank0]:     self.layers = nn.ModuleList([
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 350, in <listcomp>
[rank0]:     Qwen2MoeDecoderLayer(config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 290, in __init__
[rank0]:     self.mlp = Qwen2MoeSparseMoeBlock(config=config,
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 114, in __init__
[rank0]:     self.pack_params()
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/vllm/model_executor/models/qwen2_moe.py", line 138, in pack_params
[rank0]:     w1.append(expert.gate_up_proj.weight)
[rank0]:   File "/AIGC/venvs/vllm/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1709, in __getattr__
[rank0]:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
[rank0]: AttributeError: 'MergedColumnParallelLinear' object has no attribute 'weight'. Did you mean: 'qweight'?

robertgshaw2-neuralmagic commented 4 months ago

We do not currently support quantization for MoE models other than Mixtral.
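
For readers following the traceback, here is a minimal sketch of the failure mode, using plain Python stand-ins. The classes `UnquantizedLinear` and `GPTQLinear` and the helper `pack_expert_weights` are hypothetical names made up for illustration; they are not vLLM's classes or its `pack_params` implementation. The sketch only assumes what the error message itself reports: the quantized layer exposes `qweight` and has no `weight`, while the packing step reads `.weight` from every expert.

```python
# Hypothetical stand-ins for illustration only; NOT vLLM's actual classes.
class UnquantizedLinear:
    """Dense layer: exposes its parameters under the name `weight`."""

    def __init__(self, rows, cols):
        self.weight = [[0.0] * cols for _ in range(rows)]


class GPTQLinear:
    """Quantized layer: stores packed values as `qweight`; has no `weight`."""

    def __init__(self, rows, cols):
        self.qweight = [[0] * cols for _ in range(rows)]
        self.scales = [1.0] * cols


def pack_expert_weights(experts):
    """Collect each expert's dense weight, assuming every expert has `.weight`."""
    return [expert.weight for expert in experts]


# Works: every unquantized expert has a `weight` attribute.
dense_experts = [UnquantizedLinear(2, 4) for _ in range(3)]
packed = pack_expert_weights(dense_experts)

# Fails: the quantized experts only have `qweight`.
quantized_experts = [GPTQLinear(2, 4) for _ in range(3)]
try:
    pack_expert_weights(quantized_experts)
    error = None
except AttributeError as exc:
    error = exc

print(type(error).__name__)  # AttributeError
```

In this sketch the packing step is written against the unquantized attribute layout, so it fails for any layer that stores its parameters under a different name.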

robertgshaw2-neuralmagic commented 4 months ago

I'm going to make an RFC for this.

jcxcer commented 4 months ago

+1