vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0
26.72k stars, 3.91k forks

[Bug]: online fp8 quantization with jais model got assert error due to cutlass_scaled_mm() #7550

Open ZJLi2013 opened 1 month ago

ZJLi2013 commented 1 month ago

Your current environment

The output of `python collect_env.py`

Nvidia driver version: 555.42.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 384
On-line CPU(s) list: 0-383
Thread(s) per core: 2
Core(s) per socket: 96
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 25
Model: 17
Model name: AMD EPYC 9654 96-Core Processor
Stepping: 1
Frequency boost: enabled
CPU MHz: 1500.000
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4799.99
Virtualization: AMD-V
L1d cache: 6 MiB
L1i cache: 6 MiB
L2 cache: 192 MiB
L3 cache: 768 MiB
NUMA node0 CPU(s): 0-95,192-287
NUMA node1 CPU(s): 96-191,288-383
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d

Versions of relevant libraries:
[pip3] flashinfer==0.1.2+cu121torch2.4
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] pyzmq==26.1.0
[pip3] torch==2.4.0
[pip3] torchvision==0.19.0
[pip3] transformers==4.44.0
[pip3] triton==3.0.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.4@
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled

🐛 Describe the bug

Running benchmark_throughput.py with the jais-13B/30B models and online fp8 quantization fails with an assertion error. The command used:

 python3 /vllm/benchmarks/benchmark_throughput.py --model core42/jais-13b-chat  --num-prompts $req -tp $tp --distributed-executor-backend mp --input-len $inp --output-len $out --trust-remote-code --dtype auto --enforce-eager   --quantization fp8

The error log:

[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 206, in forward
[rank0]:     feed_forward_hidden_states = self.mlp(hidden_states)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/models/jais.py", line 162, in forward
[rank0]:     hidden_states2, _ = self.c_fc2(hidden_states)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/linear.py", line 355, in forward
[rank0]:     output_parallel = self.quant_method.apply(self, input_, bias)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/fp8.py", line 240, in apply
[rank0]:     return apply_fp8_linear(
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/quantization/utils/w8a8_utils.py", line 126, in apply_fp8_linear
[rank0]:     return ops.cutlass_scaled_mm(qinput,
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 32, in wrapper
[rank0]:     return fn(*args, **kwargs)
[rank0]:   File "/usr/local/lib/python3.10/dist-packages/vllm/_custom_ops.py", line 242, in cutlass_scaled_mm
[rank0]:     assert (b.shape[0] % 16 == 0 and b.shape[1] % 16 == 0)

mgoin commented 4 weeks ago

@tlrmchlsmth @varun-sundar-rabindranath looks like we have a model with weight shape not divisible by 16 😬

From the config, it looks like it is probably this param: "n_inner": 13653, which is not divisible by 16 (13653 = 16 × 853 + 5).
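
To make the failing condition concrete, here is a minimal pure-Python sketch of the divisibility check from the traceback, applied to the `n_inner` value above. The helper names `is_aligned` and `round_up` are hypothetical and not part of vLLM, and the second dimension (5120) is an assumed placeholder, not a value taken from this issue. Rounding up to the next multiple of 16 is shown only to illustrate how far off the shape is; it is not a confirmed fix.

```python
# Alignment required by the assert in cutlass_scaled_mm (see traceback above).
ALIGNMENT = 16


def is_aligned(shape, alignment=ALIGNMENT):
    """Hypothetical helper: True if every dimension is a multiple of `alignment`."""
    return all(dim % alignment == 0 for dim in shape)


def round_up(dim, alignment=ALIGNMENT):
    """Hypothetical helper: smallest multiple of `alignment` that is >= dim."""
    return -(-dim // alignment) * alignment


n_inner = 13653          # from the model config quoted above
other_dim = 5120         # ASSUMPTION: placeholder for the other weight dimension

weight_shape = (n_inner, other_dim)

print("remainder of n_inner mod 16:", n_inner % ALIGNMENT)
print("shape passes the check:", is_aligned(weight_shape))
print("next multiple of 16 above n_inner:", round_up(n_inner))
```

With `n_inner = 13653` the remainder is 5, so any weight whose dimension equals `n_inner` fails the `% 16 == 0` check regardless of the other dimension.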