vllm-project / vllm

A high-throughput and memory-efficient inference and serving engine for LLMs
https://docs.vllm.ai
Apache License 2.0

[Bug]: Conflicts when using the AutoAWQ Marlin method with vLLM #6985

Closed jokmingwong closed 2 months ago

jokmingwong commented 2 months ago

Your current environment

PyTorch version: 2.3.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

Python version: 3.11.6 (main, Jun 19 2024, 15:40:26) (64-bit runtime)
Python platform: Linux-5.4.241-1-tlinux4-0017.1-x86_64-with-glibc2.38
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A10
Nvidia driver version: 545.23.06
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 56
On-line CPU(s) list: 0-55
Vendor ID: AuthenticAMD
Model name: AMD EPYC 7K83 64-Core Processor
CPU family: 25
Model: 1
Thread(s) per core: 2
Core(s) per socket: 28
Socket(s): 1
Stepping: 0
BogoMIPS: 5090.43
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 arat fsrm
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 896 KiB (28 instances)
L1i cache: 896 KiB (28 instances)
L2 cache: 14 MiB (28 instances)
L3 cache: 128 MiB (4 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-55
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.20.5
[pip3] torch==2.3.1
[pip3] torchvision==0.18.1
[pip3] transformers==4.42.4
[pip3] triton==2.3.1
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.5.3.post1
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X 0-55 0 N/A

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

When I use the AutoAWQ Marlin method to quantize a model, I must set the "zero_point" variable to False, as in the code below. Otherwise, if I set "zero_point" to True, I hit an error at this line: https://github.com/casper-hansen/AutoAWQ/blob/main/awq/modules/linear/marlin.py#L116

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = 'mistralai/Mistral-7B-Instruct-v0.2'
quant_path = 'mistral-instruct-v0.2-awq'
# To use Marlin, you must specify zero point as False and version as Marlin.
quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" } # Changed here <<<<<

# Load model
model = AutoAWQForCausalLM.from_pretrained(
    model_path, **{"low_cpu_mem_usage": True, "use_cache": False}
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

print(f'Model is quantized and saved at "{quant_path}"')

However, vLLM cannot support AWQ Marlin when there are no zero points (has_zp=False), since "is_sym" is always False here: https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L46-L48 https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/layers/quantization/utils/marlin_utils.py#L87-L91
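For context, here is a rough, illustrative sketch of that compatibility check (the function name and constants below are simplified stand-ins, not vLLM's exact code): the awq_marlin path expects AWQ's usual asymmetric layout, i.e. low-bit unsigned weights with runtime zero points.

# Illustrative sketch only (not the actual vLLM source): why an AutoAWQ
# checkpoint saved with zero_point=False is rejected by the awq_marlin path.
def is_awq_marlin_compatible(quant_config: dict) -> bool:
    num_bits = quant_config.get("w_bit")
    group_size = quant_config.get("q_group_size")
    has_zp = quant_config.get("zero_point")

    # awq_marlin assumes the standard AWQ layout: zero points are required.
    return num_bits in (4, 8) and group_size in (-1, 32, 64, 128) and bool(has_zp)

# A config produced with {"zero_point": False, "version": "Marlin"} fails this check.
print(is_awq_marlin_compatible({"w_bit": 4, "q_group_size": 128, "zero_point": False}))  # -> False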

The vLLM benchmark command I use is:

python3 vllm/benchmarks/benchmark_throughput.py --model {awq_marlin_path} --quantization awq_marlin --input-len 4096 --output-len 128

So it is confusing: I quantized the model with AutoAWQ's Marlin method and then deployed it with vLLM. Does vLLM not support models quantized with AutoAWQ Marlin, or did I make a mistake in the scripts above? Thanks for the answer!

wanzhenchn commented 2 months ago

The problem described above also occurred in my tests.

The models were quantized with the AutoAWQ package following the doc: https://github.com/casper-hansen/AutoAWQ/blob/main/docs/examples.md

quant_config = { "zero_point": False, "q_group_size": 128, "w_bit": 4, "version": "Marlin" }

Then the quantized models were loaded with vLLM by passing --quantization awq_marlin.

[error screenshot]

mgoin commented 2 months ago

Please do not specify --quantization awq_marlin or any quantization scheme; vLLM will detect it automatically. The issue should go away if you remove the quantization argument.
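For example, a minimal sketch of loading such a checkpoint with vLLM's offline API (the model path below is a placeholder):

from vllm import LLM, SamplingParams

# Load a regular AWQ checkpoint without passing quantization="awq_marlin";
# vLLM inspects the checkpoint's quantization_config and picks the
# awq_marlin kernel automatically when the GPU supports it.
llm = LLM(model="/path/to/awq_checkpoint")  # placeholder path

outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)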

jokmingwong commented 2 months ago

I deleted the quantization argument, and the command is:

python3 vllm/benchmarks/benchmark_throughput.py --model {awq_marlin_path} --input-len 4096 --output-len 128

vLLM detects the configuration generated by AutoAWQ as "awq" rather than "awq_marlin", i.e. {"quant_method": "awq", "version": "Marlin", "zero_point": false}. A RuntimeError then occurs, as shown in the figure below:

[RuntimeError screenshot]

When I manually changed the model configuration's "quant_method" from "awq" to "awq_marlin", the same issue occurred:

[error screenshot]

NOTE: The benchmark_throughput.py mentioned above is the same file as vllm/benchmarks/benchmark_throughput.py.

mgoin commented 2 months ago

# To use Marlin, you must specify zero point as False and version as Marlin.

This is not true anymore. We take care of converting any AWQ format model into Marlin format inside of vLLM. Please don't use a different "version" or disable zero point. Just produce a regular AWQ checkpoint as you would before Marlin.
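For reference, a standard AutoAWQ quant_config, roughly as in the AutoAWQ examples (assuming "GEMM" is the usual default version), would look like this:

# Regular AWQ quantization config: keep zero points enabled and use the
# default "GEMM" version. vLLM converts such a checkpoint to the Marlin
# format at load time on supported GPUs.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}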

Shawn314 commented 2 months ago

+1, any updates?

jokmingwong commented 2 months ago

Just use the AWQ model without "--quantization" and vLLM will detect it automatically. You will then see the INFO message that "the model is convertible to awq_marlin during runtime. Using awq_marlin kernel". My vLLM version is 0.5.4. I will close the issue.