intel / xFasterTransformer


Qwen2.5-0.5B-Instruct quantization with gptq error #480

Open wcollin opened 1 week ago

wcollin commented 1 week ago

xft version: 1.8.2

lscpu:

```
Architecture:            x86_64
CPU op-mode(s):          32-bit, 64-bit
Address sizes:           52 bits physical, 48 bits virtual
Byte Order:              Little Endian
CPU(s):                  16
On-line CPU(s) list:     0-15
Vendor ID:               GenuineIntel
Model name:              INTEL(R) XEON(R) PLATINUM 8576C
CPU family:              6
Model:                   207
Thread(s) per core:      2
Core(s) per socket:      8
Socket(s):               1
Stepping:                2
BogoMIPS:                5000.00
Flags:                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb ibrs_enhanced fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq cldemote movdiri movdir64b enqcmd fsrm serialize tsxldtrk amx_bf16 avx512_fp16 amx_tile amx_int8 arch_capabilities
Virtualization features:
  Hypervisor vendor:     KVM
  Virtualization type:   full
Caches (sum of all):
  L1d:                   384 KiB (8 instances)
  L1i:                   256 KiB (8 instances)
  L2:                    16 MiB (8 instances)
  L3:                    280 MiB (1 instance)
NUMA:
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-15
Vulnerabilities:
  Gather data sampling:  Not affected
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Unknown: No mitigations
  Reg file data sampling: Not affected
  Retbleed:              Not affected
  Spec rstack overflow:  Not affected
  Spec store bypass:     Mitigation; Speculative Store Bypass disabled via prctl
  Spectre v1:            Mitigation; usercopy/swapgs barriers and __user pointer sanitization
  Spectre v2:            Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI SW loop, KVM SW loop
  Srbds:                 Not affected
  Tsx async abort:       Not affected
```

basic_usage_wikitext2.py:

```python
pretrained_model_dir = "/data/models/Qwen2.5-0.5B-Instruct-AWQ"
quantized_model_dir = "/data/models/Qwen2.5-0.5B-Instruct-GPTQ"
```

```
root@fbbe4c067b4e:~/xFasterTransformer/3rdparty/AutoGPTQ/examples/quantization# python basic_usage_wikitext2.py
/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/nn_modules/triton_utils/kernels.py:411: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  def forward(ctx, input, qweight, scales, qzeros, g_idx, bits, maxq):
/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/nn_modules/triton_utils/kernels.py:419: FutureWarning: torch.cuda.amp.custom_bwd(args...) is deprecated. Please use torch.amp.custom_bwd(args..., device_type='cuda') instead.
  def backward(ctx, grad_output):
/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: torch.cuda.amp.custom_fwd(args...) is deprecated. Please use torch.amp.custom_fwd(args..., device_type='cuda') instead.
  @custom_fwd(cast_inputs=torch.float16)
CUDA extension not installed.
CUDA extension not installed.
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Token indices sequence length is longer than the specified maximum sequence length for this model (2518423 > 131072). Running this sequence through the model will result in indexing errors
Traceback (most recent call last):
  File "basic_usage_wikitext2.py", line 176, in <module>
    main()
  File "basic_usage_wikitext2.py", line 149, in main
    model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
  File "/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/modeling/auto.py", line 86, in from_pretrained
    return GPTQ_CAUSAL_LM_MODEL_MAP[model_type].from_pretrained(
  File "/root/xFasterTransformer/3rdparty/AutoGPTQ/auto_gptq/modeling/_base.py", line 604, in from_pretrained
    raise EnvironmentError("Load pretrained model to do quantization requires CUDA available.")
OSError: Load pretrained model to do quantization requires CUDA available.
```

miaojinc commented 5 days ago

Hi @wcollin, thanks for the report. AutoGPTQ is third-party code used by xFT; xFT itself only loads the quantized weights and runs inference on CPU. From the error message, the AutoGPTQ you installed is the CUDA build. You may need to reinstall it by building from source with BUILD_CUDA_EXT=0 to enable the CPU path, roughly as sketched below.
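A minimal sketch of what that rebuild might look like; the exact pip flags and repo path are assumptions, adjust them to your setup:

```bash
# Hypothetical sketch: rebuild AutoGPTQ from source without the CUDA extension.
# BUILD_CUDA_EXT=0 tells the AutoGPTQ build to skip compiling the CUDA kernels.
cd ~/xFasterTransformer/3rdparty/AutoGPTQ
pip uninstall -y auto-gptq
BUILD_CUDA_EXT=0 pip install -v .
```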

Several months ago I sent a CPU-support pull request to AutoGPTQ, but they do not seem interested and the PR has not been merged. You can refer to it to see how to quantize an LLM on CPU; it mainly comments out the CUDA-API-related code.
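For reference, here is a minimal sketch of a CPU-only quantization flow using the stock AutoGPTQ API, assuming AutoGPTQ was rebuilt with BUILD_CUDA_EXT=0 and the CUDA-availability check is removed as in that PR. The model paths, calibration text, and quantization settings below are placeholders for illustration, not the exact code from basic_usage_wikitext2.py:

```python
# Minimal sketch (assumes a CPU-enabled AutoGPTQ build; paths and settings are placeholders).
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "/data/models/Qwen2.5-0.5B-Instruct"
quantized_model_dir = "/data/models/Qwen2.5-0.5B-Instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)

# A typical GPTQ configuration: 4-bit weights, group size 128.
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

# Calibration examples; basic_usage_wikitext2.py builds these from wikitext2 instead.
examples = [
    tokenizer("xFasterTransformer runs GPTQ-quantized models on CPU.")
]

# With the CUDA check removed, this loads the unquantized model on CPU,
# runs the GPTQ calibration, and writes the quantized weights to disk.
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)
model.save_quantized(quantized_model_dir)
```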