GPTQ requires inverting the Hessian matrix, which has overhead relative to the size of the model. If you set `sequential_update=True` in your `GPTQModifier`, the quantization experiment will take a longer time, but there will be significantly reduced memory usage (since we will compute the Hessian for one layer at a time, rather than all at once).
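For reference, a minimal sketch of how this might look in a recipe, assuming the `llmcompressor` import paths and the `sequential_update` argument available in the version discussed in this issue:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Hypothetical recipe: SmoothQuant first migrates activation outliers into
# the weights, then GPTQ quantizes the weights. sequential_update=True makes
# GPTQ compute the Hessian for one layer at a time, trading runtime for a
# lower peak memory footprint.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        sequential_update=True,
    ),
]
```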
Thank you for the detailed response. I have one more question regarding the algorithm clarification:
Could you explain why both GPTQ and SmoothQuant algorithms are used together for quantizing weights and activations? Is it possible to use SmoothQuant alone for both weights and activations, or does GPTQ provide specific benefits or advantages that SmoothQuant alone does not?
Given that the model we use in our business currently relies only on SmoothQuant, we are uncertain whether incorporating GPTQ quantization will impact the model's performance. As a more direct approach, would it be feasible to include only the `SmoothQuantModifier` in the recipe and exclude the `GPTQModifier`? If so, would that approach let us measure whether the model's performance is affected?
SmoothQuant is an algorithm that makes it easier to quantize the activations. GPTQ is an algorithm that quantizes the weights. Typically, GPTQ is useful for recovering accuracy when quantizing weights, especially for W4. However, if you want to use simple round-to-nearest quantization, you can swap out `GPTQModifier` for `QuantizationModifier`, which will use RTN. We are working on an implementation of `AWQModifier`.

The goal of `llm-compressor` is to make it easy to chain these various algorithms together.
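To make the swap concrete, here is a minimal sketch of a recipe that keeps SmoothQuant but uses round-to-nearest instead of GPTQ. The import paths and argument names (`targets`, `scheme`, `ignore`, `smoothing_strength`) are assumed to match the `llmcompressor` version discussed in this issue and should be treated as illustrative:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Same activation-smoothing step, but plain round-to-nearest (RTN) weight
# quantization instead of the Hessian-based GPTQ solve.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    QuantizationModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
```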
**Describe the bug**
When attempting to compress the Meta-Llama/Llama-2-13b-chat-hf model to W8A8 using a combination of GPTQ and SmoothQuant algorithms on an NVIDIA A800 GPU with 80GB of VRAM, I encountered a CUDA OOM (out of memory) error. The issue occurs specifically when compressing `model.layers.40`.
**Expected behavior**
The model's original weight file size is 26GB in FP16 format, so I expected that 80GB of GPU memory should be sufficient to complete the quantization. However, the OOM occurs before compression completes, suggesting that memory usage during the compression process significantly exceeds the weight file size.
**Environment**
PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: 14.0.0-1ubuntu1.1 CMake version: version 3.27.7 Libc version: glibc-2.35
Python version: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.4.119-19-0009.11-x86_64-with-glibc2.17 Is CUDA available: True CUDA runtime version: 12.4.131 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A800-SXM4-80GB Nvidia driver version: 470.182.03 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True
CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 232 On-line CPU(s) list: 0-231 Vendor ID: AuthenticAMD Model name: AMD EPYC 7K83 64-Core Processor CPU family: 25 Model: 1 Thread(s) per core: 2 Core(s) per socket: 58 Socket(s): 2 Stepping: 1 BogoMIPS: 4890.80 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid fsrm Hypervisor vendor: KVM Virtualization type: full L1d cache: 3.6 MiB (116 instances) L1i cache: 3.6 MiB (116 instances) L2 cache: 58 MiB (116 instances) L3 cache: 512 MiB (16 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-115 NUMA node1 CPU(s): 116-231 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected
Versions of relevant libraries: [pip3] numpy==1.24.4 [pip3] nvidia-cublas-cu11==11.10.3.66 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu11==11.7.101 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu11==11.7.99 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu11==11.7.99 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu11==8.5.0.96 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu11==10.9.0.58 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu11==10.2.10.91 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu11==11.4.0.1 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu11==11.7.4.91 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-dali-cuda120==1.36.0 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu11==2.14.3 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvimgcodec-cu12==0.2.0.7 [pip3] nvidia-nvjitlink-cu12==12.4.99 [pip3] nvidia-nvtx-cu11==11.7.91 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pynvml==11.5.0 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.44.2 [pip3] transformers-stream-generator==0.0.4 [pip3] triton==3.0.0 [pip3] tritonclient==2.43.0 [conda] numpy 1.24.4 pypi_0 pypi [conda] nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi [conda] nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu11 10.2.10.91 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.560.30 pypi_0 pypi [conda] nvidia-nccl-cu11 2.14.3 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.4.99 pypi_0 pypi [conda] nvidia-nvtx-cu11 11.7.91 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pynvml 11.5.0 pypi_0 pypi [conda] pyzmq 26.2.0 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] transformers 4.44.2 pypi_0 pypi [conda] transformers-stream-generator 0.0.4 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi [conda] tritonclient 2.43.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU0 X 116-231 1
Legend:
X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks
**To Reproduce**
Exact steps to reproduce the behavior: modify the `MODEL_ID` in `examples/quantization_w8a8_int8/llama3_example.py` to `meta-llama/Llama-2-13b-chat-hf`, then run the compression on one 80GB A800 GPU.

**Errors**
The issue occurs specifically when compressing `model.layers.40` with GPTQ; the SmoothQuant stage completes successfully.
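For context, the modified example might look roughly like the sketch below. It follows the general structure of the W8A8 int8 example at the time; the calibration dataset handling and exact argument names are assumptions and may differ from the actual script:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # changed from the Llama-3 default

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Calibration data (dataset choice and preprocessing are assumptions here).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(
    lambda ex: tokenizer(
        tokenizer.apply_chat_template(ex["messages"], tokenize=False),
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# SmoothQuant smooths activations, then GPTQ quantizes weights to W8A8.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Llama-2-13b-chat-hf-W8A8", save_compressed=True)
tokenizer.save_pretrained("Llama-2-13b-chat-hf-W8A8")
```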
**Additional context**
- Model: Meta-Llama/Llama-2-13b-chat-hf
- Compression tool: llm-compressor
- Quantization algorithms: GPTQ + SmoothQuant
- Hardware: NVIDIA A800 GPU (80GB VRAM)
- Quantization scheme: W8A8 (8-bit weights and activations)
**Questions**
1. Algorithm clarification: Why is it necessary to combine both GPTQ and SmoothQuant algorithms for quantizing weights and activations? Can SmoothQuant alone be used to quantize both weights and activations, or does GPTQ offer benefits that SmoothQuant alone cannot provide?
2. Memory usage issue: Given that the model weight file is 26GB in FP16, why is 80GB of GPU memory insufficient to complete quantization, particularly when the OOM happens at layer 40? What factors during the compression process could cause memory usage to increase so significantly (e.g., intermediate activations, temporary data structures, etc.)?