GPTQ requires inverting the Hessian matrix, which has overhead relative to the size of the model. If you set `sequential_update=True` in your `GPTQModifier`, the quantization experiment will take a longer time, but there will be significantly reduced memory usage (since we will compute the Hessian for one layer at a time, rather than all at once).
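For reference, a minimal sketch of how this might look in a recipe, assuming the `llmcompressor` import paths and the `sequential_update` argument available in the version discussed in this issue:

```python
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Hypothetical recipe: SmoothQuant first migrates activation outliers into
# the weights, then GPTQ quantizes the weights. sequential_update=True makes
# GPTQ compute the Hessian for one layer at a time, trading runtime for a
# lower peak memory footprint.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(
        targets="Linear",
        scheme="W8A8",
        ignore=["lm_head"],
        sequential_update=True,
    ),
]
```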
Thank you for the detailed response. I have one more question regarding the algorithm clarification:
Could you explain why both GPTQ and SmoothQuant algorithms are used together for quantizing weights and activations? Is it possible to use SmoothQuant alone for both weights and activations, or does GPTQ provide specific benefits or advantages that SmoothQuant alone does not?
Given that the model we use in our business currently relies only on SmoothQuant, we are uncertain whether incorporating GPTQ quantization will impact the model's performance. As a more direct approach, would it be feasible to include only the `SmoothQuantModifier` in the recipe and exclude the `GPTQModifier`? If so, would that approach let us measure whether the model's performance is affected?
SmoothQuant is an algorithm that makes it easier to quantize the activations. GPTQ is an algorithm that quantizes the weights. Typically, GPTQ is useful for recovering accuracy when quantizing weights, especially for W4. However, if you want to use simple round-to-nearest quantization, you can swap out `GPTQModifier` for `QuantizationModifier`, which will use RTN. We are working on an implementation of `AWQModifier`.

The goal of `llm-compressor` is to make it easy to chain these various algorithms together.
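To make the swap concrete, here is a minimal sketch of a recipe that keeps SmoothQuant but uses round-to-nearest instead of GPTQ. The import paths and argument names (`targets`, `scheme`, `ignore`, `smoothing_strength`) are assumed to match the `llmcompressor` version discussed in this issue and should be treated as illustrative:

```python
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Same activation-smoothing step, but plain round-to-nearest (RTN) weight
# quantization instead of the Hessian-based GPTQ solve.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    QuantizationModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
```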
**Describe the bug**
When attempting to compress the Meta-Llama/Llama-2-13b-chat-hf model to W8A8 using a combination of GPTQ and SmoothQuant algorithms on an NVIDIA A800 GPU with 80GB of VRAM, I encountered a CUDA OOM (out of memory) error. The issue occurs specifically when compressing `model.layers.40`.
**Expected behavior**
The model's original weight file size is 26GB in FP16 format, so I expected that 80GB of GPU memory should be sufficient to complete the quantization. However, the OOM occurs before compression completes, suggesting that memory usage during the compression process significantly exceeds the weight file size.
**Environment**
PyTorch version: 2.4.0+cu121 Is debug build: False CUDA used to build PyTorch: 12.1 ROCM used to build PyTorch: N/A
OS: Ubuntu 22.04.4 LTS (x86_64) GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 Clang version: 14.0.0-1ubuntu1.1 CMake version: version 3.27.7 Libc version: glibc-2.35
Python version: 3.8.18 (default, Sep 11 2023, 13:40:15) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-5.4.119-19-0009.11-x86_64-with-glibc2.17 Is CUDA available: True CUDA runtime version: 12.4.131 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A800-SXM4-80GB Nvidia driver version: 470.182.03 cuDNN version: Probably one of the following: /usr/lib/x86_64-linux-gnu/libcudnn.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.1.0 /usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.1.0 HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True
CPU: Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 48 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 232 On-line CPU(s) list: 0-231 Vendor ID: AuthenticAMD Model name: AMD EPYC 7K83 64-Core Processor CPU family: 25 Model: 1 Thread(s) per core: 2 Core(s) per socket: 58 Socket(s): 2 Stepping: 1 BogoMIPS: 4890.80 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core invpcid_single ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat umip pku ospke vaes vpclmulqdq rdpid fsrm Hypervisor vendor: KVM Virtualization type: full L1d cache: 3.6 MiB (116 instances) L1i cache: 3.6 MiB (116 instances) L2 cache: 58 MiB (116 instances) L3 cache: 512 MiB (16 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-115 NUMA node1 CPU(s): 116-231 Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Full AMD retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected
Versions of relevant libraries: [pip3] numpy==1.24.4 [pip3] nvidia-cublas-cu11==11.10.3.66 [pip3] nvidia-cublas-cu12==12.1.3.1 [pip3] nvidia-cuda-cupti-cu11==11.7.101 [pip3] nvidia-cuda-cupti-cu12==12.1.105 [pip3] nvidia-cuda-nvrtc-cu11==11.7.99 [pip3] nvidia-cuda-nvrtc-cu12==12.1.105 [pip3] nvidia-cuda-runtime-cu11==11.7.99 [pip3] nvidia-cuda-runtime-cu12==12.1.105 [pip3] nvidia-cudnn-cu11==8.5.0.96 [pip3] nvidia-cudnn-cu12==9.1.0.70 [pip3] nvidia-cufft-cu11==10.9.0.58 [pip3] nvidia-cufft-cu12==11.0.2.54 [pip3] nvidia-curand-cu11==10.2.10.91 [pip3] nvidia-curand-cu12==10.3.2.106 [pip3] nvidia-cusolver-cu11==11.4.0.1 [pip3] nvidia-cusolver-cu12==11.4.5.107 [pip3] nvidia-cusparse-cu11==11.7.4.91 [pip3] nvidia-cusparse-cu12==12.1.0.106 [pip3] nvidia-dali-cuda120==1.36.0 [pip3] nvidia-ml-py==12.560.30 [pip3] nvidia-nccl-cu11==2.14.3 [pip3] nvidia-nccl-cu12==2.20.5 [pip3] nvidia-nvimgcodec-cu12==0.2.0.7 [pip3] nvidia-nvjitlink-cu12==12.4.99 [pip3] nvidia-nvtx-cu11==11.7.91 [pip3] nvidia-nvtx-cu12==12.1.105 [pip3] pynvml==11.5.0 [pip3] pyzmq==26.2.0 [pip3] torch==2.4.0 [pip3] torchvision==0.19.0 [pip3] transformers==4.44.2 [pip3] transformers-stream-generator==0.0.4 [pip3] triton==3.0.0 [pip3] tritonclient==2.43.0 [conda] numpy 1.24.4 pypi_0 pypi [conda] nvidia-cublas-cu11 11.10.3.66 pypi_0 pypi [conda] nvidia-cublas-cu12 12.1.3.1 pypi_0 pypi [conda] nvidia-cuda-cupti-cu11 11.7.101 pypi_0 pypi [conda] nvidia-cuda-cupti-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-nvrtc-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cuda-runtime-cu11 11.7.99 pypi_0 pypi [conda] nvidia-cuda-runtime-cu12 12.1.105 pypi_0 pypi [conda] nvidia-cudnn-cu11 8.5.0.96 pypi_0 pypi [conda] nvidia-cudnn-cu12 9.1.0.70 pypi_0 pypi [conda] nvidia-cufft-cu11 10.9.0.58 pypi_0 pypi [conda] nvidia-cufft-cu12 11.0.2.54 pypi_0 pypi [conda] nvidia-curand-cu11 10.2.10.91 pypi_0 pypi [conda] nvidia-curand-cu12 10.3.2.106 pypi_0 pypi [conda] nvidia-cusolver-cu11 11.4.0.1 pypi_0 pypi [conda] nvidia-cusolver-cu12 11.4.5.107 pypi_0 pypi [conda] nvidia-cusparse-cu11 11.7.4.91 pypi_0 pypi [conda] nvidia-cusparse-cu12 12.1.0.106 pypi_0 pypi [conda] nvidia-ml-py 12.560.30 pypi_0 pypi [conda] nvidia-nccl-cu11 2.14.3 pypi_0 pypi [conda] nvidia-nccl-cu12 2.20.5 pypi_0 pypi [conda] nvidia-nvjitlink-cu12 12.4.99 pypi_0 pypi [conda] nvidia-nvtx-cu11 11.7.91 pypi_0 pypi [conda] nvidia-nvtx-cu12 12.1.105 pypi_0 pypi [conda] pynvml 11.5.0 pypi_0 pypi [conda] pyzmq 26.2.0 pypi_0 pypi [conda] torch 2.4.0 pypi_0 pypi [conda] torchvision 0.19.0 pypi_0 pypi [conda] transformers 4.44.2 pypi_0 pypi [conda] transformers-stream-generator 0.0.4 pypi_0 pypi [conda] triton 3.0.0 pypi_0 pypi [conda] tritonclient 2.43.0 pypi_0 pypi ROCM Version: Could not collect Neuron SDK Version: N/A vLLM Version: 0.6.0@32e7db25365415841ebc7c4215851743fbb1bad1 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled GPU Topology: GPU0 CPU Affinity NUMA Affinity GPU0 X 116-231 1
Legend:
X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks
**To Reproduce**
Exact steps to reproduce the behavior: modify the `MODEL_ID` in `examples/quantization_w8a8_int8/llama3_example.py` to `meta-llama/Llama-2-13b-chat-hf`, then run the compression on one 80GB A800 GPU.

**Errors**
The issue occurs specifically when compressing `model.layers.40` with GPTQ; the SmoothQuant stage completes successfully.
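For context, the modified example might look roughly like the sketch below. It follows the general structure of the W8A8 int8 example at the time; the calibration dataset handling and exact argument names are assumptions and may differ from the actual script:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import SparseAutoModelForCausalLM, oneshot

MODEL_ID = "meta-llama/Llama-2-13b-chat-hf"  # changed from the Llama-3 default

model = SparseAutoModelForCausalLM.from_pretrained(
    MODEL_ID, device_map="auto", torch_dtype="auto"
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Calibration data (dataset choice and preprocessing are assumptions here).
ds = load_dataset("HuggingFaceH4/ultrachat_200k", split="train_sft")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(
    lambda ex: tokenizer(
        tokenizer.apply_chat_template(ex["messages"], tokenize=False),
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    ),
    remove_columns=ds.column_names,
)

# SmoothQuant smooths activations, then GPTQ quantizes weights to W8A8.
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Llama-2-13b-chat-hf-W8A8", save_compressed=True)
tokenizer.save_pretrained("Llama-2-13b-chat-hf-W8A8")
```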
**Additional context**
- Model: Meta-Llama/Llama-2-13b-chat-hf
- Compression tool: llm-compressor
- Quantization algorithms: GPTQ + SmoothQuant
- Hardware: NVIDIA A800 GPU (80GB VRAM)
- Quantization scheme: W8A8 (8-bit weights and activations)
**Questions**
1. Algorithm clarification: Why is it necessary to combine both GPTQ and SmoothQuant algorithms for quantizing weights and activations? Can SmoothQuant alone be used to quantize both weights and activations, or does GPTQ offer benefits that SmoothQuant alone cannot provide?
2. Memory usage issue: Given that the model weight file is 26GB in FP16, why is 80GB of GPU memory insufficient to complete quantization, particularly when the OOM happens at layer 40? What factors during the compression process could cause memory usage to increase so significantly (e.g., intermediate activations, temporary data structures, etc.)?