turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs
MIT License

CUDA OOM when quantizing Llama-3-8B-Instruct #419

Closed dog3-l0ver closed 2 months ago

dog3-l0ver commented 2 months ago

Hello, I hope this is not an oversight on my part, but I am unable to quantize the new Llama-3-8B-Instruct.

I have an 8 GB NVIDIA RTX 2070 and 64 GB of RAM. I'm on the latest version, 0.0.19, and I'm quantizing the same Llama-3 reupload from NousResearch that you used, with the following command:

python convert.py -i ./model/Meta-Llama-3-8B-Instruct -o work_dir -cf ./model/Meta-Llama-3-8B-Instruct_exl2_4_25 -hb 6 -b 4.25 -c ./dataset/WizardLM_evol_instruct_70k.parquet -nr

I have previously been successful in quantizing multiple other models of up to 11B parameters (I haven't tried anything larger) using the same command (with different paths, of course) on version 0.0.17. I will try quantizing the other models with version 0.0.19 to see whether this is model-specific; meanwhile, this is what I get when resuming quantization after the OOM (it always happens after layer 31, regardless of whether I resume or start from scratch):

-- Resuming job
 !! Note: Overriding options with settings from existing job
 -- Input: ./model/Meta-Llama-3-8B-Instruct
 -- Output: work_dir
 -- Calibration dataset: ./dataset/WizardLM_evol_instruct_70k.parquet, 100 / 16 rows, 2048 tokens per sample
 -- Target bits per weight: 4.25 (decoder), 6 (head)
 -- Max shard size: 8192 MB
 -- Full model will be compiled to: ./model/Meta-Llama-3-8B-Instruct_exl2_4_25
 -- Quantizing...
 -- Layer: model.layers.30 (Attention)
 -- Linear: model.layers.30.self_attn.q_proj -> 0.1:5b_64g/0.9:4b_64g s4, 4.18 bpw
 -- Linear: model.layers.30.self_attn.k_proj -> 0.1:5b_64g/0.9:4b_64g s4, 4.20 bpw
 -- Linear: model.layers.30.self_attn.v_proj -> 1:5b_64g s4, 5.09 bpw
 -- Linear: model.layers.30.self_attn.o_proj -> 0.1:5b_64g/0.9:4b_64g s4, 4.18 bpw
 -- Module quantized, rfn_error: 0.013087
 -- Layer: model.layers.30 (MLP)
 -- Linear: model.layers.30.mlp.gate_proj -> 1:4b_128g s4, 4.03 bpw
 -- Linear: model.layers.30.mlp.up_proj -> 1:4b_32g s4, 4.13 bpw
 -- Linear: model.layers.30.mlp.down_proj -> 0.05:8b_32g/0.95:4b_128g s4, 4.25 bpw
 -- Module quantized, rfn_error: 0.027365
 -- Layer: model.layers.31 (Attention)
 -- Linear: model.layers.31.self_attn.q_proj -> 0.1:6b_128g/0.9:5b_128g s4, 5.16 bpw
 -- Linear: model.layers.31.self_attn.k_proj -> 0.1:6b_128g/0.9:5b_128g s4, 5.19 bpw
 -- Linear: model.layers.31.self_attn.v_proj -> 1:6b_128g s4, 6.06 bpw
 -- Linear: model.layers.31.self_attn.o_proj -> 0.1:6b_128g/0.9:5b_128g s4, 5.16 bpw
 -- Module quantized, rfn_error: 0.007477
 -- Layer: model.layers.31 (MLP)
 -- Linear: model.layers.31.mlp.gate_proj -> 0.1:6b_32g/0.9:5b_32g s4, 5.23 bpw
 -- Linear: model.layers.31.mlp.up_proj -> 0.25:6b_32g/0.75:5b_32g s4, 5.38 bpw
 -- Linear: model.layers.31.mlp.down_proj -> 0.05:8b_32g/0.1:6b_32g/0.85:5b_32g s4, 5.39 bpw
 -- Module quantized, rfn_error: 0.017178
 -- Layer: model.norm (RMSNorm)
 -- Module quantized, rfn_error: 0.000000
 -- Layer: lm_head (Linear)
 -- Linear: lm_head -> 0.15:8b_128g/0.85:6b_128g s4, 6.34 bpw
 !! Out of memory (Q), moving to device 1
Traceback (most recent call last):
  File "/home/dog3_l0ver/AI/text-generation-webui/exllamav2/convert.py", line 268, in <module>
    quant(job, save_job, model)
  File "/home/dog3_l0ver/AI/text-generation-webui/installer_files/env/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/dog3_l0ver/AI/text-generation-webui/exllamav2/conversion/quantize.py", line 416, in quant
    quant_lm_head(job, module, hidden_states, quantizers, attn_params, rtn)
  File "/home/dog3_l0ver/AI/text-generation-webui/exllamav2/conversion/quantize.py", line 206, in quant_lm_head
    quant_linear(job, module, q, qp.get_dict(), drop = True, rtn = rtn)
  File "/home/dog3_l0ver/AI/text-generation-webui/exllamav2/conversion/quantize.py", line 63, in quant_linear
    lq.quantize(keep_qweight = True, apply = True)
  File "/home/dog3_l0ver/AI/text-generation-webui/exllamav2/conversion/adaptivegptq.py", line 518, in quantize
    raise e
  File "/home/dog3_l0ver/AI/text-generation-webui/exllamav2/conversion/adaptivegptq.py", line 465, in quantize
    self.qweight = torch.zeros_like(weights, dtype = torch.short)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1002.00 MiB. GPU 0 has a total capacity of 7.78 GiB of which 1.10 GiB is free. Process 1727 has 10.78 MiB memory in use. Including non-PyTorch memory, this process has 5.17 GiB memory in use. Of the allocated memory 4.98 GiB is allocated by PyTorch, and 43.84 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
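
For reference, here is a minimal sketch (not part of convert.py) that reproduces the size of the buffer adaptivegptq.py tries to allocate at this point; the 128256 x 4096 shape is an assumption taken from Llama-3-8B's published config rather than read out of the quantizer:

```python
import torch

# Assumed lm_head shape for Llama-3-8B (from the published config,
# not read from the quantizer itself): vocab_size x hidden_size
vocab_size, hidden_size = 128256, 4096

# Same dtype as the failing torch.zeros_like(weights, dtype=torch.short) call:
# int16, i.e. 2 bytes per element
n_bytes = vocab_size * hidden_size * 2
print(f"qweight buffer: {n_bytes / 2**20:.2f} MiB")  # -> 1002.00 MiB, matching the OOM message

# Optional sanity check: can device 0 currently hold a buffer of that size?
if torch.cuda.is_available():
    try:
        buf = torch.zeros(vocab_size, hidden_size, dtype=torch.short, device="cuda:0")
        print("allocation succeeded")
    except torch.cuda.OutOfMemoryError:
        print("allocation failed: less than ~1 GiB of free VRAM on device 0")
```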

Just to be clear, I'm running Exllamav2 from within TGW's Conda environment. I have updated the exllamav2 package using the 0.0.19 .whl file from the releases page.

As a sidenote, thank you very much for this amazing work you give us for free. It still stings a little when people discuss having multiple 3090s or 4090s for LLMs, but thanks to you I can at least run a plethora of smaller, but still capable models at blazing fast speeds and without them being noticeably lobotomized haha.

EDIT: No problems quantizing WestLake-10.7B-v2 with version 0.0.19.

EDIT2: Exact same OOM when trying to quantize Llama-3 to 4 bpw instead of 4.25 bpw.

EDIT3: And again when trying 3 bpw. Same "Tried to allocate 1002.00 MiB" and still after layer 31.

EDIT4: Stopped SDDM, killed KWin, and made sure my VRAM was completely empty. Still OOM after layer 31. I refuse to believe an 8B model that's around 15 GB in size has a single layer bigger than 8 GB.

turboderp commented 2 months ago

Llama3-8B has a much larger output layer than Llama2-7B, so converting the final part of the model simply requires a lot more VRAM. I'm not really sure there's anything that can realistically be done about it.
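
To put rough numbers on that (a back-of-the-envelope sketch; the vocabulary and hidden sizes below come from the public Llama-2-7B and Llama-3-8B configs, not from exllamav2 itself):

```python
MiB = 2**20

def matrix_mib(rows: int, cols: int, bytes_per_element: int = 2) -> float:
    """Size of a rows x cols weight matrix, e.g. the fp16 lm_head weights or the
    temporary int16 qweight buffer the quantizer allocates alongside them."""
    return rows * cols * bytes_per_element / MiB

# vocab_size x hidden_size output layers
print(f"Llama-2-7B lm_head: {matrix_mib(32000, 4096):7.1f} MiB")   # ~250 MiB
print(f"Llama-3-8B lm_head: {matrix_mib(128256, 4096):7.1f} MiB")  # ~1002 MiB
```

Since the traceback above shows the temporary qweight buffer being created with torch.zeros_like(weights, ...), its size tracks the full weight matrix rather than the target bitrate, which would also explain why the 3 bpw and 4 bpw attempts fail on exactly the same 1002.00 MiB allocation.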

dog3-l0ver commented 2 months ago

Oh, that is quite unfortunate. I only recently learned how to quantize using your awesome piece of code so I could stop depending on others for EXL2 quants; it's a shame I'll be back to square one with all the upcoming Llama 3 fine-tunes and merges. I also apologize for my lacking knowledge of the model layers; I'm still very much learning, and my intuition, as it often does, has failed me. Thank you for taking the time to respond, and once again for your efforts on this project. Have a good day.