unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0
18.58k stars 1.3k forks

OOM during saving to GGUF after training #1164

Open Frank995 opened 1 month ago

Frank995 commented 1 month ago

Hi. I'm saving my model to GGUF after training. These are the utilization metrics:

155.3186 seconds (2.59 minutes) used for training.
Peak reserved memory = 7.939 GB.
Peak reserved memory for training = 1.462 GB.
Peak reserved memory % of max memory = 36.127 %.
Peak reserved memory for training % of max memory = 6.653 %.

As you can see there's plenty of memory available. Nevertheless, when I run `model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")` it crashes with OOM:

File ~/miniconda3/lib/python3.12/site-packages/unsloth/save.py:152, in _merge_lora(layer, name)
    150 else:
    151     dtype = W.dtype
--> 152 W = W.to(torch.float32).t()
    153 # W = W.t()
    155 if A is not None:
    156     # sAB = (A.t().to(torch.float32) @ (s * B.t().to(torch.float32)))
    157     # W += sAB

OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU 0 has a total capacity of 21.98 GiB of which 53.12 MiB is free. Process 1841 has 6.76 GiB memory in use. Including non-PyTorch memory, this process has 15.15 GiB memory in use. Of the allocated memory 14.74 GiB is allocated by PyTorch, and 107.45 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
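The allocator message itself points at one mitigation: enabling expandable segments to reduce fragmentation. A minimal sketch of that suggestion (the variable must be set before PyTorch initializes CUDA, so either export it in the shell or set it before `import torch`):

```python
import os

# The CUDA caching allocator reads this variable when it first
# initializes, so this line must run before `import torch`
# (or export PYTORCH_CUDA_ALLOC_CONF in the launching shell).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
```

This only helps when the failure is due to fragmentation ("reserved but unallocated" memory); it does not reduce the total memory the merge needs.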
fajjos commented 1 month ago

If your GPU is out of memory, consider moving the model to CPU for the saving step to free GPU memory.
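That offloading idea can be sketched as follows. This is only a hedged sketch: `offload_before_save` is a hypothetical helper written for this thread, not an Unsloth API, and whether the GGUF export itself still allocates on the GPU depends on the Unsloth version.

```python
import gc

def offload_before_save(model):
    """Hypothetical helper: move the model to CPU and release cached
    CUDA blocks so the GGUF export step has GPU headroom."""
    model = model.to("cpu")  # offload weights to system RAM
    gc.collect()             # drop dangling Python references
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()  # return cached blocks to the driver
    except ImportError:
        pass  # torch absent: nothing on the GPU to free
    return model
```

Call it between training and `model.save_pretrained_gguf(...)`.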

fajjos commented 1 month ago

Or try using a less aggressive quantization method.

webbigdata-jp commented 2 weeks ago

Hi. I encountered this too. I think this error didn't exist in earlier versions. It takes some time, but I was able to avoid it with the code below.

    import torch
    from unsloth import FastLanguageModel

    # Split the model between GPU and CPU so the merge step has headroom.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=path_name,
        max_memory={"cpu": "30GiB", 0: "5GiB"},
        dtype=torch.bfloat16,
        device_map="auto",
        low_cpu_mem_usage=True,
    )
    model.save_pretrained_merged(merge_name, tokenizer, save_method="merged_16bit")

Edit `{"cpu": "30GiB", 0: "5GiB"}` for your environment: `"5GiB"` is your GPU memory budget, and the `"cpu"` entry is your system memory budget.