unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Error when saving unsloth Llama 3.1 model to GGUF format #965

Open okoliechykwuka opened 2 weeks ago

okoliechykwuka commented 2 weeks ago

I'm encountering an error while trying to save the default unsloth Llama 3.1 model to GGUF format. The issue occurs when running the code on Google Colab with a T4 GPU.

Environment:

Changes made:

Code snippet:

# Save to q4_k_m GGUF
if True: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if True: model.push_to_hub_gguf("chukypedro/testing", tokenizer, quantization_method = "q4_k_m", token = "api_key")

# Save to multiple GGUF options - much faster if you want multiple!
if True:
    model.push_to_hub_gguf(
        "myname/testing", # Change myname to your username!
        tokenizer,
        quantization_method = ["q4_k_m"],
        token = "", # Get a token at https://huggingface.co/settings/tokens
    )

Error traceback:

RuntimeError                              Traceback (most recent call last)
/usr/local/lib/python3.10/dist-packages/torch/serialization.py in __exit__(self, *args)
    497 
    498     def __exit__(self, *args) -> None:
--> 499         self.file_like.write_end_of_file()
    500         if self.file_stream is not None:
    501             self.file_stream.close()
RuntimeError: [enforce fail at inline_container.cc:603] . unexpected pos 576 vs 470

Expected behavior: The model should save successfully to GGUF format without any errors.

Actual behavior: The saving process fails with a RuntimeError, indicating an unexpected position in the file.

circle-games commented 2 weeks ago

I have the same error. I think the disk is filling up, so either use a smaller model or subscribe to Colab Pro to get more disk space.
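
A quick way to confirm whether disk space is the culprit before exporting (just a sketch using the Python standard library, run in the same Colab notebook):

import shutil

# Check free space on the Colab root filesystem before the GGUF export.
# The export first writes an F16 copy of the merged model, so you need
# roughly (F16 model size + quantized model size) of free disk.
total, used, free = shutil.disk_usage("/")
print(f"Free disk: {free / 1e9:.1f} GB of {total / 1e9:.1f} GB")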

Ammar-Alnagar commented 1 week ago

The pos error is a storage error: when saving as q4_k_m, the model gets written as two files, one q4_k_m and one F16, and that usually fills up the disk if you had a big dataset for finetuning or ran it for too many epochs. My advice is to save it as merged 16-bit, then use this Hugging Face Space to get a q4_k_m version of your model for free: https://huggingface.co/spaces/ggml-org/gguf-my-repo
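
For reference, a minimal sketch of that workflow, assuming the merged-save methods from the Unsloth notebooks (save_pretrained_merged / push_to_hub_merged); the repo name and token below are placeholders:

# Save/push the merged 16-bit weights instead of converting to GGUF on Colab.
# "chukypedro/testing" and "api_key" are placeholders - use your own repo and
# a Hugging Face write token. Then run the gguf-my-repo Space on that repo
# to produce the q4_k_m GGUF without using any Colab disk.
if True: model.save_pretrained_merged("model", tokenizer, save_method = "merged_16bit")
if True: model.push_to_hub_merged("chukypedro/testing", tokenizer, save_method = "merged_16bit", token = "api_key")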