unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Error while saving to GGUF model #626

Closed MaheshAwasare closed 4 months ago

MaheshAwasare commented 5 months ago

Hi @danielhanchen, I tried to save a GGUF model but got an error for the following code block.

```python
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
```

The following error is thrown:

```
/usr/local/lib/python3.10/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_dtype, is_sentencepiece, model_directory, quantization_method, first_conversion, _run_installer)
    935     elif first_conversion == "q8_0" : pass
    936     else:
--> 937         raise RuntimeError(
    938             f"Unsloth: first_conversion can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not {first_conversion}."
    939         )
```

RuntimeError: Unsloth: first_conversion can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not f16.
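Note that f16 is itself in the allowed list, so the check shown in the traceback should normally pass. A minimal sketch of an equivalent check (reconstructed from the traceback above, not copied from Unsloth's source) shows that the bare string "f16" gets through it, which suggests the value reaching the comparison had been mangled somewhere upstream:

```python
# Sketch reconstructed from the traceback above; not Unsloth's actual implementation.
ALLOWED_FIRST_CONVERSIONS = ("f16", "bf16", "f32", "q8_0")

def check_first_conversion(first_conversion):
    # A plain "f16" string passes this membership test, so the RuntimeError above
    # implies first_conversion was no longer the bare string "f16" by the time
    # it was compared.
    if first_conversion not in ALLOWED_FIRST_CONVERSIONS:
        raise RuntimeError(
            f"Unsloth: first_conversion can only be one of "
            f"{list(ALLOWED_FIRST_CONVERSIONS)} and not {first_conversion}."
        )

check_first_conversion("f16")  # passes as expected, no error raised
```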

URL - https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing#scrollTo=FqfebeAdT073

MaheshAwasare commented 5 months ago

More details

```python
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)  # Trying to save this
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

# Save to 16bit GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

# Save to q4_k_m GGUF
if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
```

```
Unsloth: Will remove a cached repo with size 1.2K
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.1 out of 12.67 RAM for saving.
100%|██████████| 32/32 [01:36<00:00, 3.01s/it]
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving model/pytorch_model-00004-of-00004.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to q8_0 will take 20 minutes.
 "-____-"     In total, you will have to wait around 26 minutes.
```


```
RuntimeError                              Traceback (most recent call last)
in <cell line: 2>()
      1 # Save to 8bit Q8_0
----> 2 if True: model.save_pretrained_gguf("model", tokenizer,)
      3 if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")
      4 
      5 # Save to 16bit GGUF
```

```
1 frames
/usr/local/lib/python3.10/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_dtype, is_sentencepiece, model_directory, quantization_method, first_conversion, _run_installer)
    935     elif first_conversion == "q8_0" : pass
    936     else:
--> 937         raise RuntimeError(
    938             f"Unsloth: first_conversion can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not {first_conversion}."
    939         )
```

RuntimeError: Unsloth: first_conversion can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not f16.

What could be the issue?

danielhanchen commented 5 months ago

@MaheshAwasare Apologies! Just fixed it! Please update Unsloth on a local machine (on Colab or Kaggle, just restart the runtime) with:

```
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
```
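If the old build still seems to be picked up after reinstalling, a quick sanity check (a small sketch using only standard-library tooling, not a command from Unsloth itself) is to print the installed version before rerunning the export:

```python
# Sketch: confirm which Unsloth build is actually installed before retrying.
# On Colab/Kaggle, restart the runtime first so the fresh install is the one imported.
from importlib.metadata import version

print(version("unsloth"))
```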
MaheshAwasare commented 5 months ago

Thanks @danielhanchen, it works now.

```
Writing: 100%|██████████| 14.5G/14.5G [03:06<00:00, 77.8Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to 'model-unsloth.F16.gguf'
Unsloth: Conversion completed! Output location: ./model-unsloth.F16.gguf
```

danielhanchen commented 5 months ago

Great!

MaheshAwasare commented 5 months ago

I think the issue has surfaced again. I have intentionally redacted the actual user and repo, replacing them with user/repo for reporting the issue here; I am using a valid repo and a valid token.

```
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.14 out of 12.67 RAM for saving.
100%|██████████| 32/32 [01:10<00:00, 2.20s/it]
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving user/repo/pytorch_model-00001-of-00004.bin...
Unsloth: Saving user/repo/pytorch_model-00002-of-00004.bin...
Unsloth: Saving user/repo/pytorch_model-00003-of-00004.bin...
Unsloth: Saving user/repo/pytorch_model-00004-of-00004.bin...
Done.
```

```
RuntimeError                              Traceback (most recent call last)
in <cell line: 7>()
      5 # Save to 16bit GGUF
      6 if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
----> 7 if True: model.push_to_hub_gguf("user/repo", tokenizer, quantization_method = "f16", token = "")
      8 
      9 # Save to q4_k_m GGUF
```

```
1 frames
/usr/local/lib/python3.10/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_dtype, is_sentencepiece, model_directory, quantization_method, first_conversion, _run_installer)
    898     for key, value in ALLOWED_QUANTS.items():
    899         error += f"[{key}] => {value}\n"
--> 900     raise RuntimeError(error)
    901     pass
    902 
```

```
RuntimeError: Unsloth: Quant method = [f] not supported. Choose from below:
[not_quantized]  => Recommended. Fast conversion. Slow inference, big files.
[fast_quantized] => Recommended. Fast conversion. OK inference, OK file size.
[quantized]      => Recommended. Slow conversion. Fast inference, small files.
[f32]     => Not recommended. Retains 100% accuracy, but super slow and memory hungry.
[bf16]    => Bfloat16 - Fastest conversion + retains 100% accuracy. Slow and memory hungry.
[f16]     => Float16 - Fastest conversion + retains 100% accuracy. Slow and memory hungry.
[q8_0]    => Fast conversion. High resource use, but generally acceptable.
[q4_k_m]  => Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
[q5_k_m]  => Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
[q2_k]    => Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.
[q3_k_l]  => Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
[q3_k_m]  => Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
[q3_k_s]  => Uses Q3_K for all tensors
[q4_0]    => Original quant method, 4-bit.
[q4_1]    => Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.
[q4_k_s]  => Uses Q4_K for all tensors
[q4_k]    => alias for q4_k_m
[q5_k]    => alias for q5_k_m
[q5_0]    => Higher accuracy, higher resource usage and slower inference.
[q5_1]    => Even higher accuracy, resource usage and slower inference.
[q5_k_s]  => Uses Q5_K for all tensors
[q6_k]    => Uses Q8_K for all tensors
[q3_k_xs] => 3-bit extra small quantization
```
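For what it's worth, the error reports the quant method as [f] (a single character) even though the notebook passed "f16", which already hints at a regression in how the argument is handled rather than a problem with the notebook. A hedged illustration of one way a valid string could surface like that (my guess at the failure mode, not Unsloth's actual code or the fix that landed upstream):

```python
# Hedged illustration only: a guess at how a valid "f16" could be reported as "[f]".
quantization_method = "f16"

# If a single method string is ever treated as a list of methods and iterated,
# the first "method" seen is the character "f" -- matching the "[f]" in the error.
for method in quantization_method:  # iterates "f", "1", "6"
    print(f"Quant method = [{method}]")
    break

# Defensive wrapping in user code avoids the ambiguity entirely:
methods = [quantization_method] if isinstance(quantization_method, str) else list(quantization_method)
print(methods)  # ['f16']
```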

chrehall68 commented 5 months ago

There were breaking changes made in the nightly a couple of days ago, but they were fixed in #654. If you update Unsloth (see danielhanchen's instructions above) again, you should be fine!

MaheshAwasare commented 4 months ago

Yes, it is fixed now. Thanks!