MaheshAwasare closed this issue 4 months ago.
```python
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)  # Trying to save this
if False: model.push_to_hub_gguf("hf/model", tokenizer, token = "")

if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "f16")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "f16", token = "")

if False: model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")
if False: model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
```
```
Unsloth: Will remove a cached repo with size 1.2K
Unsloth: Merging 4bit and LoRA weights to 16bit...
Unsloth: Will use up to 6.1 out of 12.67 RAM for saving.
100%|██████████| 32/32 [01:36<00:00,  3.01s/it]
Unsloth: Saving tokenizer... Done.
Unsloth: Saving model... This might take 5 minutes for Llama-7b...
Unsloth: Saving model/pytorch_model-00001-of-00004.bin...
Unsloth: Saving model/pytorch_model-00002-of-00004.bin...
Unsloth: Saving model/pytorch_model-00003-of-00004.bin...
Unsloth: Saving model/pytorch_model-00004-of-00004.bin...
Done.
==((====))==  Unsloth: Conversion from QLoRA to GGUF information
   \\   /|    [0] Installing llama.cpp will take 3 minutes.
O^O/ \_/ \    [1] Converting HF to GGUF 16bits will take 3 minutes.
\        /    [2] Converting GGUF 16bits to q8_0 will take 20 minutes.
 "-____-"     In total, you will have to wait around 26 minutes.
```
```
RuntimeError                              Traceback (most recent call last)
1 frames
/usr/local/lib/python3.10/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_dtype, is_sentencepiece, model_directory, quantization_method, first_conversion, _run_installer)
    935     elif first_conversion == "q8_0" : pass
    936     else:
--> 937         raise RuntimeError(
    938             f"Unsloth: `first_conversion` can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not `{first_conversion}`."
    939         )

RuntimeError: Unsloth: `first_conversion` can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not `f16`.
```
What could be the issue?
@MaheshAwasare Apologies! Just solved it! Please update Unsloth on a local machine (on Colab or Kaggle, just restart the runtime) with:
```bash
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
```
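As a quick sanity check (not part of the original instructions), you can confirm the reinstall actually picked up a new build; the version string you see will depend on your environment:

```python
# Print the installed Unsloth package version to confirm the reinstall took effect.
from importlib.metadata import version

print(version("unsloth"))
```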
Thanks @danielhanchen, it works now.
```
Writing: 100%|██████████| 14.5G/14.5G [03:06<00:00, 77.8Mbyte/s]
INFO:hf-to-gguf:Model successfully exported to 'model-unsloth.F16.gguf'
Unsloth: Conversion completed! Output location: ./model-unsloth.F16.gguf
```
Great!
I think the issue has surfaced again. I have purposely replaced the real user and repo with user/repo when logging the issue here; I am using a valid repo and a valid token.
```
RuntimeError                              Traceback (most recent call last)
1 frames
/usr/local/lib/python3.10/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_dtype, is_sentencepiece, model_directory, quantization_method, first_conversion, _run_installer)
    898     for key, value in ALLOWED_QUANTS.items():
    899         error += f"[{key}] => {value}\n"
--> 900     raise RuntimeError(error)
    901     pass
    902

RuntimeError: Unsloth: Quant method = [f] not supported. Choose from below:
[not_quantized]  => Recommended. Fast conversion. Slow inference, big files.
[fast_quantized] => Recommended. Fast conversion. OK inference, OK file size.
[quantized]      => Recommended. Slow conversion. Fast inference, small files.
[f32]      => Not recommended. Retains 100% accuracy, but super slow and memory hungry.
[bf16]     => Bfloat16 - Fastest conversion + retains 100% accuracy. Slow and memory hungry.
[f16]      => Float16 - Fastest conversion + retains 100% accuracy. Slow and memory hungry.
[q8_0]     => Fast conversion. High resource use, but generally acceptable.
[q4_k_m]   => Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q4_K
[q5_k_m]   => Recommended. Uses Q6_K for half of the attention.wv and feed_forward.w2 tensors, else Q5_K
[q2_k]     => Uses Q4_K for the attention.vw and feed_forward.w2 tensors, Q2_K for the other tensors.
[q3_k_l]   => Uses Q5_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
[q3_k_m]   => Uses Q4_K for the attention.wv, attention.wo, and feed_forward.w2 tensors, else Q3_K
[q3_k_s]   => Uses Q3_K for all tensors
[q4_0]     => Original quant method, 4-bit.
[q4_1]     => Higher accuracy than q4_0 but not as high as q5_0. However has quicker inference than q5 models.
[q4_k_s]   => Uses Q4_K for all tensors
[q4_k]     => alias for q4_k_m
[q5_k]     => alias for q5_k_m
[q5_0]     => Higher accuracy, higher resource usage and slower inference.
[q5_1]     => Even higher accuracy, resource usage and slower inference.
[q5_k_s]   => Uses Q5_K for all tensors
[q6_k]     => Uses Q8_K for all tensors
[q3_k_xs]  => 3-bit extra small quantization
```
There were breaking changes in the nightly a couple of days ago, but they were fixed in #654. If you update Unsloth again (see danielhanchen's instructions above), you should be fine!
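For reference, a minimal sketch of passing one of the allowed quantization methods explicitly, using the same calls as in the notebook above; the "hf/model" repo name and the empty token are placeholders:

```python
# Save locally with an explicitly allowed quantization method from the list above.
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q8_0")

# Or push the GGUF straight to the Hub (placeholder repo and token).
model.push_to_hub_gguf("hf/model", tokenizer, quantization_method = "q4_k_m", token = "")
```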
Yes it is fixed. Thanks
Hi @danielhanchen, I tried to save the GGUF model but got an error for the following code block.
```python
# Save to 8bit Q8_0
if True: model.save_pretrained_gguf("model", tokenizer,)
```
The following error is thrown:
```
/usr/local/lib/python3.10/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_dtype, is_sentencepiece, model_directory, quantization_method, first_conversion, _run_installer)
    935     elif first_conversion == "q8_0" : pass
    936     else:
--> 937         raise RuntimeError(
    938             f"Unsloth: `first_conversion` can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not `{first_conversion}`."
    939         )

RuntimeError: Unsloth: `first_conversion` can only be one of ['f16', 'bf16', 'f32', 'q8_0'] and not `f16`.
```

URL - https://colab.research.google.com/drive/135ced7oHytdxu3N2DNe1Z0kqjyYIkDXp?usp=sharing#scrollTo=FqfebeAdT073