unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

save_pretrained_gguf method RuntimeError: Unsloth: Quantization failed .... #356

Closed · weedge closed this 6 months ago

weedge commented 6 months ago


/usr/local/lib/python3.10/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_directory, quantization_method, first_conversion, _run_installer)
    955         )
    956     else:
--> 957         raise RuntimeError(
    958             f"Unsloth: Quantization failed for {final_location}\n"\
    959             "You might have to compile llama.cpp yourself, then run this again.\n"\

RuntimeError: Unsloth: Quantization failed for ./model_gguf_q8_0-unsloth.Q8_0.gguf
You might have to compile llama.cpp yourself, then run this again.
You do not need to close this Python program. Run the following commands in a new terminal:
You must run this in the same folder as you're saving your model.

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j

Once that's done, redo the quantization.
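For context, the error is raised from Unsloth's GGUF export. A minimal sketch of the kind of call that triggers it, assuming a fine-tuned model and tokenizer are already in memory (the checkpoint path, output directory, and quantization_method value below are illustrative, not taken from the original report):

from unsloth import FastLanguageModel

# Load the fine-tuned checkpoint (local path is an illustrative assumption).
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "model_merged_16bit",
    max_seq_length = 2048,
)

# save_pretrained_gguf merges to 16 bit, runs llama.cpp's convert script, then quantizes;
# the RuntimeError above is raised when that llama.cpp step fails.
model.save_pretrained_gguf("model_gguf_q8_0", tokenizer, quantization_method = "q8_0")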


llama.cpp issue:

python llama.cpp/convert.py model_merged_16bit \
  --outfile model_gguf_q8_0-unsloth.Q8_0.gguf --vocab-type bpe \
  --outtype q8_0 --concurrency 1

Use --vocab-type bpe; --outtype supports f32, f16, and q8_0.

cnjack commented 6 months ago

Facing the same issue.

weedge commented 6 months ago

@cnjack

!python llama.cpp/convert.py model_merged_16bit \
  --outfile model_gguf-unsloth.f16.gguf --vocab-type bpe \
  --outtype f16 --concurrency 1

!./llama.cpp/quantize model_gguf-unsloth.f16.gguf model_gguf_q4_k_m-unsloth.Q4_k_m.gguf q4_k_m

!./llama.cpp/main -ngl 33 -c 0 -e \
  -p '<|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
  -r '<|eot_id|>' \
  -m model_gguf_q4_k_m-unsloth.Q4_k_m.gguf \
  && echo "The capital of France is Paris."

This works. You can see this issue comment: https://github.com/ggerganov/llama.cpp/pull/6745#issuecomment-2066528949

weedge commented 6 months ago

For more detail, see: https://github.com/weedge/doraemon-nb/blob/main/Alpaca_%2B_Llama_3_8b_full_example.ipynb

jack-michaud commented 6 months ago

Confirmed: this patch (51e4aa1) allowed me to quantize the Llama 3 model (switching the vocab type to bpe in the call to llama.cpp).

danielhanchen commented 6 months ago

@jack-michaud Oh cool - I'll edit Unsloth to change it to bpe!

jack-michaud commented 6 months ago

Thanks @danielhanchen! Please note that I did not test this change on Llama 2.

danielhanchen commented 6 months ago

@weedge @cnjack @jack-michaud Fixed it finally! Apologies for the issue. On a local machine, please uninstall then reinstall, i.e.

pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git

For Colab and Kaggle, you'll have to restart the runtime.
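After reinstalling (or restarting the runtime), rerun the GGUF export. A minimal sketch, assuming the fine-tuned model and tokenizer are still loaded; the output directory and quantization method below are illustrative:

# With the bpe vocab fix, the export should complete without the
# "Unsloth: Quantization failed" RuntimeError.
model.save_pretrained_gguf("model_gguf_q4_k_m", tokenizer, quantization_method = "q4_k_m")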

jack-michaud commented 6 months ago

@danielhanchen Confirmed, quantizing Llama 3 on the latest push of Unsloth works. Thank you!