I'm facing the same issue.
@cnjack
# Convert the merged 16-bit model to an f16 GGUF (note --vocab-type bpe for Llama 3)
!python llama.cpp/convert.py model_merged_16bit \
  --outfile model_gguf-unsloth.f16.gguf --vocab-type bpe \
  --outtype f16 --concurrency 1

# Quantize the f16 GGUF down to Q4_K_M
!./llama.cpp/quantize model_gguf-unsloth.f16.gguf model_gguf_q4_k_m-unsloth.Q4_k_m.gguf q4_k_m

# Run the quantized model with a Llama 3 chat-formatted prompt
!./llama.cpp/main -ngl 33 -c 0 -e \
  -p '<|start_header_id|>user<|end_header_id|>\n\nWhat is the capital of France?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' \
  -r '<|eot_id|>' \
  -m model_gguf_q4_k_m-unsloth.Q4_k_m.gguf \
  && echo "The capital of France is Paris."
This works; see this issue comment: https://github.com/ggerganov/llama.cpp/pull/6745#issuecomment-2066528949
For more detail, see this notebook: https://github.com/weedge/doraemon-nb/blob/main/Alpaca_%2B_Llama_3_8b_full_example.ipynb
Confirmed, this patch 51e4aa1 allowed me to quantize the Llama 3 model (after switching the vocab-type to bpe in the call to llama.cpp).
@jack-michaud Oh cool - I'll edit Unsloth to change it to bpe!
Thanks @danielhanchen! Please note that I did not test this change on Llama 2.
@weedge @cnjack @jack-michaud Fixed it finally! Apologies for the issue. On a local machine, please uninstall then reinstall, i.e.
pip uninstall unsloth -y
pip install --upgrade --force-reinstall --no-cache-dir git+https://github.com/unslothai/unsloth.git
For Colab and Kaggle, you'll have to restart the runtime.
@danielhanchen Confirmed, quantizing Llama 3 on the latest push of Unsloth works. Thank you!
/usr/local/lib/python3.10/dist-packages/unsloth/save.py in save_to_gguf(model_type, model_directory, quantization_method, first_conversion, _run_installer)
    955             )
    956         else:
--> 957             raise RuntimeError(
    958                 f"Unsloth: Quantization failed for {final_location}\n"\
    959                 "You might have to compile llama.cpp yourself, then run this again.\n"\

RuntimeError: Unsloth: Quantization failed for ./model_gguf_q8_0-unsloth.Q8_0.gguf
You might have to compile llama.cpp yourself, then run this again.
You do not need to close this Python program. Run the following commands in a new terminal:
You must run this in the same folder as you're saving your model.
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && make clean && LLAMA_CUDA=1 make all -j
Once that's done, redo the quantization.
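If you want to finish the conversion by hand after the rebuild, here is a minimal sketch (not from this thread): the file names are assumptions that mirror the commands earlier in the thread, and it assumes an f16 GGUF from a prior convert.py run is sitting in the current folder.

# After building llama.cpp as the error message instructs, re-quantize manually
./llama.cpp/quantize model_gguf-unsloth.f16.gguf model_gguf_q8_0-unsloth.Q8_0.gguf q8_0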
This is a llama.cpp issue: pass --vocab-type bpe when converting, and note that convert.py's --outtype supports f32, f16, and q8_0.
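For example, a minimal sketch (paths are assumptions mirroring the commands earlier in this thread) of converting straight to q8_0, which avoids the separate quantize step for that output type:

# Convert the merged model directly to a q8_0 GGUF using the BPE vocab
python llama.cpp/convert.py model_merged_16bit \
  --outfile model_gguf-unsloth.q8_0.gguf \
  --vocab-type bpe --outtype q8_0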