unslothai / unsloth

Finetune Llama 3.1, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Issue with GGUF Conversion After Finetuning with Unsloth #695

Open mf-skjung opened 3 weeks ago

mf-skjung commented 3 weeks ago

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
==((====))==  Unsloth: Fast Llama patching release 2024.6
   \\   /|    GPU: NVIDIA A100 80GB PCIe MIG 7g.80gb. Max memory: 79.151 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 8.0. CUDA Toolkit = 12.1.
\        /    Bfloat16 = TRUE. Xformers = 0.0.26.post1. FA = True.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: unsloth/llama-3-8b-Instruct-bnb-4bit can only handle sequence lengths of at most 8192. But with kaiokendev's RoPE scaling of 4.0, it can be magically extended to 32768!
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Unsloth 2024.6 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.

My development environment is shown above. Finetuning Llama with the Unsloth library works well, and the trained model responds correctly when loaded for inference with the transformers TextStreamer.
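For reference, this is roughly how I run inference (a minimal sketch, not my exact script; the checkpoint path is a placeholder for my finetuned model):

from transformers import TextStreamer
from unsloth import FastLanguageModel

# Load the finetuned checkpoint (path is a placeholder).
model, tokenizer = FastLanguageModel.from_pretrained(
    "outputs/checkpoint", max_seq_length=8192, load_in_4bit=True)
FastLanguageModel.for_inference(model)  # switch to Unsloth's inference mode

inputs = tokenizer(["What is your name?"], return_tensors="pt").to("cuda")
streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer=streamer, max_new_tokens=64)

Run this way, the finetuned model streams a sensible answer.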

However, the GGUF conversion using the llama.cpp embedded in Unsloth completes without any error messages, yet the converted model does not work properly when run in ollama.
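For context, the conversion I run is roughly the following (a sketch; the output directory and quantization method are simply what I happened to use):

# Export the finetuned model to GGUF via Unsloth's bundled llama.cpp.
# (Directory name and quantization method are placeholders.)
model.save_pretrained_gguf("ollama_model", tokenizer, quantization_method="q8_0")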

(screenshot 20240627_023723: ollama output after conversion)

Notably, loading and converting a pretrained model without any training works fine. The issue arises only after performing even a single step of training with Unsloth.
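To make the failing case easy to reproduce, even a single-step run like the sketch below is enough to trigger it (the dataset, text field, and hyperparameters are illustrative placeholders, not my exact configuration):

from datasets import load_dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "unsloth/llama-3-8b-Instruct-bnb-4bit", max_seq_length=8192, load_in_4bit=True)
model = FastLanguageModel.get_peft_model(
    model, r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"])  # attach LoRA adapters

dataset = load_dataset("yahma/alpaca-cleaned", split="train[:100]")  # placeholder dataset

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="output",  # placeholder text field
    max_seq_length=8192,
    args=TrainingArguments(output_dir="outputs", max_steps=1,
                           per_device_train_batch_size=1),
)
trainer.train()  # a single step is enough; the GGUF exported afterwards is broken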

Is there a solution for this problem?

Thank you.

danielhanchen commented 3 weeks ago

Many apologies for the delay! My bro and I just relocated to SF, so it took a while to get back to you!

Interesting - it's most likely the chat template. You could try our Ollama chat-template notebook here: https://colab.research.google.com/drive/1WZDi7APtQ9VsvOrQSSC5DDtxq159j8iZ?usp=sharing
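Roughly what that notebook sets up (a sketch, not the notebook's exact code; "llama-3" is an assumption for a Llama 3 Instruct base):

from unsloth.chat_templates import get_chat_template

# Attach the chat template the model was finetuned with
# ("llama-3" here is assumed; pick the template matching your base model).
tokenizer = get_chat_template(tokenizer, chat_template="llama-3")

messages = [{"role": "user", "content": "What is your name?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

If the GGUF is then queried with a prompt built this way (or the matching Ollama Modelfile template), mismatched special tokens are usually ruled out.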

mf-skjung commented 3 weeks ago

Thank you for the update and the suggestion regarding the chat template. I appreciate your effort in providing a solution despite your recent relocation.

However, I've discovered that the issue persists even when not using Ollama. I've tested the converted GGUF file directly with llama.cpp using the following command:

./llama-simple -m ../ollama_model/llama-38-Q8_0.gguf -p "What is Your Name?"


The model only repeats exclamation marks after the input prompt instead of generating a meaningful response. This suggests the problem lies in the GGUF file generation itself rather than in Ollama's chat templates.
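If it helps, I can also dump the converted file's metadata to compare against a known-good GGUF. A sketch using the gguf Python package (the path is my local output file):

from gguf import GGUFReader  # pip install gguf

# List the metadata keys stored in the converted file
# (tokenizer-related keys are the ones I would compare first).
reader = GGUFReader("../ollama_model/llama-38-Q8_0.gguf")
for field in reader.fields.values():
    print(field.name)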

I'm wondering if there's a way to check the version information of all dependent libraries in an environment where the model is working correctly. This could help identify any potential version mismatches or compatibility issues between my setup and a working configuration.
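On my side I collect version information like this (a minimal sketch; the package list is just my guess at what matters for the GGUF export, not exhaustive):

from importlib.metadata import version

# Print versions of the packages most likely to affect training and GGUF export.
for pkg in ("unsloth", "torch", "transformers", "xformers",
            "peft", "trl", "bitsandbytes", "accelerate"):
    try:
        print(pkg, version(pkg))
    except Exception:
        print(pkg, "not installed")

Comparing this output from my environment against a working one should make any mismatch obvious.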

I'm happy to provide any additional information or logs that might be helpful in resolving this issue. Thank you for your continued support.