unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Performance of fine-tuned model imported into Ollama from adapters differs from Unsloth inference #1073

Open danib08 opened 3 weeks ago

danib08 commented 3 weeks ago

Since saving models to GGUF format is currently broken (it causes a severe drop in performance), I am importing my fine-tuned model into Ollama from its LoRA adapters. However, the inference results I'm getting are not as good as the ones I get through Unsloth inference (although they are better than the GGUF results).

I'm training the 4-bit version of Gemma-2-2b and saving the adapters. Then I'm importing them into Ollama through this Modelfile (template not shown):

FROM gemma2:2b-text-q8_0
ADAPTER path/to/lora
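
For context, the adapters referenced by path/to/lora come straight out of the Unsloth training run, roughly like this (a minimal sketch assuming the standard Unsloth notebook flow; the base model name and save path are assumptions, since the training code isn't shown in the thread):

from unsloth import FastLanguageModel

# Load the 4-bit Gemma-2-2b base model used for fine-tuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-2b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)
# ... attach LoRA with FastLanguageModel.get_peft_model(...) and train ...

# Save only the LoRA adapters (adapter_config.json plus adapter weights);
# this directory is what the Modelfile's ADAPTER line points at.
model.save_pretrained("path/to/lora")
tokenizer.save_pretrained("path/to/lora")

The Ollama model itself is then built from the Modelfile with ollama create <model-name> -f Modelfile.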

I've tried setting the base model in the Modelfile to the q4_0 and fp16 versions, but I'm getting the best results with q8_0. I don't know if this is because the base Gemma-2-2b model Unsloth uses is different from the Ollama one, or if I'm simply using the wrong version of the base model.

Any help is appreciated since I really do need to use the model in Ollama with the same performance!

moreHS commented 3 weeks ago

I have a similar situation. When I fine-tune gemma2-2b with Unsloth, the inference results differ greatly depending on whether I load the LoRA adapter or the merged model: loading the adapter gives good results, while the merged model gives terrible results.

For training, I turned off the load_in_4bit option and used the gemma2-2b model shared by Unsloth, and I merged using the 16-bit merge method.
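
For clarity, that setup corresponds roughly to the following (a minimal sketch assuming Unsloth's documented APIs; the exact training code isn't shown):

from unsloth import FastLanguageModel

# Load the full-precision Gemma-2-2b shared by Unsloth, with 4-bit loading turned off.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/gemma-2-2b",
    max_seq_length=2048,
    load_in_4bit=False,
)
# ... attach LoRA adapters and train as usual ...

# Merge the LoRA weights into the base model and save the result in 16-bit.
model.save_pretrained_merged("merged_model", tokenizer, save_method="merged_16bit")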

danib08 commented 3 weeks ago

Yeah, saving the merged model from the non-4-bit version and importing it into Ollama also gives good results, but they are still not as good as when I run the model with Unsloth inference. I'm wondering if it's an Ollama thing.
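
For reference, one way such a merged-model import can be done (an assumption on my part, since the exact commands aren't shown in the thread) is to point the Modelfile at the merged weights instead of a stock base model:

FROM path/to/merged_model

and rebuild with ollama create, assuming an Ollama version that can import Safetensors weights directly; otherwise the merged model has to be converted to GGUF first, which is the path reported as broken above.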

danielhanchen commented 3 weeks ago

So sorry for the delay! Yes, in general merging to 16-bit might have some performance hit, but LoRA adapters inside of Ollama seem to do fine. I was planning to fix this, possibly in the next few days.

danib08 commented 3 weeks ago

Thank you so much for answering, and no worries!

That clarifies a lot. And yes, LoRA adapters work fine inside Ollama, but not as well as loading them in Unsloth. I also loaded them through Hugging Face, just like your notebooks show, and I get the same inference results as with Unsloth. That's why I'm confused and even think Ollama is the issue.
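
For reference, the Hugging Face loading path mentioned here is roughly what the Unsloth notebooks show (a minimal sketch; path/to/lora is the adapter directory from above, and the prompt and generation parameters are placeholders):

from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

# Load the base model with the saved LoRA adapters applied on top.
model = AutoPeftModelForCausalLM.from_pretrained("path/to/lora", load_in_4bit=True)
tokenizer = AutoTokenizer.from_pretrained("path/to/lora")

inputs = tokenizer("Test prompt", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))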