unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

GGUF conversion causes spelling mistakes #211

Open MChamith opened 7 months ago

MChamith commented 7 months ago

So I finetuned a model using a custom dataset. The output should be in JSON format. All the keys are the same for each output, i.e. the structure of the response JSON is the same while the values need to be extracted from the user prompt. I finetuned a model like a month ago, converted it to GGUF, and the model's accuracy was very good. I did the same today with the same dataset and everything else the same, but the results are drastically different. For example, the JSON key "Classification" is produced with spelling errors such as "Classificatiion" or "Classificacion".

Followed the Mistral-7b notebook. Inference using Unsloth doesn't produce these errors; the results are acceptable. But when converted to GGUF, even at f16, the model produces gibberish.
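
The export step was roughly the following (the model path and sequence length are placeholders, following the notebook):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model",   # placeholder: path to the finetuned LoRA
    max_seq_length = 2048,       # example value
    dtype = None,
    load_in_4bit = True,
)

# Export to GGUF; "f16" keeps float16 precision with no further quantization.
model.save_pretrained_gguf("gguf_model", tokenizer, quantization_method = "f16")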

danielhanchen commented 7 months ago

@MChamith Oh no that's not good - so essentially our inference is fine, but the GGUF conversion breaks? That means it can be one of:

  1. GGUF is broken somewhere
  2. Our merging to 16bit is broken
  3. I'm not using GGUF correctly
MChamith commented 7 months ago

So inferring using the code below gives me 100% accuracy. My target is generating JSONs, and each key value is generated perfectly through the code below.

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora model",           # the finetuned LoRA
    max_seq_length = max_seq_length,     # same settings as used for training
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model)
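
Generation then looks roughly like this (the prompt is a placeholder; the real prompts follow the dataset's format):

inputs = tokenizer(["<prompt from the dataset>"], return_tensors = "pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 256)
print(tokenizer.batch_decode(outputs))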

When converted to GGUF, the results are horrible: some keys are missing, wrong, or spelled incorrectly.

I'm sorry if I sound naive, since I'm a beginner at this. But are we using the same merging method in both cases? The first case is inferring using the method above, and the other is the merge done before converting to GGUF.

I also tried manually installing an older release (b2143) of llama.cpp, to stop unsloth from installing the newer version, to check if the issue is with the latest llama.cpp. I selected this specific release since it was around the time I got better results last time. But the results were still the same.

danielhanchen commented 7 months ago

@MChamith Hmmm so even an older llama.cpp version doesn't work? So essentially it probably is something I did that broke saving to GGUF?

danielhanchen commented 7 months ago

@MChamith Actually you're not crazy! I tried even normal merging and it seems incorrect!! I'm working on a fix ASAP!

danielhanchen commented 7 months ago

Ok I have some prelim results - I tried it multiple times on different prompts - it seems like upcasting to float16 from int4 bitsandbytes can cause small rounding issues, which causes incorrect results. I also tried this without Unsloth (i.e. pure HF), and the problem persists.

Previously I made Unsloth code upcast to float32 internally, which probably made it work very well. About a month ago, I made a change to exactly match HF, and this is probably what's causing the issues. I may add an option for supreme accuracy, but this will require a rewrite of the kernels. Another approach is to bring the float32 kernels back.
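
Here's a tiny pure-PyTorch toy (not our actual kernels, just to illustrate how the merge dtype changes the result):

import torch

torch.manual_seed(0)
W = torch.randn(4096, 4096)          # stand-in for a dequantized base weight
A = torch.randn(4096, 16) * 0.01     # toy LoRA A
B = torch.randn(16, 4096) * 0.01     # toy LoRA B

merged_fp32 = (W + A @ B).to(torch.float16)    # accumulate in float32, cast once at the end
merged_fp16 = W.half() + A.half() @ B.half()   # downcast first, then merge in float16

print((merged_fp32 - merged_fp16).abs().max()) # small but nonzero rounding gap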

hosteren commented 7 months ago

It seems to have been fixed along with Gemma last week. I use Unsloth through Llama Factory and had no problems if I did not convert the models to GGUF; after quantisation they pretty much responded in brainfuck. As you mentioned, everything was dandy until about a month ago, and my tests confirm that. This probably isn't that helpful anymore, but thank you for fixing whatever that was so quickly. Your library and effort are amazing.

danielhanchen commented 7 months ago

Oh great! :) Thanks @hosteren for the kind words and appreciate it :)

VirajKanse commented 5 months ago

Hey @danielhanchen, I am facing the same issue for Mistral/Llama 3: after converting to GGUF the accuracy drops significantly, plus there are spelling mistakes. If you have a quick fix, that would be really helpful.

erwe324 commented 5 months ago

@VirajKanse I believe that GGUF conversion is handled by the llama.cpp project. Unsloth simply calls it.

MChamith commented 5 months ago

As @danielhanchen mentioned earlier, the issue could be in the upcasting and merging of the LoRA with the base model, so it might not be an issue with the GGUF conversion itself. For me the issues still persist: accuracy is very low, with many spelling mistakes, when the model is merged and converted to GGUF. However, inferring through FastLanguageModel I get 100% accuracy.

danielhanchen commented 5 months ago

Hmmm I do know llama.cpp has recently had some issues with GGUF itself - unsure if that is related. I'll reinvestigate.

VirajKanse commented 5 months ago

I noticed that when I use the untouched Mistral model (non-quantized) as the base, finetune it, and then convert it to GGUF, there are no spelling mistakes. (I will re-confirm this.)

hosteren commented 5 months ago

I have had no problems, other than self-inflicted ones, since the fix back in February. I use the Ollama Quantize docker image to quantise models, though.

danielhanchen commented 5 months ago

Hmmm I was planning to do a large investigation today and tomorrow for Llama 3 specifically - I will keep you all updated!

MChamith commented 5 months ago

Hey, if it helps, this is the training set I trained on, cmarvolo/auto_fl, and the fine-tuned LoRA, cmarvolo/mistral-7b-fed-auto-lora, both hosted on Hugging Face.

As I said, the following

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "cmarvolo/mistral-7b-fed-auto-lora",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model)

gives me 99.8% accuracy on a dataset created similarly to the training dataset above. However, accuracy after GGUF conversion is pretty bad on the same dataset.

VirajKanse commented 5 months ago

@MChamith Could you try using an untouched Mistral model as the base (e.g., PIXMELT/Mistral-7B-Instruct-v0.2) for fine-tuning and converting to GGUF? I've noticed this yields better results compared to using unsloth/mistral-7b-instruct-v0.2-bnb-4bit (I just want to re-confirm it).
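
That is, load the full-precision base for finetuning, roughly like this (the sequence length is just an example):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "PIXMELT/Mistral-7B-Instruct-v0.2",  # untouched full-precision base
    max_seq_length = 2048,     # example value
    dtype = None,
    load_in_4bit = False,      # skip the pre-quantized bnb-4bit base
)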

danielhanchen commented 5 months ago

Maybe things are fixed now - I cannot confirm - it seems like using the GPU to save GGUF files breaks things, so now Unsloth defaults to using the CPU only.
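
If anyone wants to help isolate where it breaks, one option is to merge to 16-bit on its own and run llama.cpp's converter on the result yourself, then compare outputs - a rough sketch (the output path is a placeholder):

# Merge the LoRA into 16-bit weights without going through the GGUF path.
model.save_pretrained_merged("merged_16bit_model", tokenizer, save_method = "merged_16bit")
# Then run llama.cpp's convert script on "merged_16bit_model" and compare the outputs.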

BenjaminBruenau commented 2 months ago

Hello, I had a similar problem where inference after finetuning mistralai/Mistral-7B-v0.3 (with huggingface transformers & peft) worked fine and produced acceptable results, but it completely failed after converting to GGUF via unsloth. The model was only repeating itself / producing gibberish.

I am not an expert in this domain, so some of my assumptions might be wrong. My error might be something else entirely (e.g. bfloat16 on the Turing architecture), but I thought I'd share this anyway.

I finetuned my model with bnb_4bit_compute_dtype=torch.bfloat16, as my machine with an RTX Titan 24GB seemed to support it. (But I also read that bfloat16 is only supported on the 30-series / Ampere architecture and newer, so at this point I am not really sure.)

import torch
torch.cuda.is_bf16_supported() # returns True
torch.cuda.get_device_capability() # returns (7, 5)

The code I used to convert to gguf:

from unsloth import FastLanguageModel
import torch

model, tokenizer = FastLanguageModel.from_pretrained("path_to_lora_checkpoint", load_in_4bit = True, dtype=torch.bfloat16)
model.save_pretrained_gguf("path_to_save_gguf", tokenizer, quantization_method = "q4_k_m")

Unsloth will change the dtype to float16 under the hood though, as torch.cuda.get_device_capability() is not greater than or equal to 8.0. It will also log: "Device does not support bfloat16. Will change to float16." (via the SUPPORTS_BFLOAT16 check here: https://github.com/unslothai/unsloth/blob/a7bfbe7927ea75f959e1d7c84e7bf50945d405ff/unsloth/models/_utils.py#L141)
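
Paraphrasing that check (not the exact source), it keys off the compute capability rather than torch.cuda.is_bf16_supported():

import torch

# Rough paraphrase of the linked SUPPORTS_BFLOAT16 logic: only compute
# capability >= 8.0 (Ampere and newer) counts as bfloat16-capable.
major, minor = torch.cuda.get_device_capability()
SUPPORTS_BFLOAT16 = major >= 8   # (7, 5) on Turing, so this is False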

I worked around this by using huggingface and llama.cpp directly (I assume unsloth does something similar under the hood):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
from peft import PeftModel

# device_map={"": "cpu"} when loading models, so as not to exceed my GPU memory when merging and unloading

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.3", torch_dtype=torch.bfloat16, device_map={"": "cpu"}) 
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.3")

model = PeftModel.from_pretrained(base_model, "path_to_lora_checkpoint", device_map={"": "cpu"})

model = model.merge_and_unload(safe_merge=True, progressbar=True)

state_dict = model.state_dict()

model.save_pretrained("unload", state_dict=state_dict)
tokenizer.save_pretrained("unload")

And then run

python llama.cpp/convert_hf_to_gguf.py unload --outfile <path_to_generated_gguf> --outtype bf16

./llama.cpp/llama-quantize <path_to_generated_gguf> <path_to_quantized_gguf> Q4_K_M

The quantized GGUF now produces acceptable results with llama.cpp.

danielhanchen commented 2 months ago

@BenjaminBruenau Thanks for the detailed comment! Interesting on the precision issue - I'm unsure if the Titan RTX has bf16 support, but ye, a hack was to simply check if it's >= 8.0 - but if it works now, then that's great!

thegenerativegeneration commented 1 month ago

Has this been solved? I also have problems with the model giving quite bad output after conversion to GGUF (repetitive behavior).

BenjaminBruenau commented 1 month ago

I am afraid not really. While my approach did make things slightly better, a significant quality loss and repetitive behaviour still occurred compared to the merged, quantized huggingface version of the model.

This is most likely not an issue with unsloth itself, but rather something along the way of the conversion process from merged LoRA + huggingface models into the .gguf format. I have also tried first converting the LoRA adapter into .gguf format and then merging it into the base model (also in .gguf format) via llama.cpp, instead of just converting the merged model, but to no avail; the issue persists.

I have, however, not verified whether the repetitive behaviour also occurs in the base model after conversion. So I am not sure whether the quality loss is a general issue of the conversion process or only occurs when LoRA adapters are thrown into the mix.