unslothai / unsloth

Finetune Llama 3.2, Mistral, Phi, Qwen 2.5 & Gemma LLMs 2-5x faster with 80% less memory
https://unsloth.ai
Apache License 2.0

Inference with a BF16 model, finetuned and merged with Unsloth, using load_in_4bit=True results in very poor responses #800

Open rohhro opened 4 months ago

rohhro commented 4 months ago

It's been like this for a while.

Steps to Reproduce:

  1. Finetune Llama 3 (or 3.1, same behavior) 8B using load_in_4bit=True (a rough sketch of this finetune + merge flow follows these steps).
  2. Merge to a BF16 model after finetuning.
  3. Load the merged BF16 model for inference using the code snippet below:

    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name=MODEL,
        max_seq_length=3000,
        dtype=torch.bfloat16,
        load_in_4bit=True,
    )
    FastLanguageModel.for_inference(model)

Observe the results: the responses are of very poor quality compared to load_in_4bit=False. The rest of the code is identical; changing only load_in_4bit produces significantly different results.
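
For reference, steps 1 and 2 look roughly like this on my side (a simplified sketch, not my exact setup: the model name, dataset, and hyperparameters below are placeholders):

    import torch
    from datasets import Dataset
    from transformers import TrainingArguments
    from trl import SFTTrainer
    from unsloth import FastLanguageModel

    # Step 1: load the 4-bit base model and attach LoRA adapters.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-3-8b-bnb-4bit",  # assumed base; any Llama 3 / 3.1 8B works
        max_seq_length=3000,
        dtype=None,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )

    # Placeholder dataset: substitute the real finetuning data here.
    train_dataset = Dataset.from_dict({"text": ["### Instruction: ...\n### Response: ..."]})

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=train_dataset,
        dataset_text_field="text",
        max_seq_length=3000,
        args=TrainingArguments(
            output_dir="outputs",
            per_device_train_batch_size=2,
            max_steps=60,
            fp16=not torch.cuda.is_bf16_supported(),
            bf16=torch.cuda.is_bf16_supported(),
        ),
    )
    trainer.train()

    # Step 2: merge the LoRA adapters and save a 16-bit model.
    model.save_pretrained_merged("merged_bf16_model", tokenizer, save_method="merged_16bit")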

DaddyCodesAlot commented 4 months ago

Llama 3 (and 3.1) do not do well when you reduce their precision down to 4-bit. Maybe consider using Llama 2 13B or Mistral 7B v0.2; those models suffer less from reduced precision.

rohhro commented 4 months ago

> Llama 3 (and 3.1) do not do well when you reduce their precision down to 4-bit. Maybe consider using Llama 2 13B or Mistral 7B v0.2; those models suffer less from reduced precision.

Thanks! It could be the case.

However, the base model used for finetuning is already bnb 4-bit (i.e. load_in_4bit=True), so it should have lost that accuracy in the first place already, right?

danielhanchen commented 4 months ago

Apologies on the delay - do you know if you can quantify the degradation? Like is it unusable?

rohhro commented 4 months ago

> Apologies on the delay - do you know if you can quantify the degradation? Like is it unusable?

No worries:)

When I load the finetuned model in 16-bit, it generates valid JSONs as expected. When I load it in 4-bit, it generates something that looks like JSON and includes some of my instructions from the prompts, but it is invalid (I have a script that checks the JSONs) and mixes the JSON schema from my training data with things the model makes up on its own.

Basically, when loading in BF16 everything is fine, but in 4-bit the output is just half-random.
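
The check itself is nothing fancy; a simplified sketch of it is below (my real script also validates the output against my schema):

    import json

    def is_valid_json(text: str) -> bool:
        """Return True if the model output parses as JSON."""
        try:
            json.loads(text)
            return True
        except json.JSONDecodeError:
            return False

    # The BF16 output parses; the 4-bit output is usually truncated or mixed and does not.
    print(is_valid_json('{"name": "example", "tags": ["a", "b"]}'))  # True
    print(is_valid_json('{"name": "example", "tags": ["a", '))       # False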

timothelaborie commented 4 months ago

One workaround, if you need the low VRAM usage of the 4-bit version, would be to avoid merging the adapter at all.
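
Roughly like this (a sketch assuming the adapters were saved with model.save_pretrained("lora_adapters") after training; the path is a placeholder):

    import torch
    from unsloth import FastLanguageModel

    # Load the 4-bit base and attach the *unmerged* LoRA adapters for inference.
    # "lora_adapters" is the directory produced by model.save_pretrained(...) and
    # should contain adapter_config.json plus the adapter weights.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="lora_adapters",
        max_seq_length=3000,
        dtype=torch.bfloat16,
        load_in_4bit=True,   # keeps VRAM low; the adapters themselves stay in 16-bit
    )
    FastLanguageModel.for_inference(model)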

danielhanchen commented 4 months ago

Yes, not merging is a good workaround - a trick to maintain good accuracy is to leave the LoRA adapters as-is and simply edit the model config to load the non-quantized model.
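
One way to read this trick (a sketch under the assumption that "the model config" means the saved adapter's adapter_config.json; paths and model names are placeholders):

    import json
    from pathlib import Path

    # Repoint the saved adapter at a non-quantized base model, so loading the
    # adapter pulls in the 16-bit weights instead of the bnb 4-bit checkpoint.
    cfg_path = Path("lora_adapters") / "adapter_config.json"   # placeholder adapter dir
    cfg = json.loads(cfg_path.read_text())

    print("before:", cfg["base_model_name_or_path"])            # e.g. a *-bnb-4bit repo
    cfg["base_model_name_or_path"] = "unsloth/llama-3-8b"       # assumed non-quantized base
    cfg_path.write_text(json.dumps(cfg, indent=2))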

rohhro commented 4 months ago

> Yes, not merging is a good workaround - a trick to maintain good accuracy is to leave the LoRA adapters as-is and simply edit the model config to load the non-quantized model.

Thanks Daniel! Yes, most of the time I just load the merged BF16 model, which is fine. The problem is that some models are too big to load in 16-bit, regardless of whether I merge or load the LoRA adapter + base model; that's when I want to load in 4-bit.

danielhanchen commented 4 months ago

Hmm, you could just load the 4-bit model without merging, as a compromise.

rohhro commented 3 months ago

> Hmm, you could just load the 4-bit model without merging, as a compromise.

But wouldn't that be the same as loading the 16-bit model in 4-bit, inference-quality-wise? Then we're back to the loop described in this post.

danielhanchen commented 3 months ago

I'm working on merging directly in 16-bit, so hopefully this will alleviate the issue!