rohhro opened this issue 4 months ago
Llama 3 (and 3.1) do not do well when you reduce them down to 4-bit precision. Maybe consider using Llama 2 13B or Mistral 7B v0.2; those models suffer less from the reduced precision.
Thanks! It could be the case.
However, the base model used for finetuning is already bnb 4-bit (load_in_4bit=True), so it should have lost that accuracy in the first place, right?
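For context, a minimal sketch of the finetuning-time load being referred to, assuming a 4-bit Unsloth base checkpoint (the model name is an example, not necessarily the one used here):

```python
from unsloth import FastLanguageModel

# The base model is loaded as a bnb 4-bit quant from the start of finetuning.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",  # example 4-bit base checkpoint
    max_seq_length=3000,
    load_in_4bit=True,
)
```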
Apologies on the delay - do you know if you can quantify the degradation? Like is it unusable?
No worries:)
When I load the finetuned model in 16-bit, it generates valid JSON as expected; when I load it in 4-bit, it generates something that looks like JSON and includes some of my instructions from the prompts, but it is invalid (I have a script that checks the JSON) and mixes schemas, partly from my training data and partly things the model comes up with by itself.
Basically, loading in BF16 everything is fine, but in 4-bit the output is just half-random.
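The check mentioned above is just a validity test; a hypothetical minimal version might look like this:

```python
import json

def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False
```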
One workaround, if you need the low VRAM usage of the 4-bit version, would be to avoid merging the adapter.
Yes, not merging is a good workaround - a trick to maintain good accuracy is to leave the LoRA adapters as is and simply edit the model config to load the non-quantized base model.
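A minimal sketch of that trick, assuming the adapter was saved to a local lora_model/ folder and the non-quantized base is meta-llama/Meta-Llama-3-8B-Instruct (both paths are placeholders): load the full-precision base with Transformers and attach the unmerged LoRA adapter with PEFT.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "meta-llama/Meta-Llama-3-8B-Instruct"  # non-quantized base (placeholder)

# Load the full-precision base model in BF16.
base = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(BASE)

# Attach the LoRA adapter without merging it into the base weights.
model = PeftModel.from_pretrained(base, "lora_model")
```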
Thanks Daniel! Yeah, most of the time I just load the merged BF16 model, which is fine. The problem is that some models are too big to load in 16-bit, whether merged or loaded as LoRA adapter + base model; that's when I want to use load_in_4bit.
Hmm, you could just load the 4-bit model without merging, as a compromise.
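A sketch of that compromise, assuming the finetune was saved as a LoRA-only folder (e.g. via model.save_pretrained("lora_model")), which Unsloth can load back with the base in 4-bit:

```python
from unsloth import FastLanguageModel

# Loads the 4-bit base and attaches the saved (unmerged) LoRA adapter.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="lora_model",   # folder containing only the adapter
    max_seq_length=3000,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
```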
But wouldn't that be the same as loading the 16-bit model in 4-bit (inference-quality-wise)? Then it's back to the loop described in this post.
I'm working on merging directly with 16bit, so hopefully this'll alleviate the issue!
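For anyone landing here later, the Unsloth notebooks show a merged 16-bit save along these lines; treat the exact call as an assumption for your installed version:

```python
# Merge the LoRA adapter into the 16-bit base weights and save the result.
model.save_pretrained_merged(
    "merged_16bit_model",        # output folder (placeholder)
    tokenizer,
    save_method="merged_16bit",
)
```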
It's been like this for a while.
Steps to Reproduce:
Load the merged BF16 model for inference using the code snippet below:

```python
import torch
from unsloth import FastLanguageModel

# MODEL points at the merged BF16 checkpoint.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL,
    max_seq_length=3000,
    dtype=torch.bfloat16,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
```
Observe the results: The responses are of very poor quality compared to load_in_4bit=False. The rest of the code remains the same; only changing load_in_4bit shows significantly different results.
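For comparison, the only change in the good-quality run is the flag; everything else stays identical (MODEL is the same merged BF16 checkpoint):

```python
import torch
from unsloth import FastLanguageModel

# Same call without 4-bit quantization: output quality is as expected.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=MODEL,
    max_seq_length=3000,
    dtype=torch.bfloat16,
    load_in_4bit=False,
)
FastLanguageModel.for_inference(model)
```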