OHHH I forgot to say get_chat_template is broken for Gemma :( What you're looking for is "gemma_chatml" instead of "chatml", and it'll auto add <|im_start|> and <|im_end|>. You also do not need to add any special tokens since I handle it internally!
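For example (a sketch; the exact keyword arguments may vary by unsloth version):

```python
# Hedged sketch: switch the tokenizer to the Gemma-specific ChatML template.
from unsloth.chat_templates import get_chat_template

tokenizer = get_chat_template(
    tokenizer,
    chat_template = "gemma_chatml",  # instead of "chatml"
)
```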
My findings so far:
- lm_head and embed_tokens have different data pointers (is that okay?). But the tensors are equal.
- lm_head and embed_tokens in the merged model are equal to lm_head and embed_tokens in the PEFT model.
So the problem is not the embedding weights, but something else.
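Roughly how I checked it (a sketch; the model / model2 names follow the rest of the thread, and the state-dict keys assume the standard HF Gemma layout):

```python
import torch

# model is the PEFT model, model2 the merged one.
merged = model2.state_dict()
emb, head = merged["model.embed_tokens.weight"], merged["lm_head.weight"]

print(emb.data_ptr() == head.data_ptr())  # False: different data pointers
print(torch.equal(emb, head))             # True: the tensors are equal

# The same comparison against the corresponding tensors in the PEFT model's
# state dict also comes out equal, so the embedding weights match everywhere.
```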
In my setting it works fine, at least for the PEFT model. Just in case, both models were loaded in a single notebook, so the problem can't be the tokenizer (because model works great, but model2 is simply dead). Must be something with the merging.
I looked at the contents of the adapter_model.safetensors file and found that it contains 3x more embedding weights, which explains the size. But it leaves me somewhat confused since all 3 tensors are equal to each other. Like why save all three and not just the one from modules_to_save?
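A quick way to see this is to list the tensors in the adapter file (a sketch; the path is wherever the adapter was saved):

```python
from safetensors import safe_open

# Print every tensor stored in the adapter file with its shape and dtype.
with safe_open("adapter_model.safetensors", framework="pt") as f:
    for name in f.keys():
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.dtype)
```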
I first thought it was some quirk with Gemma's LoRA saving, but I checked a Mistral LoRA adapter file (it weighs 2.1 GB) and saw the same thing.
Might be linked to this: https://huggingface.co/google/gemma-2b/discussions/21
Although I'm confused why the PEFT model works so well even with the added token.
While debugging, I found that weights for the layernorm layers in the merged model differ from those in the PEFT one.
Weights of the PEFT model:
Weights of the merged model:
It's just the values from the PEFT model, but +1? Is there some trick to initializing RMS weights in Gemma?
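For context, Gemma's RMSNorm in transformers zero-initializes its weight and multiplies by (1 + weight) in the forward pass, which would explain a consistent +1 offset if the merge path writes out weights that already include that 1. A simplified sketch of that module:

```python
import torch
import torch.nn as nn

# Simplified version of Gemma's RMSNorm: the learned weight is stored as an
# offset from 1, so the forward pass multiplies by (1 + weight).
class GemmaRMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.zeros(dim))  # zeros, not ones

    def forward(self, x):
        norm = x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
        return norm * (1.0 + self.weight)
```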
To clarify, I looked at the value of the W variable here:
Alright, I added this stupid ass fix (in unsloth_save_model) and now everything works fine:
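In spirit, it just subtracts the 1 back out of the norm weights before they're written out (a sketch, not the exact patch; state_dict stands for whatever dict unsloth_save_model is about to save):

```python
# Undo the +1 baked into the RMSNorm weights, since HF's GemmaRMSNorm adds
# 1 back in its forward pass.
for name, W in state_dict.items():
    if name.endswith("norm.weight"):  # input_layernorm, post_attention_layernorm, model.norm
        state_dict[name] = W - 1.0
```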
Although the outputs are not exactly the same, that's way better than before:
Hope that helps 🫡
The fix didn't quite fix it. For longer inputs the merged model still diverges. I guess significant error accumulation is at play.
Sorry for all the messages.
I suspect the culprit is this line: https://github.com/unslothai/unsloth/blob/main/unsloth%2Fmodels%2Fgemma.py#L362
Although I understand the motivation, there are two problems with it:
The solution I see as best in terms of compatibility and memory/compute savings is a custom Triton layernorm kernel. Using the vanilla implementation might be good enough as well; it just needs a couple of tests, that's all.
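For illustration, a forward-only sketch of what such a kernel could look like, keeping Gemma's (1 + weight) convention instead of rewriting the stored weights (the names and launch details here are mine, not unsloth's):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def _gemma_rmsnorm_fwd(X, W, Y, stride, n_cols, eps, BLOCK_SIZE: tl.constexpr):
    # One program normalizes one row of X and writes the result to Y.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols

    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    w = tl.load(W + cols, mask=mask, other=0.0).to(tl.float32)

    var = tl.sum(x * x, axis=0) / n_cols
    inv_rms = 1.0 / tl.sqrt(var + eps)
    y = x * inv_rms * (1.0 + w)  # Gemma stores the weight as an offset from 1

    tl.store(Y + row * stride + cols, y, mask=mask)


def gemma_rmsnorm(x: torch.Tensor, weight: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Forward only; assumes the hidden dimension fits into a single block.
    x2d = x.contiguous().view(-1, x.shape[-1])
    out = torch.empty_like(x2d, dtype=torch.float32)
    BLOCK_SIZE = triton.next_power_of_2(x.shape[-1])
    _gemma_rmsnorm_fwd[(x2d.shape[0],)](
        x2d, weight, out, x2d.stride(0), x.shape[-1], eps, BLOCK_SIZE=BLOCK_SIZE,
    )
    return out.view(x.shape).to(x.dtype)
```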
@oKatanaaa Oh my - thanks so so much for all the debugging - extremely appreciate it!! I just woke up, so many apologies for missing the convo - I was gonna say it's ironic I was fixing Gemma bugs but didn't check Unsloth's own issues!! 😆
Great you found the +1 culprit - I actually totally forgot to minus 1 during merging - but if, according to your analysis, adding 1 then subtracting 1 reduces accuracy, I'll just copy paste the kernel and add 1 - I'll do that in minutes and push it in :)
On the saving modules - interesting - I have never interacted with saving modules since I normally only finetune the rest and leave the lm_head and embedding matrix alone. I shall investigate this later today!!
Again thanks so much for the help - extremely appreciate it! I'll @ you in the fix :)
Oh wait on the layernorms - do you unfreeze them to train on them?
Nope, didn't touch those during training. But I thought it was worth pointing out potential issues.
Ok I finally fixed it! I took your advice and rewrote the kernels and isolated it out. Hopefully GGUF saving works now (and merged 16bit)
Can confirm that merging in 16bit now works fine. No more degenerate outputs.
I guess we can attribute the difference in responses to rounding errors during the LoRA merge (I've seen it with other models as well); I'm good with that.
Thanks for the fix, well done!
Ye it's possible it was rounding and some other issues :)
I've trained Gemma 2B in 16bit with LoRA. With adapters loaded separately everything works just fine. But after merging the adapters, the model becomes literally unusable.
On the screenshot:
- model is the PEFT model with the adapters as is.
- model2 is the model with the adapters merged.
Here is the code used to load the models:
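Roughly (paths and hyperparameters below are placeholders, not the exact original code; resize_model_vocab is a parameter I patched into my local unsloth copy, see the note after the snippet):

```python
# Sketch of the loading code. model is the PEFT model with adapters attached;
# model2 is the merged 16-bit checkpoint loaded directly.
from unsloth import FastLanguageModel
from peft import PeftModel

base, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "google/gemma-2b",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
)
model = PeftModel.from_pretrained(base, "path/to/lora_adapter")

model2, _ = FastLanguageModel.from_pretrained(
    model_name = "path/to/merged_16bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = False,
)
```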
The model was trained with the ChatML format, hence the token-adding stuff. The resize_model_vocab parameter is a workaround I added to load a vocab of a different size.
Also, the saved adapters weigh 6 GB, is that alright? Note that the merged model is 5.7 GB. I believe adapters should be a hundred MB tops (maybe a GB with the saved vocab and lm_head), but presumably the whole model got saved.
Important note: during training, in the modules_to_save param I passed ["embed_tokens", "lm_head"] to train the new ChatML tokens. Although I am not sure how that plays with the fact that Gemma's embed_tokens weights are tied with lm_head (I believe?). Maybe that's actually the reason why merging fails? Like you have to pass in only embed_tokens, otherwise everything will break (just a hypothesis).
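For reference, the relevant part of the training config looked roughly like this (a sketch; the rank, alpha, and target_modules values are placeholders, only modules_to_save reflects what was actually passed):

```python
from peft import LoraConfig

lora_config = LoraConfig(
    r = 16,
    lora_alpha = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj"],
    # Fully train these modules so the new ChatML tokens get learned.
    modules_to_save = ["embed_tokens", "lm_head"],
)
```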
Dependencies: