almugabo opened 1 week ago
Actually, inspired by a current Kaggle competition, it would be a really good idea to add this soon.
Thanks @almugabo for creating the issue. I think this will be a bit of effort, quickly jotting down a couple of things I'm aware of that we'd need to support:
- `attn_scale`
- `mlp_scale`

For logit softcapping and sliding window attention, I suspect we can use FlexAttention APIs. See this blog post, where they give explicit examples of each.
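For reference, here is a minimal framework-free sketch of what the two mechanisms compute (plain Python with hypothetical helper names, not torchtune or FlexAttention code; with FlexAttention these would correspond to a `score_mod` and a `mask_mod` callable):

```python
import math

def softcap(score: float, cap: float) -> float:
    # Logit softcapping: squash an attention logit into (-cap, cap).
    # Near zero it is roughly the identity; large logits saturate at +/-cap.
    return cap * math.tanh(score / cap)

def sliding_window_allowed(q_idx: int, kv_idx: int, window: int) -> bool:
    # Causal sliding-window mask: a query position may only attend to
    # the `window` most recent key positions at or before it.
    return 0 <= q_idx - kv_idx < window
```

With FlexAttention, the same logic would be passed as callables operating on score/index tensors rather than applied per element.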
Hello, I have started the addition of gemma2, I will create now my PR in WIP mode. I haven't run any test yet but will do soon!
Edit: My PR is here: #1835
@ebsmothers it would be great if you could have a quick look to validate the choices I made to implement sliding window attention, pre/post layer normalisation, and softcapping. I'm happy to take a different approach; I tried to keep all changes as minimal as possible.
I didn't know about FlexAttention, I will look into it!
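For readers unfamiliar with the pre/post layer normalisation pattern mentioned above, here is a minimal sketch of the residual-block structure (plain Python over lists, not torchtune code; `sublayer` is a placeholder for attention or MLP): the input is normalised before the sublayer, the sublayer output is normalised again, and only then is the residual added.

```python
import math

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned scale, for illustration only.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_post_norm_residual(x, sublayer):
    # Pre-norm -> sublayer -> post-norm -> residual add.
    h = sublayer(rms_norm(x))
    return [xi + hi for xi, hi in zip(x, rms_norm(h))]
```

This differs from the usual pre-norm-only transformer block, where the sublayer output is added to the residual stream without a second normalisation.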
Could you please add fine-tuning support for Gemma 2? It has good multilingual capabilities and is a good candidate for fine-tuning on languages other than English. Its range of sizes also makes it attractive for different tasks. I would gladly help but am not knowledgeable enough. Thank you!