linkedin / Liger-Kernel

Efficient Triton Kernels for LLM Training
https://arxiv.org/pdf/2410.10989
BSD 2-Clause "Simplified" License
3.61k stars 214 forks

Is eager attention still required for Gemma2? #398

Closed dachenlian closed 1 week ago

dachenlian commented 1 week ago

https://github.com/linkedin/Liger-Kernel/blob/81d98ea895255a44a0c787c7afa0ab7c34e32884/src/liger_kernel/transformers/model/gemma2.py#L61

Is this warning related to FlashAttention previously being unable to support Gemma2's softcapping? If so, that was fixed in v2.6.

ByronHsu commented 1 week ago

I copy-pasted the Gemma2 forward code and monkey patched it with our FLCE layer. It seems the eager-attention requirement has not been removed upstream yet; I will update after they make the change. (Maybe they meant to but forgot.)
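For readers unfamiliar with the patching approach described above, here is a minimal sketch of class-level monkey patching. `Gemma2ForCausalLM` and `liger_forward` are stand-ins, not the actual transformers or Liger-Kernel symbols:

```python
class Gemma2ForCausalLM:
    """Stand-in for the real Hugging Face model class."""
    def forward(self, x):
        return f"original({x})"

def liger_forward(self, x):
    # In the real patch, a replacement forward like this would route the
    # lm_head projection + loss through a fused linear cross entropy
    # (FLCE) kernel instead of the stock implementation.
    return f"fused({x})"

# Patch at the class level so every existing and future instance
# picks up the replacement forward.
Gemma2ForCausalLM.forward = liger_forward

model = Gemma2ForCausalLM()
print(model.forward("hidden_states"))  # fused(hidden_states)
```

The class-level assignment is what lets the patch apply without touching how the model is constructed or loaded.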

By the way, I saw you around when I was at NTU's gym. You lifted massive weights XD

dachenlian commented 1 week ago

Small world 😘 It's unfortunate we didn't hang out 🥺 Oh, and thank you for this amazing work!