Closed by danielkeysers 1 hour ago
Hey! A PR is being opened on the hub to fix it!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
In the meantime, the google-deepmind/gemma repository was also updated for gemma2 and also uses head_dim (i.e. 256) for the query_preattn[scalar/norm]: https://github.com/google-deepmind/gemma/blob/a0504162f99a1c238efb37b8197e711c0f3808fd/gemma/transformer.py#L195
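For concreteness, here is a minimal sketch of what "uses head_dim for the query pre-attention scalar" means (plain Python, illustrative only; the linked transformer.py is the actual Flax implementation):

```python
head_dim = 256  # Gemma 2 head dimension referenced above

# After the update, the query pre-attention scalar is head_dim itself,
# so queries are normalized by head_dim ** -0.5 rather than by 224 ** -0.5.
query_pre_attn_scalar = head_dim
query_scale = query_pre_attn_scalar ** -0.5  # = 0.0625
```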
System Info
(Sorry, I'm relatively new to GitHub; please let me know if this is not the right route to discuss this.)
[I don't think system info is relevant to this issue.]
Who can help?
@ArthurZucker
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
No specific reproduction; I think the value of "query_pre_attn_scalar" at https://github.com/huggingface/transformers/blob/cffa2b9c1dd825df1c0e949b99aaef1655c28625/src/transformers/models/gemma2/convert_gemma2_weights_to_hf.py#L64 should be changed from 224 to 256, mirroring the recent change in gemma_pytorch: https://github.com/google/gemma_pytorch/commit/03e657582d17cb5a8617ebf333c1c16f3694670e
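For illustration, a minimal sketch (not the actual transformers code; the function and shapes are assumptions for this example) of how the scalar enters the attention computation and how much the two values differ:

```python
import torch

# Sketch only: queries are scaled by query_pre_attn_scalar ** -0.5
# before the Q·K^T product, so the config value sets the softmax temperature.
def attention_scores(q, k, query_pre_attn_scalar):
    scaling = query_pre_attn_scalar ** -0.5
    return (q * scaling) @ k.transpose(-1, -2)

q = torch.randn(1, 8, 256)  # (batch, seq_len, head_dim), with head_dim = 256
k = torch.randn(1, 8, 256)

scores_224 = attention_scores(q, k, 224)  # value currently in the conversion script
scores_256 = attention_scores(q, k, 256)  # value matching the gemma_pytorch fix
# The two differ only by the constant factor sqrt(256 / 224) ≈ 1.069:
print((scores_224 / scores_256).mean())  # tensor(1.0690)
```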
Expected behavior
The differences in observed behavior are likely small, since the applied query scale only changes by a factor of sqrt(256/224) ≈ 1.07 (about 7%).