huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Gemma-2 9B query_pre_attn_scalar value #31891

Closed · danielkeysers closed this issue 1 hour ago

danielkeysers commented 2 months ago

System Info

(Sorry, I'm relatively new to GitHub; please let me know if this is not the right route to discuss this.)

[I think system info should not be relevant to this issue.]

Who can help?

@ArthurZucker

Information

Tasks

Reproduction

No specific "reproduction", but I think the value of "query_pre_attn_scalar" at https://github.com/huggingface/transformers/blob/cffa2b9c1dd825df1c0e949b99aaef1655c28625/src/transformers/models/gemma2/convert_gemma2_weights_to_hf.py#L64 should probably be changed from 224 to 256 to mirror the recent change in gemma_pytorch (https://github.com/google/gemma_pytorch/commit/03e657582d17cb5a8617ebf333c1c16f3694670e).
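
For reference, a minimal sketch of where the two values come from; the 9B shape numbers below are assumed from the public Gemma-2 9B config, not verified in this issue:

```python
# Assumed Gemma-2 9B shapes (from the public config):
hidden_size = 3584
num_attention_heads = 16
head_dim = 256

# Value the conversion script currently writes for query_pre_attn_scalar:
print(hidden_size // num_attention_heads)  # 224
# Value proposed here, mirroring the gemma_pytorch change:
print(head_dim)                            # 256
```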

Expected behavior

The differences in observed behavior are likely small given the small change in this scaling factor.
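
Rough arithmetic behind that expectation, assuming the scalar enters the attention as `query_pre_attn_scalar ** -0.5` on the query states (my reading of the Gemma-2 modeling code, not stated in this issue):

```python
old_scale = 224 ** -0.5        # ~0.0668
new_scale = 256 ** -0.5        # 0.0625
print(new_scale / old_scale)   # ~0.935: the pre-softmax query scale shrinks by roughly 6.5% with 256
```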

ArthurZucker commented 2 months ago

Hey! A PR is being opened on the hub to fix it!

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

danielkeysers commented 1 month ago

In the meantime, the google-deepmind/gemma repository was also updated for Gemma-2 and likewise uses head_dim (i.e. 256) for the query pre-attention scalar/norm: https://github.com/google-deepmind/gemma/blob/a0504162f99a1c238efb37b8197e711c0f3808fd/gemma/transformer.py#L195
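
A quick way to check which value a converted checkpoint on the Hub actually carries (model id assumed to be the public, gated `google/gemma-2-9b` repo):

```python
from transformers import AutoConfig

# Requires access to the gated Gemma-2 repo on the Hub.
config = AutoConfig.from_pretrained("google/gemma-2-9b")
print(config.query_pre_attn_scalar)  # should read 256 once the hub-side fix lands
```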

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.