keras-team / keras-nlp

Modular Natural Language Processing workflows with Keras
Apache License 2.0

Fix rope scaling factor #1605

Closed abuelnasr0 closed 2 months ago

abuelnasr0 commented 2 months ago

I think the use of scaling_factor in the RotaryEmbedding layer is wrong. It should be used to scale the positions, not the frequencies.
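To illustrate the distinction, here is a minimal NumPy sketch of linear RoPE position scaling (not the actual keras-nlp layer; the function name and defaults are illustrative): the scaling factor divides the position indices, while the inverse frequencies stay untouched.

```python
import numpy as np

def rope_angles(seq_len, rotary_dim, max_wavelength=10000, scaling_factor=1.0):
    # Standard RoPE inverse frequencies: 1 / base^(2i / d).
    inverse_freq = 1.0 / (
        max_wavelength ** (np.arange(0, rotary_dim, 2) / rotary_dim)
    )
    # Linear scaling stretches the positions, not the frequencies, so a model
    # trained on context length N can be run at roughly N * scaling_factor.
    positions = np.arange(seq_len) / scaling_factor
    # One rotation angle per (position, frequency) pair.
    return np.outer(positions, inverse_freq)

# With scaling_factor=2, position 4 sees the same angles that position 2 sees
# without scaling: the positions are compressed, the frequencies are kept.
unscaled = rope_angles(8, rotary_dim=64)
scaled = rope_angles(8, rotary_dim=64, scaling_factor=2.0)
assert np.allclose(scaled[4], unscaled[2])
```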

References:

mattdangerw commented 2 months ago

@abuelnasr0 also apologies, we just changed our entire directory structure in https://github.com/keras-team/keras-nlp/pull/1608

(Hopefully for good reason, we want to allow pip install -e . and pip install git+https:// while still keeping our explicit API surface)

But it does mean everything will need an annoying merge/rebase. If it'd help for me to do any of those and push to this branch just lmk!

abuelnasr0 commented 2 months ago

> Can we add some tests?

I can add tests, but on Sunday. Sorry for that, but I will be AFK until then.
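For reference, a test of the corrected behavior could look roughly like the following. This is only a sketch: it assumes the layer is keras_nlp.layers.RotaryEmbedding with a scaling_factor argument and default sequence/feature axes; the test name and tolerance are made up.

```python
import numpy as np
from keras_nlp.layers import RotaryEmbedding

def test_scaling_factor_scales_positions():
    # Use the same feature vector at every position so outputs can be
    # compared across positions directly.
    vec = np.random.uniform(size=(1, 1, 16)).astype("float32")
    x = np.tile(vec, (1, 8, 1))
    unscaled = RotaryEmbedding(scaling_factor=1.0)(x)
    scaled = RotaryEmbedding(scaling_factor=2.0)(x)
    # With scaling_factor=2, the rotation applied at position 4 should match
    # the unscaled rotation at position 2.
    np.testing.assert_allclose(
        np.array(scaled[:, 4]), np.array(unscaled[:, 2]), atol=1e-5
    )
```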

mattdangerw commented 2 months ago

> I can add tests, but on Sunday. Sorry for that, but I will be AFK until then.

No rush at all! And thanks so much for all the major contributions to the library :)

I am just getting back from vacation myself, slowly catching up on all the review.

abuelnasr0 commented 2 months ago

> thanks so much for all the major contributions to the library :)

You're welcome. I am trying to give back to the community as much as I can. Contributing to the library has actually helped me improve; I am learning new things with each PR. Thank you and the other authors for creating the library, and thank you for all your reviews, which have been really helpful to me.

abuelnasr0 commented 2 months ago

https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L96-L174

Check out these lines. What is implemented in this PR is LlamaLinearScalingRotaryEmbedding. I think it was implemented in the main layer at first, but they decided to move it to a new layer. There's also LlamaDynamicNTKScalingRotaryEmbedding, which uses scaling_factor in another way, but I think LlamaLinearScalingRotaryEmbedding is more popular.
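Roughly, the difference between the two strategies looks like this (a NumPy paraphrase of the linked transformers code, not the keras-nlp API; argument names follow the transformers layers):

```python
import numpy as np

def linear_scaling_angles(positions, dim, base=10000, scaling_factor=1.0):
    # LlamaLinearScalingRotaryEmbedding: divide the positions, keep the
    # frequencies fixed.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions / scaling_factor, inv_freq)

def dynamic_ntk_angles(
    positions, dim, base=10000, scaling_factor=1.0, max_position_embeddings=2048
):
    # LlamaDynamicNTKScalingRotaryEmbedding: leave the positions alone and
    # instead grow the base (shrinking the frequencies) once the sequence
    # exceeds the original training context.
    seq_len = len(positions)
    if seq_len > max_position_embeddings:
        base = base * (
            scaling_factor * seq_len / max_position_embeddings
            - (scaling_factor - 1)
        ) ** (dim / (dim - 2))
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(positions, inv_freq)
```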