Closed ganler closed 5 months ago
Hi @ganler, we simply changed rope_theta
to 16M, following this post. It would be interesting to try dynamic RoPE scaling without training on Llama3 models. We'll consider adding those results later.
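To give some intuition for what raising rope_theta does, here is a minimal sketch (our own illustration, not code from this repo) of the standard RoPE per-dimension frequencies, inv_freq[i] = theta ** (-2*i / dim). With a larger base such as 16M, every non-constant frequency rotates more slowly, so positional phases wrap around much later, which is one common intuition for why a larger base helps longer contexts:

```python
def rope_inv_freq(theta: float, dim: int) -> list[float]:
    # Standard RoPE frequencies: one rotation rate per pair of dimensions.
    # inv_freq[i] = theta ** (-2*i / dim), i = 0 .. dim/2 - 1
    return [theta ** (-2 * i / dim) for i in range(dim // 2)]

# Compare a 10k base against the 16M base mentioned above (head_dim=128
# is an assumption for illustration).
low = rope_inv_freq(10_000.0, 128)
high = rope_inv_freq(16_000_000.0, 128)
# apart from the i=0 term (always 1.0), every frequency shrinks with
# the larger base, i.e. the rotations are slower
```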
Thank you! We tried to use dynamic RoPE scaling and it significantly improved Llama3 models (https://evalplus.github.io/repoqa.html).
Do you have any hints as to why using a 16M rope_theta also works much better? Thanks!
Not sure which dynamic RoPE scaling technique you are referring to. In Hugging Face, we have dynamic NTK scaling here to dynamically increase the RoPE base based on the input sequence length, which is similar to directly changing rope_theta to a large value. As for why a large base is useful, there are plenty of papers investigating tricks for modifying RoPE: https://arxiv.org/pdf/2309.16039 https://arxiv.org/pdf/2310.05209 https://arxiv.org/pdf/2309.00071 https://arxiv.org/pdf/2402.13753
Dynamic RoPE scaling: https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/
Basically, it dynamically adjusts the scale factor to context_len / model_len when context_len > model_len. It seems to be the same thing your code is doing.
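The scale-factor idea above can be sketched in a few lines. This is a hedged illustration, not code from either project: the function name and the exact growth formula (scale raised to dim/(dim-2), modeled on the commonly described dynamic-NTK variant) are our assumptions.

```python
def dynamic_ntk_base(base: float, head_dim: int,
                     context_len: int, model_len: int) -> float:
    # Sketch of "dynamic" RoPE base scaling (our assumption of the formula):
    # within the trained length, keep the base untouched; beyond it, grow
    # the base by the scale factor context_len / model_len, raised to
    # head_dim / (head_dim - 2) as in the NTK-aware scaling recipe.
    if context_len <= model_len:
        return base
    scale = context_len / model_len
    return base * scale ** (head_dim / (head_dim - 2))

# Example (all numbers illustrative): a model trained at 8k context,
# queried at 32k, with 128-dim heads and the usual 10k base.
new_base = dynamic_ntk_base(10_000.0, 128, 32_768, 8_192)
# new_base > 10_000: the base grows with the context, which is the same
# effect as statically picking a large rope_theta up front
```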
"dynamically increase the RoPE base based on the input sequence length, which is similar to directly changing rope_theta to a large value"
I don't quite see the similarity, but thanks for the references; I will check them. Thank you!
Thanks for the great work!
From the README:
May I know the reason for adjusting rope_theta here rather than directly using, say, dynamic RoPE scaling? Thanks!