Hello!
I was truly impressed by the paper "A Length-Extrapolatable Transformer". I am most interested in training LLMs designed for long sequence lengths. You point out an issue with RoPE: it tends to be unstable as the relative distance between two tokens grows, which leads to a degradation of precision. To regularize this, you introduce a decay factor into the positional encoding function:
$$g_\zeta[n] = \sum_{i=0}^{d/2}\cos(n\theta_i)\,\zeta_i^n,$$ where $\zeta_i\in[0,1]$. Here I have an unresolved question, for which I could not find an answer in the paper or in the code. The issue is that $n$ can be negative: $n$ is actually some $\hat{m}-\hat{n}$, where $\hat{m}$ is the position in Q and $\hat{n}$ is the position in K, so $n$ runs through the interval $(-s, s)$. Because $\zeta_i\in[0,1]$, for negative $n$ and growing relative distances the function $g_\zeta[n]$ must grow without bound. Surely we cannot take the absolute value of $n$, because in that case we would discard the important property that the positional embedding does not depend on the offset. Can you please explain how you handle this issue?
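To illustrate what I mean, here is a quick numerical sketch of $g_\zeta[n]$. The $\theta_i$ schedule follows the usual RoPE convention, and the $\gamma$-based $\zeta_i$ schedule and the choice of $d$ are my own assumptions for illustration, not necessarily the exact values from your paper:

```python
import math

def g_zeta(n, d=64, base=10000.0, gamma=0.4):
    """Sketch of g_zeta[n] = sum_i cos(n * theta_i) * zeta_i**n.

    theta_i uses the standard RoPE schedule base**(-2i/d); the zeta_i
    schedule here (interpolating toward 1 via gamma) is an illustrative
    assumption, chosen only so that every zeta_i lies in (0, 1).
    """
    total = 0.0
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        zeta = (i / (d / 2) + gamma) / (1 + gamma)  # in (0, 1)
        total += math.cos(n * theta) * zeta ** n
    return total

# For n >= 0 every term is bounded by 1 in magnitude, so |g_zeta(n)| <= d/2.
# For n < 0, zeta**n = (1/zeta)**|n| grows exponentially with |n|.
print(abs(g_zeta(128)))
print(abs(g_zeta(-128)))
print(abs(g_zeta(-256)))
```

Running this, the magnitude at $n=-128$ is astronomically larger than at $n=128$, and keeps exploding as $|n|$ grows, which is exactly the behavior I am worried about.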
Thank you!