microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI

Question on decay factor for attention with xPos #66

Closed mvbakulin closed 10 months ago

mvbakulin commented 10 months ago

Hello! I was truly impressed by the paper "A Length-Extrapolatable Transformer". I am particularly interested in training LLMs for very large sequence lengths. You point out an issue with RoPE: it tends to become unstable as the relative distance between two tokens grows, which leads to a degradation of precision. As a regularization, you introduce a decay factor into the positional-encoding function $$g_{\zeta}[n] = \sum_{i=0}^{d/2}\cos(n\theta_i)\,\zeta_i^{n},$$ where $\zeta_i\in[0,1]$.

Here I have a question that I could not resolve from the paper or the code. The relative position $n$ can be negative, because $n$ is really $\hat{m}-\hat{n}$, where $\hat{m}$ is the position in $Q$ and $\hat{n}$ is the position in $K$, so $n$ runs through the interval $(-s, s)$. Since $\zeta_i\in[0,1]$, for negative $n$ and growing relative distance the function $g_{\zeta}[n]$ must grow without bound. We clearly cannot just take the absolute value of $n$, because then we would discard the important property that the positional encoding does not depend on the absolute offset. Can you please explain how you handle this issue? Thank you!
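
To spell out where I think the problem comes from (this is my reading of the paper, with the rotation omitted for brevity): the query at position $\hat{m}$ is scaled by $\zeta_i^{\hat{m}}$ and the key at position $\hat{n}$ by $\zeta_i^{-\hat{n}}$, so their product carries only the relative exponent

$$\left(\zeta_i^{\hat{m}}\, q\right)\cdot\left(\zeta_i^{-\hat{n}}\, k\right) = \zeta_i^{\hat{m}-\hat{n}}\,(q\cdot k) = \zeta_i^{\,n}\,(q\cdot k),$$

and for $n < 0$ with $\zeta_i < 1$ this factor exceeds $1$ and keeps growing as $|n|$ increases.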

sunyt32 commented 10 months ago

In the xPos paper, we focus only on the unidirectional (causal) model, where $n$ is always non-negative.
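
To make this concrete, here is a minimal sketch of how the decay enters the attention scores. It is an illustrative re-implementation rather than the exact torchscale code: the rotation part of xPos is left out, and `gamma=0.4` / `scale_base=512` are just representative values.

```python
import torch

def xpos_decay(head_dim, gamma=0.4):
    # Per-pair decay factors zeta_i in (0, 1); gamma is an illustrative choice here.
    i = torch.arange(0, head_dim, 2, dtype=torch.float32)
    return (i / head_dim + gamma) / (1 + gamma)            # shape: (head_dim // 2,)

def apply_xpos_scaling(q, k, scale_base=512, gamma=0.4):
    # q, k: (seq_len, head_dim). Only the decay is shown, not the rotation.
    # Queries at position m are multiplied by zeta ** (m / scale_base) and keys at
    # position n by zeta ** (-n / scale_base), so the dot product q_m . k_n picks up
    # zeta ** ((m - n) / scale_base) -- a function of the relative distance only.
    seq_len, head_dim = q.shape
    zeta = torch.repeat_interleave(xpos_decay(head_dim, gamma), 2)   # (head_dim,)
    pos = torch.arange(seq_len, dtype=torch.float32)[:, None]        # (seq_len, 1)
    q_scaled = q * zeta ** (pos / scale_base)
    k_scaled = k * zeta ** (-pos / scale_base)
    return q_scaled, k_scaled

# In a causal model m >= n, so the exponent (m - n) is never negative and the
# factor stays <= 1: it decays with distance instead of blowing up.
q, k = torch.randn(8, 64), torch.randn(8, 64)
q_s, k_s = apply_xpos_scaling(q, k)
scores = q_s @ k_s.T            # entry (m, n) carries zeta ** ((m - n) / scale_base)
causal_scores = scores.tril()   # only n <= m is kept, so the exponent is non-negative
```

Because causal attention only lets position $\hat{m}$ attend to positions $\hat{n} \le \hat{m}$, every decay factor stays in $(0, 1]$, so the blow-up described in the question does not arise in the unidirectional setting.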