Question regarding RoPE implementation

lxy1108 commented 6 days ago

Thanks for sharing the code of this interesting work!

I noticed an inconsistency between the paper’s description of the RoPE and its implementation in the code. According to the paper, the relative position should be calculated based on the temporal differences between patches,

However, the code seems to be using flattened 2D indices instead when applying RoPE:

seq_id = torch.arange(n_tokens * n_vars)
seq_id = repeat(seq_id, 'n -> b h n', b=B, h=H)
queries, keys = self.qk_proj(queries, keys, query_id=seq_id, kv_id=seq_id)

Could you clarify the reasoning behind this discrepancy? Was this an intentional change, or might it affect the model’s performance?

WenWeiTHU commented 6 days ago

Update: changing "permutation-invariant between variables" to "permutation-equivariant between variables" Thanks for your valuable issue, the current 2d RoPE implementation in this repo is flattened on 2D indices (which is mainly to be compatible with the implementation of Moirai's official repo).

After our discussion, we believe that it is reasonable to flatten out one-dimensional indices of RoPE to ensure the permutation-equivariant between variables. We are arranging relevant experiments, and we'll see how that affects the performance. Please stay tuned and thanks again for your insightful question.

lxy1108 commented 6 days ago

Thanks for your response! I am looking forward to seeing the relevant experiments :)

lxy1108 commented 6 days ago

Thanks for your valuable issue, the current 2d RoPE implementation in this repo is flattened on 2D indices (which is mainly to be compatible with the implementation of Moirai's official repo).

After our discussion, we believe that it is reasonable to flatten out one-dimensional indices of RoPE to ensure the permutation-invariance between variables. We are arranging relevant experiments, and we'll see how that affects the performance. Please stay tuned and thanks again for your insightful question.

Thank you again for the detailed response! Could you elaborate on how this flattened RoPE achieves permutation invariance between variables?

WenWeiTHU commented 2 days ago

Update: Sorry for the misunderstanding, we have changed the "permutation-invariant for variables" to "permutation-equivariant for variables" (For their difference, please refer to https://arxiv.org/abs/1703.06114). We will also revise this term in our paper.

This is a very interesting and important question!

Note that permutation equivariant between variables means shuffling the input order of variables should not affect anything other than the output order of variables.

For example, if we have $T$ patch tokens for $N$ variables each. Now, we are going to mark the temporal position of these $NT$ tokens. Instead of using 2D RoPE: $R{NT, NT}$, we think it is more reasonable to use 1D $R{T, T}$ and repeat it by using the Kronecker Product: $I{N\times N} \bigotimes R{T, T}$. And thus $N$ will not influence the element values in varying sub-matrice (working for different variable pairs).

WenWeiTHU commented 2 days ago

Further, to mark the variable position of these $NT$ tokens, what we do is only distinguish whether the token belongs to the variable or not. More concisely, if token A belongs to variable 1 and does not belong to variable 2 and variable 3, we only care that V1 is different from V2/V3, but we are not concerned that V2 is different from V3 (if we do so, change the input order of V2 and V3 will cause additional effect to the output). Therefore, for any token, confirming its endogenous and exogenous variables is enough. That is why we use the term "keep the equivariance of variables" by $u$ and $v$ in our work.

thuml / OpenLTM

Question regarding RoPE implementation #10