guolinke / TUPE

Transformer with Untied Positional Encoding (TUPE). Code for the paper "Rethinking Positional Encoding in Language Pre-training". It improves existing models such as BERT.

Definition of two learnable parameters #22

Open tom68-ll opened 2 years ago

tom68-ll commented 2 years ago

Hi Author,

We do not quite understand the definition of the two learnable parameters \theta_1 and \theta_2 from Figure 4, which appear to be computed in the following lines:

https://github.com/guolinke/TUPE/blob/4c64ff748a7039be4918429f05bbb43d81357107/fairseq/modules/transformer_sentence_encoder.py#L234
https://github.com/guolinke/TUPE/blob/4c64ff748a7039be4918429f05bbb43d81357107/fairseq/modules/transformer_sentence_encoder.py#L236

We would appreciate it if you could explain it to us.

guolinke commented 2 years ago

The calculation of both is merged into the bmm above: entry [0, 0] is p_0 \cdot p_0, and entry [1, 1] is p_1 \cdot p_1.

tom68-ll commented 2 years ago

Thank you very much for your patient answer! But I don't quite understand why p_1 \cdot p_1 is used to represent "others to [CLS]". Could you please explain it again for me? Thank you again!

guolinke commented 2 years ago

Oh, in the paper \theta_1 and \theta_2 are written that way for clearer presentation; they are actually computed from p_0 and p_1. I think you already understand p_0 \cdot p_0. p_1 \cdot p_1 lands at index [1, 1]. Then abs_pos_bias = abs_pos_bias[:, 1:, 1:] is applied, so [1, 1] becomes [0, 0]. Finally, the first column [:, 0] is set to the value at [0, 0].
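
For readers following along, here is a minimal sketch of that flow (the shapes and names such as `n_heads` and `head_dim` are illustrative assumptions, not a verbatim excerpt of the repository code):

```python
import torch

# Illustrative sizes (assumptions): n_heads attention heads, seq_len positions
# including [CLS], head_dim per-head dimension.
n_heads, seq_len, head_dim = 12, 8, 64

# Projected (untied) position embeddings; one extra position so that
# p_0 . p_0 and p_1 . p_1 can supply the two [CLS]-related scalars.
pos_q = torch.randn(n_heads, seq_len + 1, head_dim)
pos_k = torch.randn(n_heads, seq_len + 1, head_dim)

# All position-to-position correlations in a single bmm.
abs_pos_bias = torch.bmm(pos_q, pos_k.transpose(1, 2))  # (n_heads, seq_len + 1, seq_len + 1)

cls_2_other = abs_pos_bias[:, 0, 0]  # theta_1 = p_0 . p_0, "[CLS] to others"
other_2_cls = abs_pos_bias[:, 1, 1]  # theta_2 = p_1 . p_1, "others to [CLS]"

# Drop the auxiliary position 0; the entry that was at [1, 1] is now at [0, 0].
abs_pos_bias = abs_pos_bias[:, 1:, 1:]  # (n_heads, seq_len, seq_len)

# Untie [CLS]: the first column (attending to [CLS]) gets theta_2,
# the first row (attending from [CLS]) gets theta_1.
abs_pos_bias[:, :, 0] = other_2_cls.view(-1, 1)
abs_pos_bias[:, 0, :] = cls_2_other.view(-1, 1)
```

In this sketch the overlapping [0, 0] corner (the [CLS]-to-[CLS] entry) simply takes whichever of the two assignments is written last.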

Romain3Ch216 commented 8 months ago

Hi,

Following on from the above question, could you please explain why the final size of the position-only correlation matrix is L x L? I thought it was an (L+1) x (L+1) matrix whose first row and first column are filled with \theta_1 and \theta_2, respectively, to encode the "[CLS] to others" and "others to [CLS]" correlations.

EDIT: OK, I got it: the notation seq_len in your code is equal to L + 1, where L is the number of tokens excluding the [CLS] token.
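
To make the sizes concrete, a small shape check (the numbers are hypothetical; seq_len counts the [CLS] token, so seq_len = L + 1):

```python
import torch

L = 7                # real tokens, excluding [CLS]
seq_len = L + 1      # the code's seq_len, which includes [CLS]
n_heads = 12

# (seq_len + 1) positions are used internally so that p_0 and p_1 can
# supply the two [CLS]-related scalars before slicing.
full = torch.randn(n_heads, seq_len + 1, seq_len + 1)
final = full[:, 1:, 1:]
print(final.shape)   # torch.Size([12, 8, 8]) == (n_heads, seq_len, seq_len)
```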

Thank you, Romain