tom68-ll opened this issue 2 years ago
The calculation of them is merged into the bmm above: entry [0, 0] is p_0 \cdot p_0, and entry [1, 1] is p_1 \cdot p_1.
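For reference, here is a minimal sketch of where those two entries come from (a simplified reconstruction with assumed toy shapes, not the verbatim repository code; the names pos_q, pos_k, and abs_pos_bias follow the linked module):

```python
import torch

num_heads, num_pos, head_dim = 2, 7, 4   # toy sizes, just for illustration

# per-head projected absolute position embeddings ("positional" queries and keys)
pos_q = torch.randn(num_heads, num_pos, head_dim)
pos_k = torch.randn(num_heads, num_pos, head_dim)

# a single batched matmul gives every p_i . p_j at once
abs_pos_bias = torch.bmm(pos_q, pos_k.transpose(1, 2))  # (num_heads, num_pos, num_pos)

cls_2_other = abs_pos_bias[:, 0, 0]   # p_0 . p_0, one scalar per head ("[CLS] to others")
other_2_cls = abs_pos_bias[:, 1, 1]   # p_1 . p_1, one scalar per head ("others to [CLS]")
```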
Thank you very much for your patient answer! But I don't quite understand why p_1 \cdot p_1 is used to represent "others to [CLS]". Could you please explain it again for me? Thank you again!
Oh, in the paper \theta_1 and \theta_2 are written that way for clearer presentation; they are actually computed from p_0 and p_1. I think you already understand p_0 \cdot p_0. p_1 \cdot p_1 lands at entry [1, 1]; then abs_pos_bias = abs_pos_bias[:, 1:, 1:], so the old [1, 1] becomes the new [0, 0]. Finally, the first column [:, 0] is set to that value at [0, 0].
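To make the offset concrete, here is a hedged end-to-end sketch of that untying step (shapes are assumed, and the single extra leading position embedding used only for the two scalars is an assumption based on this explanation; this is a reconstruction, not the exact source):

```python
import torch

num_heads, head_dim = 2, 4
seq_len = 6   # model input length, i.e. [CLS] + 5 real tokens (an assumed toy size)

# assume one extra leading position embedding (index 0), used only to produce
# the two [CLS]-related scalars, so the raw matrix is (seq_len+1) x (seq_len+1)
pos_q = torch.randn(num_heads, seq_len + 1, head_dim)
pos_k = torch.randn(num_heads, seq_len + 1, head_dim)
abs_pos_bias = torch.bmm(pos_q, pos_k.transpose(1, 2))   # (num_heads, seq_len+1, seq_len+1)

cls_2_other = abs_pos_bias[:, 0, 0]   # p_0 . p_0, i.e. "[CLS] attends to others"
other_2_cls = abs_pos_bias[:, 1, 1]   # p_1 . p_1, i.e. "others attend to [CLS]"

# drop the extra leading row/column: the old [1, 1] entry now sits at [0, 0]
abs_pos_bias = abs_pos_bias[:, 1:, 1:]                    # (num_heads, seq_len, seq_len)

# untie [CLS] from positional correlations
abs_pos_bias[:, :, 0] = other_2_cls.view(-1, 1)   # column 0: every query attending to [CLS]
abs_pos_bias[:, 0, :] = cls_2_other.view(-1, 1)   # row 0: [CLS] attending to every key
```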
Hi,
Following on from the above question, could you please explain why the final size of the position-only correlation matrix is L x L? I thought it was an (L+1) x (L+1) matrix whose first row and first column are filled with \theta_1 and \theta_2, respectively, to encode the "[CLS] to others" and "others to [CLS]" correlations.
EDIT: OK, I got it: the notation seq_len in your code is equal to L + 1, where L is the number of tokens without the [CLS] token (a quick shape check is sketched below).
Thank you, Romain
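A tiny shape check under that reading (seq_len = L + 1; the extra leading position used only for the two scalars is an assumption carried over from the sketches above):

```python
import torch

L = 5                  # number of tokens without [CLS]
seq_len = L + 1        # what the code calls seq_len: [CLS] + L tokens
num_heads, head_dim = 2, 4

# one assumed extra leading position, used only for the two scalars
pos_q = torch.randn(num_heads, seq_len + 1, head_dim)
pos_k = torch.randn(num_heads, seq_len + 1, head_dim)

abs_pos_bias = torch.bmm(pos_q, pos_k.transpose(1, 2))[:, 1:, 1:]
# (L+1) x (L+1) per head, matching the attention map over [CLS] + L tokens
assert abs_pos_bias.shape == (num_heads, seq_len, seq_len)
```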
Hi Author,
We do not quite understand the definition of the two learnable parameters \theta_1 and \theta_2 in Figure 4, given how they are computed here:
https://github.com/guolinke/TUPE/blob/4c64ff748a7039be4918429f05bbb43d81357107/fairseq/modules/transformer_sentence_encoder.py#L234
https://github.com/guolinke/TUPE/blob/4c64ff748a7039be4918429f05bbb43d81357107/fairseq/modules/transformer_sentence_encoder.py#L236
We would appreciate it if you could explain it to us.