chaojiewang94 opened this issue 9 months ago
I appreciate your interest in our work.
Guangxuan
Thanks for your update. Please allow me to ask some further questions.
I am just a little curious why this phenomenon also happens in the attention at the first layer. The token embedding fed into the first layer is the composition of the position embedding and the word (semantic) embedding. If changing the initial words (word embeddings) and removing the position embeddings (as you said) do not affect the conclusion, then I do not understand why the "attention sink" would occur at the first layer; I would expect the attention there to look more like a uniform distribution.
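To make the question concrete, here is a rough sketch of how one could check the first-layer pattern with HuggingFace `transformers` (GPT-2 and the `sink_mass` helper are just illustrative stand-ins I made up, not your actual setup):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

text = "The quick brown fox jumps over the lazy dog. " * 8
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs)

# out.attentions is a tuple with one (batch, heads, seq, seq) tensor per layer.
def sink_mass(attn, n_sink=1, skip=4):
    # Average attention that queries after position `skip` put on the
    # first `n_sink` keys.
    return attn[0, :, skip:, :n_sink].sum(-1).mean().item()

print("layer 0  mass on token 0:", sink_mass(out.attentions[0]))
print("layer -1 mass on token 0:", sink_mass(out.attentions[-1]))
```

If the first layer were close to uniform, I would expect its sink mass to be near `1 / seq_len`, unlike the deeper layers.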
Thanks for your awesome work. I have some questions about the concept of initial tokens and the implementation of learnable initial tokens.
So, can I understand this phenomenon to mean that the position embeddings of the first (four) tokens play an important role in attracting the attention weights during the generation of the following tokens?
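And to double-check my reading of the "learnable initial tokens" part: is it roughly like prepending a trainable placeholder embedding to the input, along these lines? (The `SinkPrefix` name and shapes are mine, not from your code; this is just a sketch of my understanding.)

```python
import torch
import torch.nn as nn

class SinkPrefix(nn.Module):
    """Prepend trainable placeholder embeddings so the attention sink
    does not have to be carried by a real word's embedding."""

    def __init__(self, d_model: int, n_sink: int = 1):
        super().__init__()
        # One trainable embedding per sink token, learned during pre-training.
        self.sink = nn.Parameter(torch.empty(n_sink, d_model))
        nn.init.normal_(self.sink, std=0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq, d_model) -> (batch, n_sink + seq, d_model)
        prefix = self.sink.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([prefix, token_embeds], dim=1)

# Usage (hypothetical): embeds = SinkPrefix(d_model=768)(wte(input_ids))
```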
Thanks