microsoft / torchscale

Foundation Architecture for (M)LLMs
https://aka.ms/GeneralAI
MIT License
3.01k stars 202 forks

xPos cross-attention change #28

Closed janEbert closed 1 year ago

janEbert commented 1 year ago

Hey, I noticed that, compared to the old implementation at https://github.com/sunyt32/torchscale, xPos is no longer used for cross-attention between decoder inputs and encoder outputs. In the old implementation, the scaling was simply inverted for that case.
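
For reference, here is a minimal, self-contained sketch of the xPos idea (rotary embedding with an exponential decay scale), so it's clear what "inverting the scaling" refers to. The function name `xpos`, the `downscale` flag, and the `scale_base` default are illustrative only and not the repo's actual API:

```python
import torch


def rotate_every_two(x):
    # Pairwise rotation used by rotary embeddings: (a, b) -> (-b, a).
    x1, x2 = x[..., ::2], x[..., 1::2]
    return torch.stack((-x2, x1), dim=-1).flatten(-2)


def xpos(x, offset=0, downscale=False, scale_base=512):
    """Rotary embedding with the xPos exponential decay scale (a sketch).

    x: (seq_len, head_dim) with an even head_dim.
    downscale=False applies the per-position scale, downscale=True its
    inverse, so the q @ k^T score decays with the relative distance |n - m|.
    """
    seq_len, dim = x.shape
    # Per-dimension decay base in (0, 1), following the xPos paper.
    base = (torch.arange(0, dim, 2, dtype=torch.float32) + 0.4 * dim) / (1.4 * dim)
    positions = torch.arange(offset, offset + seq_len, dtype=torch.float32)
    scale = base ** (positions[:, None] / scale_base)       # (seq_len, dim / 2)
    if downscale:
        scale = 1.0 / scale

    # Standard rotary angles.
    inv_freq = 1.0 / (10000 ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions[:, None] * inv_freq                  # (seq_len, dim / 2)
    # Repeat each column twice so scales/angles line up with the (a, b) pairs.
    sin = (torch.sin(angles) * scale).repeat_interleave(2, dim=-1)
    cos = (torch.cos(angles) * scale).repeat_interleave(2, dim=-1)
    return x * cos + rotate_every_two(x) * sin
```

With a helper like this, self-attention would apply `xpos(q)` to queries and `xpos(k, downscale=True)` to keys; the inverted-scale cross-attention path described above amounts to flipping those `downscale` arguments for the decoder-encoder case.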

Could you help me understand why the change to not using xPos (or any positional encoding, for that matter) for cross-attention was made? Does this produce better results than those reported in the LeX/xPos paper?

@shumingma @sunyt32

sunyt32 commented 1 year ago

Our paper focuses on the self-attention setting, which we evaluated with thorough experiments. For cross-attention, the migration is not complex: we can add the position embedding to the encoder output. However, we haven't verified xPos's effectiveness for cross-attention, so we removed it from the official repo.
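
A rough illustration of that migration, reusing the hypothetical `xpos` helper sketched above (the tensor names `decoder_hidden` and `encoder_output` are assumed inputs; this is untested and not what the repo ships):

```python
# Sketch: cross-attention with xPos, assuming the `xpos` helper above.
# Decoder queries use decoder positions; keys derived from the encoder
# output use encoder positions with the inverted scale.
q = xpos(decoder_hidden, offset=0, downscale=False)   # (tgt_len, head_dim)
k = xpos(encoder_output, offset=0, downscale=True)    # (src_len, head_dim)
attn_scores = q @ k.transpose(-1, -2)                 # (tgt_len, src_len)
```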

janEbert commented 1 year ago

Thank you for the clarification!