Closed janEbert closed 1 year ago
Hey, I noticed that compared to the old implementation at https://github.com/sunyt32/torchscale, xPos is no longer used for cross-attention between decoder inputs and encoder outputs. In the old implementation, the scaling was simply inverted for that case.

Could you help me understand why the change toward not using xPos (or any positional encoding, for that matter) for cross-attention was made? Does this produce better results than those reported in the LeX/xPos paper?

@shumingma @sunyt32

Our paper focuses on the self-attention setting, which is evaluated with thorough experiments. For cross-attention, the migration is not complex: we can add the position embedding to the encoder output. However, we haven't verified xPos's effectiveness on cross-attention, so we removed it from the official repo.

Thank you for the clarification!
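For readers following along, here is a minimal numpy sketch of the xPos scaling being discussed: a rotary embedding whose query side is scaled by ζ^n and key side by ζ^(-n), so attention scores pick up a factor ζ^(m−n) that decays with relative distance. Flipping the `downscale` flag on one side is the kind of "inverted scaling" the old cross-attention path used. All function names, the `gamma=0.4` default, and the omission of the `scale_base` position normalization are simplifications for illustration, not the torchscale API.

```python
import numpy as np

def xpos_decay(head_dim, gamma=0.4):
    # Per-pair decay bases zeta_i = (2i/d + gamma) / (1 + gamma),
    # one value per rotary pair, repeated to cover the full head dim.
    idx = np.arange(0, head_dim, 2) / head_dim
    zeta = (idx + gamma) / (1 + gamma)
    return np.repeat(zeta, 2)

def rotary_angles(head_dim, pos):
    # Standard RoPE frequencies; theta has shape (seq, head_dim).
    inv_freq = 1.0 / (10000 ** (np.arange(0, head_dim, 2) / head_dim))
    theta = pos[:, None] * inv_freq[None, :]
    return np.repeat(theta, 2, axis=-1)

def rotate_half(x):
    # Pairwise (x1, x2) -> (-x2, x1), the rotation's imaginary part.
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2], out[..., 1::2] = -x2, x1
    return out

def apply_xpos(x, pos, head_dim, downscale=False, gamma=0.4):
    """Rotary embedding with xPos scaling (illustrative sketch).

    downscale=False multiplies by zeta**pos (query side),
    downscale=True by zeta**(-pos) (key side). Real implementations
    also divide pos by a scale_base; omitted here for brevity.
    """
    theta = rotary_angles(head_dim, pos)
    rotated = x * np.cos(theta) + rotate_half(x) * np.sin(theta)
    scale = xpos_decay(head_dim, gamma)[None, :] ** pos[:, None]
    if downscale:
        scale = 1.0 / scale
    return rotated * scale
```

Because the query picks up ζ^m and the key ζ^(−n), the dot product depends only on m−n, which is what makes the scheme translation-invariant and length-extrapolating; inverting `downscale` on one side flips the sign of that exponent.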