❓ Usefulness of the opponent function

As I understand your paper, opponent attention is trained through softmin.
Softmin is the reason why conventional attention and opponent attention are trained in opposite directions.
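
To make my reading concrete, here is a minimal sketch of the relationship I have in mind, assuming softmin is simply softmax applied to the negated scores. The function names and the example scores are hypothetical and are not taken from the paper or from this repository.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmin(scores):
    """Softmin, assumed here to be softmax of the negated scores."""
    return softmax([-s for s in scores])

# Hypothetical raw attention scores for three source positions.
scores = [2.0, 0.5, -1.0]

conventional = softmax(scores)  # most weight on the highest score
opponent = softmin(scores)      # most weight on the lowest score
```

Under this assumption, the same raw scores give two distributions with reversed rankings, which is what I mean by "trained in opposite directions".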

However, what is the point of the opponent function?

If the opponent function were removed, the result would be equivalent to another attention head, trained negatively (because of softmin).
So why add such a function, which, as I understand it, simply masks the existing conventional attention scores (and therefore loses information)?
