Code: https://github.com/yzh119/BPT
Method
Self-attention in the Transformer can be interpreted as a fully connected graph whose nodes are the tokens.
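As a rough illustration (the edge-list construction below is mine, not code from the BPT repo), vanilla self-attention corresponds to the complete directed graph over tokens, so the number of edges grows quadratically with sequence length:

```python
# A minimal sketch: vanilla self-attention as a fully connected directed graph,
# where every token attends to every token (including itself).

def full_attention_edges(n_tokens):
    """Return the edge list of the complete self-attention graph."""
    # Every ordered pair (i, j) is an edge: token i attends to token j.
    return [(i, j) for i in range(n_tokens) for j in range(n_tokens)]

edges = full_attention_edges(4)
print(len(edges))  # 16 edges for 4 tokens: the edge count grows as n^2
```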
This paper proposes a new edge connection rule:
By recursively binary-partitioning the input sequence, a binary tree is built over it: the leaves are the tokens and the internal nodes are spans. Edges are then drawn between affiliated nodes (each span node and the tokens it contains) and context nodes (the span nodes a token attends to, at fine-to-coarse granularity), as in the sketch below.
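Here is a minimal sketch of this construction under my own assumptions (one context node per level, i.e. the simplest setting; all names are mine, not the repo's). Edges are (source, target) message-passing pairs: each span node receives affiliated edges from the tokens it contains, and each token receives contextual edges from the sibling spans along its leaf-to-root path, which together cover the rest of the sequence at fine-to-coarse granularity:

```python
def build_tree(lo, hi, nodes, parent=None):
    """Recursively binary-partition the interval [lo, hi) into span nodes."""
    node_id = len(nodes)
    nodes.append({"span": (lo, hi), "parent": parent, "children": []})
    if parent is not None:
        nodes[parent]["children"].append(node_id)
    if hi - lo > 1:
        mid = (lo + hi) // 2
        build_tree(lo, mid, nodes, node_id)
        build_tree(mid, hi, nodes, node_id)
    return node_id

def bpt_edges(n):
    nodes = []
    build_tree(0, n, nodes)
    # Map each token position to its leaf node (spans of length 1).
    leaf_of = {nodes[i]["span"][0]: i for i in range(len(nodes))
               if nodes[i]["span"][1] - nodes[i]["span"][0] == 1}
    affiliated, contextual = [], []
    for i, node in enumerate(nodes):
        lo, hi = node["span"]
        if hi - lo > 1:
            # Affiliated edges: each contained token updates its span node.
            affiliated += [(t, i) for t in range(lo, hi)]
    for t in range(n):
        # Contextual edges: the sibling spans along the leaf-to-root path
        # serve as the context of token t.
        cur = leaf_of[t]
        while nodes[cur]["parent"] is not None:
            p = nodes[cur]["parent"]
            for c in nodes[p]["children"]:
                if c != cur:
                    contextual.append((c, t))
            cur = p
    return nodes, affiliated, contextual

nodes, aff, ctx = bpt_edges(8)
print(len(ctx) // 8)  # 3 context spans per token for n = 8: O(log n), not O(n)
```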
Finally, a relative positional encoding is added to the attention at each layer.
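The exact formula is not reproduced in these notes; a plausible form, assuming the Shaw-style scheme in which a learned embedding of the relation between two nodes is added to the keys, is:

```latex
% Assumption: Shaw-style relative encoding, where r_{ij} is a learned embedding
% of the relation between query node i and key node j (e.g. self, affiliated,
% or the level and side of a context span), and N(i) is the set of neighbors
% of node i in the sparse graph.
e_{ij} = \frac{x_i W^Q \, (x_j W^K + r_{ij})^{\top}}{\sqrt{d}},
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}
```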
Results
Language model
Document Translation