Code: https://github.com/yzh119/BPT
Method
Self-attention in the Transformer can be interpreted as a fully connected graph whose nodes are the tokens.
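As a rough illustration (the edge-list construction below is mine, not code from the BPT repo), vanilla self-attention corresponds to the complete directed graph over tokens, so the number of edges grows quadratically with sequence length:

```python
# A minimal sketch: vanilla self-attention as a fully connected directed graph,
# where every token attends to every token (including itself).

def full_attention_edges(n_tokens):
    """Return the edge list of the complete self-attention graph."""
    # Every ordered pair (i, j) is an edge: token i attends to token j.
    return [(i, j) for i in range(n_tokens) for j in range(n_tokens)]

edges = full_attention_edges(4)
print(len(edges))  # 16 edges for 4 tokens: the edge count grows as n^2
```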
This paper proposes a new edge connection rule:
By recursively binary-partitioning the input sequence, a binary tree is built over it: the leaves are the tokens and the internal nodes are spans. Edges are then drawn between affiliated nodes (each span node and the tokens it contains) and context nodes (the span nodes a token attends to, at fine-to-coarse granularity), as in the sketch below.
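Here is a minimal sketch of this construction under my own assumptions (one context node per level, i.e. the simplest setting; all names are mine, not the repo's). Edges are (source, target) message-passing pairs: each span node receives affiliated edges from the tokens it contains, and each token receives contextual edges from the sibling spans along its leaf-to-root path, which together cover the rest of the sequence at fine-to-coarse granularity:

```python
def build_tree(lo, hi, nodes, parent=None):
    """Recursively binary-partition the interval [lo, hi) into span nodes."""
    node_id = len(nodes)
    nodes.append({"span": (lo, hi), "parent": parent, "children": []})
    if parent is not None:
        nodes[parent]["children"].append(node_id)
    if hi - lo > 1:
        mid = (lo + hi) // 2
        build_tree(lo, mid, nodes, node_id)
        build_tree(mid, hi, nodes, node_id)
    return node_id

def bpt_edges(n):
    nodes = []
    build_tree(0, n, nodes)
    # Map each token position to its leaf node (spans of length 1).
    leaf_of = {nodes[i]["span"][0]: i for i in range(len(nodes))
               if nodes[i]["span"][1] - nodes[i]["span"][0] == 1}
    affiliated, contextual = [], []
    for i, node in enumerate(nodes):
        lo, hi = node["span"]
        if hi - lo > 1:
            # Affiliated edges: each contained token updates its span node.
            affiliated += [(t, i) for t in range(lo, hi)]
    for t in range(n):
        # Contextual edges: the sibling spans along the leaf-to-root path
        # serve as the context of token t.
        cur = leaf_of[t]
        while nodes[cur]["parent"] is not None:
            p = nodes[cur]["parent"]
            for c in nodes[p]["children"]:
                if c != cur:
                    contextual.append((c, t))
            cur = p
    return nodes, affiliated, contextual

nodes, aff, ctx = bpt_edges(8)
print(len(ctx) // 8)  # 3 context spans per token for n = 8: O(log n), not O(n)
```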
Finally, a relative positional encoding is added to the attention at each layer.
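The exact formula is not reproduced in these notes; a plausible form, assuming the Shaw-style scheme in which a learned embedding of the relation between two nodes is added to the keys, is:

```latex
% Assumption: Shaw-style relative encoding, where r_{ij} is a learned embedding
% of the relation between query node i and key node j (e.g. self, affiliated,
% or the level and side of a context span), and N(i) is the set of neighbors
% of node i in the sparse graph.
e_{ij} = \frac{x_i W^Q \, (x_j W^K + r_{ij})^{\top}}{\sqrt{d}},
\qquad
\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})}
```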
Results
Language model
Document Translation