Arxiv 2019 | BP-Transformer: Modeling Long-Range Context via Binary Partitioning #32

Closed richardbaihe closed 4 years ago

richardbaihe commented 4 years ago

https://github.com/yzh119/BPT

Method

Self-attention in the Transformer can be interpreted as a fully connected graph whose nodes are the input tokens: every token attends to every other token.

*(figure: self-attention as a fully connected graph over tokens)*
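As a quick back-of-the-envelope illustration (mine, not from the paper's code), the fully connected view means the number of attention edges grows quadratically with sequence length:

```python
# Full self-attention viewed as a complete directed graph: every token
# is a node and every ordered token pair is an edge, so a length-n
# sequence costs n^2 attention edges (n here is hypothetical).
n = 512
edges = [(i, j) for i in range(n) for j in range(n)]
assert len(edges) == n ** 2  # 262,144 edges for n = 512
```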

This paper proposes a new edge connection rule:

*(figure: the proposed edge-connection rule)*

By binary partitioning the input sequence, a binary tree can be built as above: the leaves are the tokens and the internal nodes are spans, so a length-n sequence yields 2n-1 nodes in total. Edges are then connected as below, linking each node to its affiliated nodes and context nodes.

*(figures: affiliated-node edges and context-node edges)*
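A rough sketch of the fine-to-coarse idea behind the context edges, reconstructed from the figures rather than taken from the official repo: a token attends to fine-grained (short) spans nearby and exponentially coarser spans farther away. The function name `bpt_context_spans`, the parameter `k` (spans kept per granularity level and side), and the exact span boundaries are my simplifications; the paper's actual rule aligns spans with the binary-partition tree.

```python
def bpt_context_spans(pos, n, k=2):
    """Spans (start, end) the token at `pos` attends to.

    Nearby context is covered by fine (short) spans and distant
    context by exponentially coarser ones, giving O(k * log(n / k))
    context edges per token instead of the O(n) of full attention.
    """
    spans = []
    for direction in (-1, +1):            # left context, then right
        width, cursor = 1, pos + direction
        while 0 <= cursor < n:
            for _ in range(k):            # k spans at this granularity
                if not 0 <= cursor < n:
                    break
                start = cursor if direction == 1 else cursor - width + 1
                spans.append((max(start, 0), min(start + width, n)))
                cursor += direction * width
            width *= 2                    # coarsen: double the span length
    return spans

# A token in the middle of a length-32 sequence attends to 14 spans
# rather than 31 individual tokens:
print(bpt_context_spans(pos=16, n=32))
```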

Finally, a relative position encoding is added to the attention at each layer:

*(figure: relative position encoding in graph self-attention)*
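A minimal sketch of a Shaw-et-al-style relative position bias of the kind used here, where a learned embedding for the relation between query node i and key node j contributes to the attention logits. The shapes and the name `rel_idx` (a lookup table mapping each node pair to a relation id) are made up for illustration; the repo's actual implementation differs.

```python
import torch

n_heads, n_nodes, d, n_rel = 4, 10, 16, 8
q = torch.randn(n_heads, n_nodes, d)                # query vectors per node
k = torch.randn(n_heads, n_nodes, d)                # key vectors per node
rel_emb = torch.nn.Embedding(n_rel, d)              # one vector per relation type
rel_idx = torch.randint(n_rel, (n_nodes, n_nodes))  # relation id per (i, j) pair

r = rel_emb(rel_idx)                                   # (n_nodes, n_nodes, d)
logits = torch.einsum('hid,hjd->hij', q, k)            # content-content term
logits = logits + torch.einsum('hid,ijd->hij', q, r)   # content-position term
attn = (logits / d ** 0.5).softmax(dim=-1)             # scaled attention weights
```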

Results

Language modeling

*(figure: language modeling results)*

Document Translation

*(figure: document translation results)*