
NIPS 2020 | The Depth-to-Width Interplay in Self-Attention #58

Closed: richardbaihe closed this issue 3 years ago

richardbaihe commented 3 years ago

This paper investigates the interplay between depth and width in the Transformer: when the parameter budget is large, adding depth is more beneficial, while for smaller models, adding width matters more.
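To make the trade-off concrete, here is a minimal sketch (not from the paper) of how depth and width compete under a fixed parameter budget, using the common rough estimate of about 12 * d_model^2 non-embedding parameters per Transformer block; the budget value and function name are illustrative assumptions.

```python
# Rough sketch: for a fixed parameter budget, estimate how many Transformer
# blocks fit at different hidden widths. Uses the standard approximation of
# ~12 * d_model^2 parameters per block (attention + feed-forward), ignoring
# embeddings, biases, and layer norms.

def layers_for_budget(total_params: int, d_model: int) -> int:
    """Approximate number of Transformer blocks that fit in a parameter budget."""
    params_per_block = 12 * d_model ** 2
    return total_params // params_per_block

if __name__ == "__main__":
    budget = 100_000_000  # hypothetical 100M non-embedding parameters
    for d_model in (512, 768, 1024, 2048):
        depth = layers_for_budget(budget, d_model)
        print(f"d_model={d_model:5d} -> ~{depth} layers within the budget")
```

Widening from 512 to 2048 cuts the affordable depth by a factor of 16, which is why the paper's question of where extra parameters are best spent (depth vs. width) only has a non-trivial answer once the budget is fixed.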
