
Arxiv 2021 | Shortformer: Better Language Modeling using Shorter Inputs #56


richardbaihe commented 3 years ago

https://arxiv.org/pdf/2012.15832.pdf This paper trains a Transformer LM on WikiText-103 with a two-stage schedule: training starts with a small maximum sequence length and then switches to a larger one, which gives both faster training and better PPL. Besides, they propose position-infused attention (PIA) to encode positions: absolute position embeddings are added to the queries and keys at every layer, but not to the values, so the cached representations carry no position information. By replacing the relative position encoding used in Transformer-XL with PIA, absolute position encoding can be applied to Transformer-XL's cross-segment attention. PIA also introduces no extra params or computation (DeBERTa's encoding method needs more of both). However, it should be noted that the PPL degrades from 18.65 to 19.35 after replacing the Transformer's standard self-attention with PIA alone.
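
To make the PIA idea concrete, here is a minimal PyTorch sketch (not the authors' code; names like `PIASelfAttention`, `d_model`, `max_pos` are made up for illustration). Position embeddings are added to queries and keys only, so the cached segment states stay position-free and can be reused for cross-segment attention:

```python
import torch
import torch.nn as nn

class PIASelfAttention(nn.Module):
    """Sketch of position-infused attention (PIA).

    Absolute position embeddings are added to the queries and keys only,
    never to the values, so cached token states contain no position
    information and can be reused across segments.
    """

    def __init__(self, d_model=512, n_heads=8, max_pos=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.pos_emb = nn.Embedding(max_pos, d_model)

    def forward(self, x, cache=None):
        # x: (batch, cur_len, d_model); cache: (batch, prev_len, d_model) or None
        kv = x if cache is None else torch.cat([cache, x], dim=1)
        # positions span the cached tokens plus the current segment
        pos = self.pos_emb(torch.arange(kv.size(1), device=x.device))
        q = x + pos[-x.size(1):]   # queries: current tokens + their positions
        k = kv + pos               # keys: cached + current tokens + positions
        v = kv                     # values: no positions added
        out, _ = self.attn(q, k, v)
        return out, kv             # kv is position-free, so it can be cached
```

The key point is the return value: because `kv` never had positions added, the same cached states can be attended to from any later segment under fresh absolute positions, which is what lets absolute encoding replace Transformer-XL's relative scheme here.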
