https://arxiv.org/pdf/2012.15832.pdf
This paper proposes training a Transformer LM on WikiText-103 by increasing max_length from small to large (two-stage training), achieving faster training and better perplexity.
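Below is a minimal sketch of such a staged-length schedule in PyTorch. The stage lengths, epoch counts, and the `model`, `train_tokens`, and `train_epoch` names are illustrative assumptions, not the paper's exact recipe.

```python
import torch

def two_stage_training(model, train_tokens: torch.Tensor, train_epoch):
    """Train with short inputs first, then switch to full-length inputs.

    train_tokens: a 1-D tensor holding the tokenized training stream.
    train_epoch:  a hypothetical helper that runs one epoch over segments.
    """
    stages = [
        {"max_length": 128, "epochs": 5},   # stage 1: short inputs, cheap attention
        {"max_length": 3072, "epochs": 5},  # stage 2: full-length inputs
    ]
    for stage in stages:
        # Re-chunk the token stream into segments of the current max_length.
        n = (train_tokens.numel() // stage["max_length"]) * stage["max_length"]
        segments = train_tokens[:n].view(-1, stage["max_length"])
        for _ in range(stage["epochs"]):
            train_epoch(model, segments)
```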
Besides, they propose a Position-Infused Attention (PIA) method to encode position embeddings.
By replacing the relative position encoding used in Transformer-XL with PIA, absolute position encoding can be applied in Transformer-XL's cross-segment attention. Moreover, PIA introduces no additional parameters or computation (DeBERTa's encoding method requires more of both).
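A minimal single-head sketch of the idea follows: position embeddings are added to queries and keys but not to values, so the cached states from the previous segment stay position-free and can be reused. The causal mask, multi-head split, and exact position offsets from the paper are omitted for brevity; all names here are assumptions.

```python
import math
import torch
import torch.nn as nn

class PositionInfusedAttention(nn.Module):
    def __init__(self, d_model: int, max_len: int = 4096):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, x, cache=None):
        # x: (batch, seq, d_model). cache: position-free states of the previous
        # segment; they can be reused because positions are added here, at
        # attention time, rather than baked into the cached vectors.
        ctx = x if cache is None else torch.cat([cache, x], dim=1)
        pos = self.pos_emb(torch.arange(ctx.size(1), device=x.device))
        q = self.q_proj(x + pos[-x.size(1):])  # positions added to queries
        k = self.k_proj(ctx + pos)             # ... and to keys
        v = self.v_proj(ctx)                   # but NOT to values
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v, ctx.detach()          # second output: position-free cache
```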
However, it should be noted that PPL degrades from 18.65 to 19.35 after replacing the Transformer's self-attention with PIA.