Abstract
proposes adding an attentive recurrent network (ARN) alongside the Transformer encoder to leverage the strengths of both attention and recurrent networks
experiments on WMT14 En-De and WMT17 Zh-En demonstrate the effectiveness of the approach
the study reveals that a shallow ARN with a short-cut bridge to the decoder outperforms its deep counterpart
Details
Main Approach
add an additional recurrent encoder on the source side
the recurrent model can be (a) a simple RNN, GRU, or LSTM, or (b) an Attentive Recurrent Network (ARN), where the context representation at each step is generated via attention conditioned on the previous hidden state (see the sketch below)
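A minimal sketch of one ARN step (my own PyTorch-style code, not the authors' implementation; the dot-product attention scoring and GRU transition are assumptions): the previous hidden state queries the source representations to build a context vector, which then updates the hidden state.

```python
import torch
import torch.nn as nn

class ARNCell(nn.Module):
    """One step of an Attentive Recurrent Network (sketch).

    At step t, the previous hidden state h_{t-1} attends over the source
    representations H to produce a context c_t, and a GRU cell computes
    h_t = f(h_{t-1}, c_t). The exact scoring function and transition are
    assumptions, not necessarily the paper's choices.
    """

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)
        self.cell = nn.GRUCell(d_model, d_model)

    def forward(self, h_prev, H):
        # h_prev: (batch, d_model); H: (batch, src_len, d_model)
        q = self.query(h_prev).unsqueeze(1)         # (batch, 1, d_model)
        scores = torch.bmm(q, H.transpose(1, 2))    # (batch, 1, src_len)
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights, H).squeeze(1)  # (batch, d_model)
        return self.cell(context, h_prev)           # h_t

def run_arn(H, cell, num_steps=8):
    """Unroll the ARN for a fixed number of recurrent steps."""
    batch, _, d_model = H.shape
    h = H.new_zeros(batch, d_model)
    states = []
    for _ in range(num_steps):
        h = cell(h, H)
        states.append(h)
    return torch.stack(states, dim=1)  # (batch, num_steps, d_model)
```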
Impact of Components
ablation study on the size of the additional recurrent encoder
a smaller (1-layer) BiARN encoder attached directly to the top of the decoder (the short-cut) outperforms all other configurations
ablation study on the number of recurrent steps in the ARN
~8 seems optimal
ablation study on how to integrate the ARN representation on the decoder side
stacking the additional attention on top outperformed all other integration strategies (see the sketch below)
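A hedged sketch of the "stack" integration as I understand it (module names, normalization placement, and the use of nn.MultiheadAttention are my assumptions): the decoder state attends to the Transformer encoder output as usual, and a second attention over the shallow BiARN output is stacked on top, giving the BiARN a short-cut into the decoder.

```python
import torch.nn as nn

class StackedARNAttention(nn.Module):
    """Decoder-side integration sketch: attend to the Transformer encoder
    output first, then stack a second attention over the BiARN output on
    top. Layer layout and normalization placement are assumptions."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.enc_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.arn_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, dec_state, enc_out, arn_out):
        # dec_state: (batch, tgt_len, d); enc_out: (batch, src_len, d)
        # arn_out:   (batch, arn_steps, d), e.g. from a 1-layer BiARN (short-cut)
        x, _ = self.enc_attn(dec_state, enc_out, enc_out)
        x = self.norm1(dec_state + x)
        y, _ = self.arn_attn(x, arn_out, arn_out)
        return self.norm2(x + y)
```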
Overall Result
with the additional ARN encoder, BLEU scores improve over the Transformer baseline with statistical significance
Linguistic Analysis
which linguistic characteristics are the models learning?
the 1-layer BiARN model performs better on all syntactic tasks and on some semantic tasks (tasks listed below; a minimal probing-classifier sketch follows the list)
List of Linguistic Characteristics
SeLen : predict the length of the sentence
WC : recover which original words occur in the sentence given its sentence representation
TrDep : check whether the encoder infers the hierarchical structure of sentences (predict parse-tree depth)
ToCo : classify sentences in terms of the sequence of their top constituents
BShif : test whether two consecutive tokens in the sentence have been inverted
Tense : predict the tense of the main-clause verb
SubN : predict the number (singular/plural) of the main-clause subject
ObjN : predict the number of the direct object of the main clause
SoMo : check whether a sentence has been modified by replacing a random noun or verb
CoIn : given sentences made of two coordinate clauses, detect whether the clause order has been inverted
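A minimal sketch of how such a probe is trained, assuming frozen, mean-pooled encoder representations and a simple logistic-regression classifier (both assumptions on my part, not necessarily the paper's exact setup):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def probe(train_reprs, train_labels, test_reprs, test_labels):
    """Train a probing classifier on frozen encoder representations.

    train_reprs/test_reprs: (n_sentences, d_model) arrays, e.g. mean-pooled
    encoder states; labels: per-sentence targets for one probing task
    (SeLen bins, Tense, SubN, ...). Higher accuracy suggests the encoder
    representation captures that linguistic property.
    """
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_reprs, train_labels)
    return accuracy_score(test_labels, clf.predict(test_reprs))
```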
Personal Thoughts
Translation requires a complicated encoding function on the source side. The strengths of attention, RNNs, and CNNs can complement each other to produce richer representations.
This paper shows that there is some room for an RNN-style encoder to play a part alongside the Transformer encoder via the short-cut trick, though the improvement is small.
Link : https://arxiv.org/pdf/1904.03092v1.pdf
Authors : Hao et al., 2019