Abstract
Presents an alternative view to explain the success of LSTMs: "the gates themselves are versatile recurrent models that provide more representational power than previously appreciated."
Details
LSTM/GRU were introduced to resolve the vanishing gradient problem of the naive RNN
The authors experiment with various ablated architectures within the LSTM module to show that the gates themselves are powerful recurrent models, providing representational power beyond simply mitigating the vanishing gradient problem
They test these LSTM sub-module architectures on a range of sequential tasks: Language Modeling (PTB), Question Answering (SQuAD), Dependency Parsing (Universal Dependencies English Web Treebank v1.3), and Machine Translation (WMT16 English-German)
LSTM
The sub-components of the LSTM can be outlined as below, where the content layer is Eq 2, the memory cell is Eqs 3~5, and the output layer is Eqs 6~7
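For reference, a sketch of the standard LSTM formulation these equation numbers appear to refer to (weight/bias symbols follow common convention and may differ slightly from the paper's notation):

```latex
\begin{aligned}
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(2) content layer (S-RNN)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(3) input gate} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(4) forget gate} \\
c_t &= i_t \circ \tilde{c}_t + f_t \circ c_{t-1} && \text{(5) memory cell} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(6) output gate} \\
h_t &= o_t \circ \tanh(c_t) && \text{(7) output layer}
\end{aligned}
```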
Models
LSTM - S-RNN : replace the S-RNN in the content layer (Eq 2) with a simple context-independent linear transformation (c̃_t = W_c x_t)
LSTM - S-RNN - OUT : additionally remove the output gate from Eq 7, leaving only the activation function (h_t = tanh(c_t))
LSTM - S-RNN - HIDDEN : each gate is computed from x_t only (the h_{t-1} terms are dropped); this variant can be seen as a type of QRNN or SRU
LSTM - GATES : ablate the gates entirely, isolating the S-RNN (all variants are sketched in the code below)
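To make the ablations concrete, here is a minimal NumPy sketch of a single recurrent step with a variant flag selecting which sub-components are removed; the parameter dictionary p, the weight names, and the flag strings are hypothetical illustrations, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p, variant="LSTM"):
    """One recurrent step; `variant` selects which sub-components are ablated.

    p maps names like "Wc", "Uc", "bc", ... to input weights (d, d_x),
    recurrent weights (d, d), and biases (d,).
    """
    if variant == "LSTM - GATES":
        # Only the S-RNN content layer remains: no gates, no memory cell,
        # so the hidden state doubles as the "cell" in this sketch.
        h_t = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
        return h_t, h_t

    # Content layer (Eq 2); "- S-RNN" replaces it with a linear map of x_t.
    if variant.startswith("LSTM - S-RNN"):
        c_tilde = p["Wc"] @ x_t
    else:
        c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])

    # Gates (Eqs 3, 4, 6); "- HIDDEN" drops the h_{t-1} term from the gates.
    use_hidden = "HIDDEN" not in variant
    def gate(W, U, b):
        pre = W @ x_t + b
        if use_hidden:
            pre = pre + U @ h_prev
        return sigmoid(pre)

    i_t = gate(p["Wi"], p["Ui"], p["bi"])
    f_t = gate(p["Wf"], p["Uf"], p["bf"])

    # Memory cell (Eq 5).
    c_t = i_t * c_tilde + f_t * c_prev

    # Output layer (Eqs 6~7); "- OUT" drops the output gate, keeping tanh.
    if "OUT" in variant:
        h_t = np.tanh(c_t)
    else:
        o_t = gate(p["Wo"], p["Uo"], p["bo"])
        h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

For example, lstm_step(x, h, c, params, variant="LSTM - S-RNN - OUT") runs the variant whose interpretation is discussed below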
Results
Performance degrades most when the GATES are removed
When S-RNN, OUT, or HIDDEN is removed, the performance drop is not significant
Discussion
LSTM - S-RNN - OUT can be interpreted as a weighted sum of context-independent functions of the inputs, showing a link to self-attention (see the unrolled form after the list of differences below)
Three key differences from self-attention
LSTM weights are vectors (applied element-wise), whereas self-attention computes scalar weights
The LSTM's weighted sum is accumulated recurrently via dynamic programming, whereas self-attention is computed over the whole sequence at once
The LSTM's weights are unnormalized and their sum can grow up to the sequence length, whereas attention weights are normalized and have a probabilistic interpretation
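A short derivation makes the weighted-sum view explicit: unrolling Eq 5 with c_0 = 0 (the weight symbol w_j^t below is introduced here for illustration) gives

```latex
\begin{aligned}
c_t &= \sum_{j=1}^{t} \Big( i_j \circ \prod_{k=j+1}^{t} f_k \Big) \circ \tilde{c}_j
     = \sum_{j=1}^{t} w_j^t \circ (W_c x_j), \\
h_t &= \tanh(c_t) \qquad \text{(LSTM - S-RNN - OUT)}
\end{aligned}
```

Since c̃_j = W_c x_j is context-independent in this variant, c_t is an element-wise weighted sum of functions of the inputs, with weights w_j^t = i_j ∘ ∏ f_k computed dynamically by the gates; this is the form the three differences above refer to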
Personal Thoughts
The detailed ablation study across a wide range of sequential tasks provides solid evidence for their claims
The interpretation of the gates as a weighted sum and the link to self-attention were surprising
Link: https://arxiv.org/pdf/1805.03716.pdf, Authors: Levy et al., 2018