Abstract
Presents an alternative view to explain the success of LSTMs: "the gates themselves are versatile recurrent models that provide more representational power than previously appreciated."
Details
LSTM/GRU were introduced to resolve the vanishing gradient problem of the naive RNN
The authors experiment with various ablated architectures within the LSTM module to show that the gates themselves are powerful recurrent models, providing representational power beyond simply mitigating the vanishing gradient problem
They test these LSTM sub-module architectures on a range of sequential tasks: Language Modeling (PTB), Question Answering (SQuAD), Dependency Parsing (Universal Dependencies English Web Treebank v1.3), and Machine Translation (WMT16 English-German)
LSTM
The sub-components of the LSTM can be outlined as below, where the content layer is Eq 2, the memory cell is Eqs 3~5, and the output layer is Eqs 6~7
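For reference, a sketch of the standard LSTM formulation these equation numbers appear to refer to (weight/bias symbols follow common convention and may differ slightly from the paper's notation):

```latex
\begin{aligned}
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(2) content layer (S-RNN)} \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(3) input gate} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(4) forget gate} \\
c_t &= i_t \circ \tilde{c}_t + f_t \circ c_{t-1} && \text{(5) memory cell} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(6) output gate} \\
h_t &= o_t \circ \tanh(c_t) && \text{(7) output layer}
\end{aligned}
```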
Models
LSTM - S-RNN : replace the S-RNN in the content layer (Eq 2) with a simple context-independent linear transformation (c̃_t = W_c x_t)
LSTM - S-RNN - OUT : additionally remove the output gate from Eq 7, leaving only the activation function (h_t = tanh(c_t))
LSTM - S-RNN - HIDDEN : each gate is computed from x_t only (the h_{t-1} terms are dropped); this variant can be seen as a type of QRNN or SRU
LSTM - GATES : ablate the gates entirely, isolating the S-RNN (all variants are sketched in the code below)
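To make the ablations concrete, here is a minimal NumPy sketch of a single recurrent step with a variant flag selecting which sub-components are removed; the parameter dictionary p, the weight names, and the flag strings are hypothetical illustrations, not the authors' code:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, p, variant="LSTM"):
    """One recurrent step; `variant` selects which sub-components are ablated.

    p maps names like "Wc", "Uc", "bc", ... to input weights (d, d_x),
    recurrent weights (d, d), and biases (d,).
    """
    if variant == "LSTM - GATES":
        # Only the S-RNN content layer remains: no gates, no memory cell,
        # so the hidden state doubles as the "cell" in this sketch.
        h_t = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])
        return h_t, h_t

    # Content layer (Eq 2); "- S-RNN" replaces it with a linear map of x_t.
    if variant.startswith("LSTM - S-RNN"):
        c_tilde = p["Wc"] @ x_t
    else:
        c_tilde = np.tanh(p["Wc"] @ x_t + p["Uc"] @ h_prev + p["bc"])

    # Gates (Eqs 3, 4, 6); "- HIDDEN" drops the h_{t-1} term from the gates.
    use_hidden = "HIDDEN" not in variant
    def gate(W, U, b):
        pre = W @ x_t + b
        if use_hidden:
            pre = pre + U @ h_prev
        return sigmoid(pre)

    i_t = gate(p["Wi"], p["Ui"], p["bi"])
    f_t = gate(p["Wf"], p["Uf"], p["bf"])

    # Memory cell (Eq 5).
    c_t = i_t * c_tilde + f_t * c_prev

    # Output layer (Eqs 6~7); "- OUT" drops the output gate, keeping tanh.
    if "OUT" in variant:
        h_t = np.tanh(c_t)
    else:
        o_t = gate(p["Wo"], p["Uo"], p["bo"])
        h_t = o_t * np.tanh(c_t)
    return h_t, c_t
```

For example, lstm_step(x, h, c, params, variant="LSTM - S-RNN - OUT") runs the variant whose interpretation is discussed below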
Results
Performance degrades most when the GATES are removed
When S-RNN, OUT, or HIDDEN is removed, the performance drop is not significant
Discussion
LSTM - S-RNN - OUT can be interpreted as a weighted sum of context-independent functions of the inputs, showing a link to self-attention (see the unrolled form after the list of differences below)
Three key differences from self-attention
LSTM weights are vectors (applied element-wise), whereas self-attention computes scalar weights
The LSTM's weighted sum is accumulated recurrently via dynamic programming, whereas self-attention is computed over the whole sequence at once
The LSTM's weights are unnormalized and their sum can grow up to the sequence length, whereas attention weights are normalized and have a probabilistic interpretation
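A short derivation makes the weighted-sum view explicit: unrolling Eq 5 with c_0 = 0 (the weight symbol w_j^t below is introduced here for illustration) gives

```latex
\begin{aligned}
c_t &= \sum_{j=1}^{t} \Big( i_j \circ \prod_{k=j+1}^{t} f_k \Big) \circ \tilde{c}_j
     = \sum_{j=1}^{t} w_j^t \circ (W_c x_j), \\
h_t &= \tanh(c_t) \qquad \text{(LSTM - S-RNN - OUT)}
\end{aligned}
```

Since c̃_j = W_c x_j is context-independent in this variant, c_t is an element-wise weighted sum of functions of the inputs, with weights w_j^t = i_j ∘ ∏ f_k computed dynamically by the gates; this is the form the three differences above refer to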
Personal Thoughts
The detailed ablation study across a wide range of sequential tasks provides solid evidence for their claims
The interpretation of the gates as a weighted sum and the link to self-attention were surprising
Link: https://arxiv.org/pdf/1805.03716.pdf, Authors: Levy et al., 2018