Raise a question : "Why is RNN always the 'go-to' architecture for sequence modeling? CNN could also be considered a default"
Propose a simple generic Temporal Convolution Network (TCN) composed with dilations and residual connections
Empirically show that TCN outperforms baseline RNNs (LSTM, GRU, vanilla RNN) on a variety of sequence modeling tasks. RNNs perform better only when specialized regularization, modules, or training techniques are applied on top.
RNNs can have "infinite memory" in theory, but TCNs exhibit a longer effective history size in practice
Details
Introduction
Briefly mentions the history of RNN and CNN
CNN
[Hinton 1989, LeCun 1995] early convolutional models applying 1-D conv filters over sequence data
[Oord 2016, Gehring 2017] recently, CNNs have been applied to sequence modeling, e.g., WaveNet for audio/speech synthesis and ConvS2S for machine translation, with SoTA performance
RNN
[Elman 1990] early work on sequence modeling was dominated by RNNs due to their theoretical capability of infinite memory
[Hochreiter & Schmidhuber 1997] LSTM further pushed the dominance of RNNs by mitigating the vanishing gradient problem
[Cho 2014] GRU followed as a simpler gated alternative
Convolutional Sequence Modeling
Characteristics of TCN
takes a variable-length input and produces an output sequence of the same length
convolutions are causal : no information leaks from the future to the past
Dilations
Residual
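The two building blocks above (dilated causal convolutions and residual connections) can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: left-padding by (k-1)*dilation keeps the output causal and length-preserving, and a residual block stacks two such convolutions with a skip connection (weight normalization and dropout from the paper are omitted).

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Causal 1-D convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...
    x: (T,) input sequence; w: (k,) filter taps, w[-1] applied to the current step."""
    k = len(w)
    pad = (k - 1) * dilation                 # left-pad so output length == input length
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):                   # taps spaced `dilation` steps apart
            y[t] += w[i] * xp[t + i * dilation]
    return y

def residual_block(x, w1, w2, dilation):
    """One TCN-style residual block (sketch): two dilated causal convs + skip connection."""
    h = np.maximum(causal_dilated_conv1d(x, w1, dilation), 0.0)  # ReLU
    h = causal_dilated_conv1d(h, w2, dilation)
    return x + h                                                  # residual add
```

A quick sanity check of causality: perturbing the last input element changes only the last output, never earlier ones.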
Pros and Cons of TCN
Pros : Parallelism, Flexible Receptive Field Size, Stable Gradient, Low Memory for Training
Cons : Data Storage in Evaluation, Potential Parameter Change for a Transfer of Domain
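The "Flexible Receptive Field Size" pro comes from the dilation schedule: with kernel size k and dilations doubling per layer (1, 2, 4, ...), the receptive field grows exponentially with depth. A quick back-of-envelope check, assuming one convolution per layer (the paper's blocks use two, which roughly doubles the contribution per block):

```python
def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked dilated causal convs with dilations 1, 2, ..., 2^(L-1)."""
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(num_layers))
```

For example, kernel size 3 with 8 layers already covers 511 time steps, which is why a shallow TCN can match histories that an RNN must carry step by step.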
Experiments and Results
Tasks
Adding Problem : 2-D input sequence (a row of random values plus a row marking two positions), predicting the sum of the two marked values
Sequential MNIST and P-MNIST : images presented in 784 x 1 sequence, predicting the digit
Copy Memory Task : recall and reproduce a 10-digit sequence seen at the start, after a long stretch of filler inputs
PennTreebank : char-, word-level language modeling
Wikitext-103 : 110x larger than PTB for language modeling
LAMBADA : textual-understanding benchmark, predicting the last word of a passage, which requires broad context
Result : mixed results overall; gating proves effective for language modeling
Personal Thoughts
Reviews from OpenReview
Choice of data is important : one reviewer noted that certain datasets (Sequential MNIST, P-MNIST) were designed specifically to expose weaknesses of RNNs. Drawing conclusions from such datasets requires understanding where they come from.
Related work must be thorough : reviewers were harsh on the related-work section, questioning whether relevant prior work was cited.
Good empirical paper, with thorough experiments showing that a generic TCN can serve as a sequence model. It would have been interesting to see the model applied to machine translation as well.
Link : https://openreview.net/pdf?id=rk8wKk-R- Authors : Bai et al. 2017