Raise a question : "Why is RNN always the 'go-to' architecture for sequence modeling? CNN could also be considered a default"
Propose a simple generic Temporal Convolution Network (TCN) composed with dilations and residual connections
Empirically show that TCN outperforms baseline RNNs (LSTM, GRU, vanilla RNN) on a variety of sequence modeling tasks. RNNs perform better only when specialized regularization, modules, or training techniques are applied on top.
RNNs can have "infinite memory" in theory, but TCNs exhibit a longer effective history size in practice
Details
Introduction
Briefly mentions the history of RNN and CNN
CNN
[Hinton 1989, LeCun 1995] early convolutional models applying 1-D conv filters over sequence data
[Oord 2016, Gehring 2017] recently, CNNs have been applied to sequence modeling, e.g., WaveNet for audio/speech synthesis and ConvS2S for machine translation, with SoTA performance
RNN
[Elman 1990] early work on sequence modeling was dominated by RNNs due to their theoretical capability of infinite memory
[Hochreiter & Schmidhuber 1997] LSTM further pushed the dominance of RNNs by mitigating the vanishing gradient problem
[Cho 2014] GRU followed as a simpler gated alternative
Convolutional Sequence Modeling
Characteristics of TCN
takes a variable-length input and produces an output sequence of the same length
convolutions are causal : no information leaks from the future to the past
Dilations
Residual
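The two building blocks above (dilated causal convolutions and residual connections) can be sketched in plain NumPy. This is a minimal illustration, not the paper's implementation: left-padding by (k-1)*dilation keeps the output causal and length-preserving, and a residual block stacks two such convolutions with a skip connection (weight normalization and dropout from the paper are omitted).

```python
import numpy as np

def causal_dilated_conv1d(x, w, dilation=1):
    """Causal 1-D convolution: output[t] depends only on x[t], x[t-d], x[t-2d], ...
    x: (T,) input sequence; w: (k,) filter taps, w[-1] applied to the current step."""
    k = len(w)
    pad = (k - 1) * dilation                 # left-pad so output length == input length
    xp = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):                   # taps spaced `dilation` steps apart
            y[t] += w[i] * xp[t + i * dilation]
    return y

def residual_block(x, w1, w2, dilation):
    """One TCN-style residual block (sketch): two dilated causal convs + skip connection."""
    h = np.maximum(causal_dilated_conv1d(x, w1, dilation), 0.0)  # ReLU
    h = causal_dilated_conv1d(h, w2, dilation)
    return x + h                                                  # residual add
```

A quick sanity check of causality: perturbing the last input element changes only the last output, never earlier ones.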
Pros and Cons of TCN
Pros : Parallelism, Flexible Receptive Field Size, Stable Gradient, Low Memory for Training
Cons : Data Storage in Evaluation, Potential Parameter Change for a Transfer of Domain
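The "Flexible Receptive Field Size" pro comes from the dilation schedule: with kernel size k and dilations doubling per layer (1, 2, 4, ...), the receptive field grows exponentially with depth. A quick back-of-envelope check, assuming one convolution per layer (the paper's blocks use two, which roughly doubles the contribution per block):

```python
def receptive_field(kernel_size, num_layers):
    """Receptive field of stacked dilated causal convs with dilations 1, 2, ..., 2^(L-1)."""
    return 1 + (kernel_size - 1) * sum(2 ** i for i in range(num_layers))
```

For example, kernel size 3 with 8 layers already covers 511 time steps, which is why a shallow TCN can match histories that an RNN must carry step by step.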
Experiments and Results
Tasks
Adding Problem : 2-D input sequence (a row of random values plus a row marking two positions), predicting the sum of the two marked values
Sequential MNIST and P-MNIST : images presented in 784 x 1 sequence, predicting the digit
Copy Memory Task : recall and reproduce a 10-digit sequence seen at the start, after a long stretch of filler inputs
PennTreebank : char-, word-level language modeling
Wikitext-103 : 110x larger than PTB for language modeling
LAMBADA : textual-understanding benchmark, predicting the last word of a passage, which requires broad context
Result : mixed results overall; gating proves effective for language modeling
Personal Thoughts
Reviews from OpenReview
Choice of data is important : one reviewer noted that certain datasets (Sequential MNIST, P-MNIST) were designed specifically to expose weaknesses of RNNs. Drawing conclusions from such datasets requires understanding where they come from.
Related work must be thorough : reviewers were harsh on the related-work section, questioning whether relevant prior work was cited.
Good empirical paper, with thorough experiments showing that a generic TCN can serve as a sequence model. It would have been interesting to see the model applied to machine translation as well.
Link : https://openreview.net/pdf?id=rk8wKk-R- Authors : Bai et al. 2017