Abstract -
Introduction -
Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality.
Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems.
On the WMT'14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 384M parameters and 8,000 dimensional state each) using a simple left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized small-vocabulary neural network architecture which has much room for improvement outperforms a phrase-based SMT system.
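For intuition, here is a minimal sketch of the kind of left-to-right beam search mentioned above. The `log_probs` interface, beam size, and EOS handling are illustrative assumptions, not details taken from the paper:

```python
import heapq

def beam_search(log_probs, eos_token, beam_size=5, max_len=50):
    """Left-to-right beam search.

    log_probs(prefix) -> dict mapping next token -> log-probability
    (a hypothetical stand-in for the decoder LSTM's softmax output).
    Returns the highest-scoring complete hypothesis (score, tokens).
    """
    beam = [(0.0, [])]          # each hypothesis: (cumulative log-prob, token list)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for token, lp in log_probs(prefix).items():
                candidates.append((score + lp, prefix + [token]))
        # Keep only the beam_size best partial hypotheses.
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        # Hypotheses ending in EOS are finished; move them out of the beam.
        completed += [c for c in beam if c[1][-1] == eos_token]
        beam = [c for c in beam if c[1][-1] != eos_token]
        if not beam:
            break
    return max(completed or beam, key=lambda c: c[0])
```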
Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.
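As a rough illustration of the reversal trick, this is the data transformation applied to a (source, target) pair; the function name is mine, and real preprocessing would of course operate on tokenized corpora:

```python
def reverse_source(pair):
    """Reverse the source-side token order; leave the target untouched."""
    src, tgt = pair
    return src[::-1], tgt

# ("a b c" -> "alpha beta gamma") becomes ("c b a" -> "alpha beta gamma")
print(reverse_source((["a", "b", "c"], ["alpha", "beta", "gamma"])))
```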
The Recurrent Neural Network (RNN) [31, 28] is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs (x_1, ..., x_T), a standard RNN computes a sequence of outputs (y_1, ..., y_T) by iterating the following equations:

$$h_t = \mathrm{sigm}(W^{hx} x_t + W^{hh} h_{t-1}), \qquad y_t = W^{yh} h_t$$

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated and non-monotonic relationships.
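A minimal NumPy sketch of that recurrence, with toy weight shapes chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_hx, W_hh, W_yh):
    """Standard RNN recurrence over a list of input vectors xs:
       h_t = sigm(W_hx x_t + W_hh h_{t-1}),  y_t = W_yh h_t."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:
        h = sigmoid(W_hx @ x + W_hh @ h)
        ys.append(W_yh @ h)
    return ys

# Toy dimensions: 3-dim inputs, 4-dim hidden state, 2-dim outputs.
rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]
ys = rnn_forward(xs,
                 rng.standard_normal((4, 3)),
                 rng.standard_normal((4, 4)),
                 rng.standard_normal((2, 4)))
```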
The goal of the LSTM is to estimate the conditional probability p(y_1, ..., y_{T'} | x_1, ..., x_T), where (x_1, ..., x_T) is an input sequence and y_1, ..., y_{T'} is its corresponding output sequence whose length T' may differ from T. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x_1, ..., x_T), given by the last hidden state of the LSTM, and then computing the probability of y_1, ..., y_{T'} with a standard LSTM-LM formulation whose initial hidden state is set to the representation v of x_1, ..., x_T:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})$$
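One way to read this factorization is as a running sum of per-step log-probabilities. The sketch below assumes hypothetical `encode` and `decoder_step` callables standing in for the encoder and decoder LSTMs:

```python
def sequence_log_prob(encode, decoder_step, xs, ys):
    """Score a target sequence under
       p(y_1..y_T' | x_1..x_T) = prod_t p(y_t | v, y_1..y_{t-1}).

    encode(xs) -> v, the fixed-dimensional summary of the source
                  (e.g. the encoder LSTM's final hidden state).
    decoder_step(state, y_prev) -> (dict of log-probs over the vocabulary,
                                    next state).
    Both callables are placeholders for the actual networks.
    """
    state = encode(xs)            # decoder starts from the representation v
    total, y_prev = 0.0, "<bos>"
    for y in ys:
        log_probs, state = decoder_step(state, y_prev)
        total += log_probs[y]     # log p(y_t | v, y_1..y_{t-1})
        y_prev = y
    return total
```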
Our actual models differ from the above description in three important ways. First, we used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β, and so on, a fact that makes it easy for SGD to "establish communication" between the input and the output. We found this simple data transformation to greatly improve the performance of the LSTM.
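A rough PyTorch sketch of this setup, assuming separate encoder and decoder LSTMs, four layers, and reversed source order; the embedding and hidden sizes are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder in the spirit of the paper: two distinct
    multi-layer LSTMs, with the source fed in reversed order and the
    encoder's final state initializing the decoder."""

    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Reverse the source word order (the trick described above).
        src_rev = torch.flip(src_ids, dims=[1])
        _, (h, c) = self.encoder(self.src_emb(src_rev))
        # The encoder's final state plays the role of the representation v.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)   # logits over the target vocabulary
```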
3.3 Reversing the Source Sentences
3.4 Training details
3.5 Parallelization
https://arxiv.org/pdf/1409.3215.pdf