Abstract -
Introduction -
Despite their flexibility and power, DNNs can only be applied to problems whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality.
Sequences pose a challenge for DNNs because they require that the dimensionality of the inputs and outputs is known and fixed. In this paper, we show that a straightforward application of the Long Short-Term Memory (LSTM) architecture [16] can solve general sequence to sequence problems.
On the WMT'14 English to French translation task, we obtained a BLEU score of 34.81 by directly extracting translations from an ensemble of 5 deep LSTMs (with 384M parameters and 8,000 dimensional state each) using a simple left-to-right beam-search decoder. This is by far the best result achieved by direct translation with large neural networks. For comparison, the BLEU score of an SMT baseline on this dataset is 33.30 [29]. The 34.81 BLEU score was achieved by an LSTM with a vocabulary of 80k words, so the score was penalized whenever the reference translation contained a word not covered by these 80k. This result shows that a relatively unoptimized small-vocabulary neural network architecture which has much room for improvement outperforms a phrase-based SMT system.
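For intuition, here is a minimal sketch of the kind of left-to-right beam search mentioned above. The `log_probs` interface, beam size, and EOS handling are illustrative assumptions, not details taken from the paper:

```python
import heapq

def beam_search(log_probs, eos_token, beam_size=5, max_len=50):
    """Left-to-right beam search.

    log_probs(prefix) -> dict mapping next token -> log-probability
    (a hypothetical stand-in for the decoder LSTM's softmax output).
    Returns the highest-scoring complete hypothesis (score, tokens).
    """
    beam = [(0.0, [])]          # each hypothesis: (cumulative log-prob, token list)
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, prefix in beam:
            for token, lp in log_probs(prefix).items():
                candidates.append((score + lp, prefix + [token]))
        # Keep only the beam_size best partial hypotheses.
        beam = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
        # Hypotheses ending in EOS are finished; move them out of the beam.
        completed += [c for c in beam if c[1][-1] == eos_token]
        beam = [c for c in beam if c[1][-1] != eos_token]
        if not beam:
            break
    return max(completed or beam, key=lambda c: c[0])
```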
Surprisingly, the LSTM did not suffer on very long sentences, despite the recent experience of other researchers with related architectures [26]. We were able to do well on long sentences because we reversed the order of words in the source sentence but not the target sentences in the training and test set. By doing so, we introduced many short term dependencies that made the optimization problem much simpler (see sec. 2 and 3.3). As a result, SGD could learn LSTMs that had no trouble with long sentences. The simple trick of reversing the words in the source sentence is one of the key technical contributions of this work.
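As a rough illustration of the reversal trick, this is the data transformation applied to a (source, target) pair; the function name is mine, and real preprocessing would of course operate on tokenized corpora:

```python
def reverse_source(pair):
    """Reverse the source-side token order; leave the target untouched."""
    src, tgt = pair
    return src[::-1], tgt

# ("a b c" -> "alpha beta gamma") becomes ("c b a" -> "alpha beta gamma")
print(reverse_source((["a", "b", "c"], ["alpha", "beta", "gamma"])))
```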
The Recurrent Neural Network (RNN) [31, 28] is a natural generalization of feedforward neural networks to sequences. Given a sequence of inputs (x_1, ..., x_T), a standard RNN computes a sequence of outputs (y_1, ..., y_T) by iterating the following equations:

$$h_t = \mathrm{sigm}(W^{hx} x_t + W^{hh} h_{t-1}), \qquad y_t = W^{yh} h_t$$

The RNN can easily map sequences to sequences whenever the alignment between the inputs and the outputs is known ahead of time. However, it is not clear how to apply an RNN to problems whose input and output sequences have different lengths with complicated and non-monotonic relationships.
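A minimal NumPy sketch of that recurrence, with toy weight shapes chosen purely for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rnn_forward(xs, W_hx, W_hh, W_yh):
    """Standard RNN recurrence over a list of input vectors xs:
       h_t = sigm(W_hx x_t + W_hh h_{t-1}),  y_t = W_yh h_t."""
    h = np.zeros(W_hh.shape[0])
    ys = []
    for x in xs:
        h = sigmoid(W_hx @ x + W_hh @ h)
        ys.append(W_yh @ h)
    return ys

# Toy dimensions: 3-dim inputs, 4-dim hidden state, 2-dim outputs.
rng = np.random.default_rng(0)
xs = [rng.standard_normal(3) for _ in range(5)]
ys = rnn_forward(xs,
                 rng.standard_normal((4, 3)),
                 rng.standard_normal((4, 4)),
                 rng.standard_normal((2, 4)))
```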
The goal of the LSTM is to estimate the conditional probability p(y_1, ..., y_{T'} | x_1, ..., x_T), where (x_1, ..., x_T) is an input sequence and y_1, ..., y_{T'} is its corresponding output sequence whose length T' may differ from T. The LSTM computes this conditional probability by first obtaining the fixed-dimensional representation v of the input sequence (x_1, ..., x_T), given by the last hidden state of the LSTM, and then computing the probability of y_1, ..., y_{T'} with a standard LSTM-LM formulation whose initial hidden state is set to the representation v of x_1, ..., x_T:

$$p(y_1, \ldots, y_{T'} \mid x_1, \ldots, x_T) = \prod_{t=1}^{T'} p(y_t \mid v, y_1, \ldots, y_{t-1})$$
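One way to read this factorization is as a running sum of per-step log-probabilities. The sketch below assumes hypothetical `encode` and `decoder_step` callables standing in for the encoder and decoder LSTMs:

```python
def sequence_log_prob(encode, decoder_step, xs, ys):
    """Score a target sequence under
       p(y_1..y_T' | x_1..x_T) = prod_t p(y_t | v, y_1..y_{t-1}).

    encode(xs) -> v, the fixed-dimensional summary of the source
                  (e.g. the encoder LSTM's final hidden state).
    decoder_step(state, y_prev) -> (dict of log-probs over the vocabulary,
                                    next state).
    Both callables are placeholders for the actual networks.
    """
    state = encode(xs)            # decoder starts from the representation v
    total, y_prev = 0.0, "<bos>"
    for y in ys:
        log_probs, state = decoder_step(state, y_prev)
        total += log_probs[y]     # log p(y_t | v, y_1..y_{t-1})
        y_prev = y
    return total
```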
Our actual models differ from the above description in three important ways. First, we used two different LSTMs: one for the input sequence and another for the output sequence, because doing so increases the number of model parameters at negligible computational cost and makes it natural to train the LSTM on multiple language pairs simultaneously [18]. Second, we found that deep LSTMs significantly outperformed shallow LSTMs, so we chose an LSTM with four layers. Third, we found it extremely valuable to reverse the order of the words of the input sentence. So for example, instead of mapping the sentence a, b, c to the sentence α, β, γ, the LSTM is asked to map c, b, a to α, β, γ, where α, β, γ is the translation of a, b, c. This way, a is in close proximity to α, b is fairly close to β, and so on, a fact that makes it easy for SGD to "establish communication" between the input and the output. We found this simple data transformation to greatly improve the performance of the LSTM.
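A rough PyTorch sketch of this setup, assuming separate encoder and decoder LSTMs, four layers, and reversed source order; the embedding and hidden sizes are placeholders, not the paper's configuration:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Minimal encoder-decoder in the spirit of the paper: two distinct
    multi-layer LSTMs, with the source fed in reversed order and the
    encoder's final state initializing the decoder."""

    def __init__(self, src_vocab, tgt_vocab, emb=256, hidden=512, layers=4):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        # Reverse the source word order (the trick described above).
        src_rev = torch.flip(src_ids, dims=[1])
        _, (h, c) = self.encoder(self.src_emb(src_rev))
        # The encoder's final state plays the role of the representation v.
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), (h, c))
        return self.out(dec_out)   # logits over the target vocabulary
```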
3.3 Reversing the Source Sentences
3.4 Training details
3.5 Parallelization
https://arxiv.org/pdf/1409.3215.pdf