neulab / xnmt

eXtensible Neural Machine Translation

Transformer Implementation Improvements #320

Open neubig opened 6 years ago

neubig commented 6 years ago

There are a few improvements that could be made to the transformer implementation:

  1. We have not confirmed that it gets competitive performance with the original implementation, nor created a recipe file.
  2. It's still a bit slow.
  3. It is somewhat separate from the other parts of xnmt; it would be better to integrate it using the standard "translator" class, etc.

Edit: Based on @DevSinghSachan's suggestions, I've summarized the things to be fixed, which will probably be addressed in multiple PRs.

Here is a checklist of things to be finished for speed improvements in training:

Here is a checklist of things to be finished to fix decoding:

Here is a checklist of things that need to be implemented (either in code or in the yaml recipe) for accuracy:

DevSinghSachan commented 6 years ago

Here are a few improvements I can think of to make the Transformer in xnmt competitive with the implementations in Tensor2Tensor or PyTorch. Based on my experience developing the model, I will group them into two categories: speed improvements and model improvements.

Speed Improvements: (a) Support for a zero embedding (a vector of all zeros) at the padding index in DyNet. Currently, to achieve this, we look up each word individually so that we can substitute a tensor of zeros at the padding index. If this feature were supported, we could do a single batched lookup of all words in a minibatch, which should improve speed by 1000-2000 wps. I have also tried using external masks to achieve this, but they slow things down even further.
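A minimal sketch of the two lookup strategies, assuming a DyNet `LookupParameters` object `embeddings`, an illustrative embedding dimension, and a hypothetical padding index `PAD`:

```python
import numpy as np
import dynet as dy

PAD = 0          # hypothetical padding index
EMB_DIM = 512    # assumed embedding dimension

def lookup_with_zero_padding(embeddings, words):
    """Current workaround: look up each position of one sentence separately,
    substituting an all-zero vector at the padding index."""
    zero_vec = dy.inputTensor(np.zeros(EMB_DIM))
    cols = [zero_vec if w == PAD else embeddings[w] for w in words]
    return dy.concatenate_cols(cols)            # (EMB_DIM, sentence_length)

def lookup_batched(embeddings, words_at_position):
    """Desired fast path: one batched lookup per position across the whole
    minibatch; needs DyNet to return zeros for the padding index."""
    return dy.lookup_batch(embeddings, words_at_position)
```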

(b) A better strategy for adding position encodings: currently we have to call dynet.inputTensor() to add the position encodings at every minibatch step. If we could initialize a variable that does not require gradients, so that we don't need to convert the numpy array to a tensor every time, this could improve speed by 500-1000 wps. I tried using a Variable, but this leads to errors such as stale variables after the first step.
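For reference, a sketch of the current pattern: the sinusoidal table is precomputed once in numpy, so the only per-step cost is the dynet.inputTensor() conversion we would like to avoid (the sizes and names here are illustrative):

```python
import numpy as np
import dynet as dy

def sinusoid_table(max_len, d_model):
    """Precompute the sinusoidal position encodings once, as a numpy array."""
    pos = np.arange(max_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    table = np.zeros((max_len, d_model))
    table[:, 0::2] = np.sin(angles[:, 0::2])
    table[:, 1::2] = np.cos(angles[:, 1::2])
    return table.T                      # (d_model, max_len)

POS_TABLE = sinusoid_table(1024, 512)   # assumed max length / model size

def add_position_encoding(word_embs, seq_len):
    """word_embs: a (d_model, seq_len) expression for one minibatch.
    The numpy->tensor conversion below happens at every step."""
    pe = dy.inputTensor(POS_TABLE[:, :seq_len])
    return word_embs + pe
```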

(c) Multi-Head Attention: I think the BLEU improvement from multiple heads (8), as reported in the Transformer paper, was about 1.0 point on the WMT En-De dev set. To improve speed, I think that if "dy.pick_batch_elems" also accepted a range argument, similar to "dy.pickrange", we could see a speed-up. I may be missing something here, as I am not very familiar with how these functions are implemented internally in DyNet. There is also the option of a different implementation of dot-product Transformer attention.
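For context, a sketch of scaled dot-product attention for a single head in DyNet (names are illustrative); splitting and merging the heads is where range-style batch picking could help:

```python
import dynet as dy

def dot_product_attention(Q, K, V, d_k, dropout_p=0.0):
    """Q: (d_k, len_q), K/V: (d_k, len_k) DyNet expressions for one head.
    Returns the attended values with shape (d_k, len_q)."""
    scores = dy.transpose(K) * Q * (1.0 / (d_k ** 0.5))   # (len_k, len_q)
    weights = dy.softmax(scores)        # column-wise softmax over len_k
    if dropout_p > 0.0:
        weights = dy.dropout(weights, dropout_p)
    return V * weights                  # (d_k, len_q)
```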

(d) Support for batched decoding: I believe xnmt currently only supports decoding with batch size 1, with optional length normalization and beam search. Support for batched decoding with the same features would be very helpful for quick evaluation. For the Transformer, we can also cache the encoder states during decoding; caching the decoder states is more complicated, though.

Model Improvements: (a) Use of separate dropout parameters: embedding_dropout: 0.1, attentional_dropout: 0.1, ffn_linear_dropout: 0.1, and residual_dropout: 0.1. We currently have all of these except ffn_linear_dropout, which needs to be added after the ReLU activation, but their values are all controlled by a single dropout parameter.
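A sketch of the position-wise feed-forward sublayer showing where the missing ffn_linear_dropout would go (after the ReLU); W1, b1, W2, b2 are illustrative DyNet parameter expressions:

```python
import dynet as dy

def position_wise_ffn(x, W1, b1, W2, b2, ffn_linear_dropout=0.1, train=True):
    """x: a (d_model,) expression for one position (applied position-wise)."""
    h = dy.rectify(W1 * x + b1)                 # inner layer + ReLU
    if train and ffn_linear_dropout > 0.0:
        h = dy.dropout(h, ffn_linear_dropout)   # the dropout discussed above
    return W2 * h + b2
```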

(b) Proper placement of LayerNorm: LayerNorm should come at the start of each sublayer, so that the residual connection runs from the input of the LayerNorm to the output of the sublayer. Also, once all the encoder and decoder layer features are computed, we apply a LayerNorm once more; afterwards, the encoder features are used for attention and the decoder features for loss calculation/decoding. This is important for proper optimization of the parameters as well.
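A sketch of this pre-norm arrangement, with `layer_norm` and `sublayer` as generic callables (e.g. self-attention or the feed-forward network); the names and dropout hook are illustrative:

```python
def prenorm_residual(x, sublayer, layer_norm, residual_dropout=0.1, dropout_fn=None):
    """Pre-norm: normalize the input, run the sublayer, then add the residual.
    The residual connection goes from the un-normalized input to the sublayer output."""
    y = sublayer(layer_norm(x))
    if dropout_fn is not None:
        y = dropout_fn(y, residual_dropout)
    return x + y

# After the final encoder/decoder layer, one more LayerNorm is applied before
# the output is used for attention or for loss calculation/decoding:
# encoder_out = layer_norm(encoder_states)
```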

(c) Initialization of the embedding layer: The embedding layer needs to be initialized with a truncated normal distribution N(0, 1/sqrt(d)), where the truncation is generally +/- 2 standard deviations from the mean and 'd' is the dimension of the embedding layer (for example, 512). This is a default initializer in TensorFlow. I also think this is why the authors of the paper multiply by a factor of sqrt(d) after the embedding lookup. This may not have much effect on the final metrics, though; it is mainly for correctness.
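A numpy sketch of such a truncated-normal initializer, assuming standard deviation 1/sqrt(d) and resampling of values that fall outside two standard deviations (function and variable names are illustrative):

```python
import numpy as np

def truncated_normal_embeddings(vocab_size, d, rng=np.random):
    """Sample N(0, 1/sqrt(d)) and resample anything outside +/- 2 std devs."""
    std = d ** -0.5
    w = rng.normal(0.0, std, size=(vocab_size, d))
    out_of_range = np.abs(w) > 2 * std
    while out_of_range.any():
        w[out_of_range] = rng.normal(0.0, std, size=out_of_range.sum())
        out_of_range = np.abs(w) > 2 * std
    return w

# The lookup output is then scaled by sqrt(d), matching the paper:
# emb = embeddings[word] * (d ** 0.5)
```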

(d) Weight Initialization: The authors use Glorot/Xavier uniform initialization for the parameters. However, LeCun uniform initialization also works equally well in my experience; I prefer the latter because, with ReLU activations, it tends to perform better on several other related tasks.
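For reference, the two uniform initializers differ only in their range; a small numpy sketch:

```python
import numpy as np

def glorot_uniform(fan_in, fan_out, rng=np.random):
    """Glorot/Xavier uniform: U(-limit, limit), limit = sqrt(6 / (fan_in + fan_out))."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))

def lecun_uniform(fan_in, fan_out, rng=np.random):
    """LeCun uniform: U(-limit, limit), limit = sqrt(3 / fan_in)."""
    limit = np.sqrt(3.0 / fan_in)
    return rng.uniform(-limit, limit, size=(fan_out, fan_in))
```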

(e) Optimizer: The parameters can be quite important. For the single-GPU case: warmup_steps = 16000 and learning_rate_constant = 2; the schedule's learning rate is always multiplied by learning_rate_constant. Also, beta1 = 0.9 and beta2 = 0.997. With these additions, the Transformer AdamTrainer should be good. (https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/learning_rate.py#L43)
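A sketch of the warmup-then-inverse-square-root schedule from the Transformer paper, with the learning_rate_constant multiplier described above (the exact form in the linked T2T file may differ slightly):

```python
def transformer_lr(step, d_model=512, warmup_steps=16000, learning_rate_constant=2.0):
    """lr = constant * d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5)"""
    step = max(step, 1)
    return (learning_rate_constant
            * d_model ** -0.5
            * min(step ** -0.5, step * warmup_steps ** -1.5))

# At each training step, set the trainer's learning rate to transformer_lr(step)
# and use Adam with beta_1 = 0.9, beta_2 = 0.997.
```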

(f) Label Smoothing: Currently we smooth by (1/V), but for correctness we should smooth by (1/(V-2)). One of the excluded dimensions is the padding index and the other is the gold next-word index. But again, this may have minimal impact.
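A numpy sketch of the smoothed target distribution described above: the gold index gets 1 - eps, the padding index gets 0, and the remaining V - 2 indices share the smoothing mass eps equally (assumes the gold index is not the padding index):

```python
import numpy as np

def smoothed_target(gold_idx, vocab_size, pad_idx=0, eps=0.1):
    """Gold index gets 1 - eps; padding gets 0; the other V - 2 indices
    each get eps / (V - 2). The result sums to 1.0."""
    dist = np.full(vocab_size, eps / (vocab_size - 2))
    dist[pad_idx] = 0.0
    dist[gold_idx] = 1.0 - eps
    return dist
```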

(g) Number of words in a minibatch: For the single-GPU case, the T2T toolkit uses at most 4096 words across source and target combined. I am not sure whether they include padding words in this count, though. It would be helpful to have this as a configurable option in xnmt's current (awesome) batching strategy.
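A sketch of token-budget batching: sentence pairs are accumulated until the combined source and target token count would exceed the budget (whether padding tokens should count is the open question noted above; names are illustrative):

```python
def batch_by_tokens(sentence_pairs, max_tokens=4096):
    """sentence_pairs: iterable of (src_tokens, tgt_tokens) lists, assumed
    pre-sorted by length; yields batches whose combined source + target
    token count stays within max_tokens."""
    batch, count = [], 0
    for src, tgt in sentence_pairs:
        size = len(src) + len(tgt)
        if batch and count + size > max_tokens:
            yield batch
            batch, count = [], 0
        batch.append((src, tgt))
        count += size
    if batch:
        yield batch
```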

(h) BLEU Score: T2T has a script called t2t-bleu to compute BLEU scores. It is based on the mteval-v14.pl script (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v14.pl) and is now also recommended by multi-bleu.perl (in the form of a warning) (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl#L172). I have ported the t2t-bleu code to compute BLEU scores; it is attached here (I had to change the extension to .txt to support attachment): t2t-bleu.txt. Its usage is: t2t-bleu --translation=test.out --reference=test.tgt

Lastly, a 6-layer Transformer model can give a score of 28.12 on En->Vi (IWSLT 2015) and 40.7 on the Ja->En train-big dataset, both of which are included with xnmt.

Also, please let me know if any of the above points are unclear or erroneous. Thanks, Devendra.

neubig commented 6 years ago

Started working on the speed-related issues here: https://github.com/neulab/xnmt/tree/transformer-optimizations