mit-han-lab / lite-transformer

[ICLR 2020] Lite Transformer with Long-Short Range Attention
https://arxiv.org/abs/2004.11886

Wrong key_padding_mask position? #6

Closed: godweiyang closed this issue 3 years ago

godweiyang commented 4 years ago

In the original fairseq code, the encoder_padding_mask is applied after linear1, as follows:

x = self.linear1(x)
if self.act is not None:
    x = self.act(x)
if encoder_padding_mask is not None:
    x = x.masked_fill(encoder_padding_mask.transpose(0, 1).unsqueeze(2), 0)
x = self.conv(x)
x = self.linear2(x)

However, in your code, the encoder_padding_mask is applied before linear1, as follows:

mask = key_padding_mask
if mask is not None:
    q = q.masked_fill(mask.transpose(0, 1).unsqueeze(2), 0)
x = branch(q.contiguous(), incremental_state=incremental_state)

I think the bias in linear1 makes the encoder_padding_mask useless in your code: even though the padded positions are zeroed out before linear1, the bias term produces nonzero activations at those positions again, so the convolution still sees nonzero values at the padding. In my experiments, the second approach gives a bad result while the first one gives a normal result.
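
To make the concern concrete, here is a minimal PyTorch sketch (a toy nn.Linear standing in for linear1, not the actual module) showing that masking before a biased linear layer does not keep the padded positions at zero, while masking after it does:

import torch
import torch.nn as nn

torch.manual_seed(0)

T, B, C = 5, 2, 8                                  # fairseq layout: (seq_len, batch, channels)
x = torch.randn(T, B, C)
key_padding_mask = torch.zeros(B, T, dtype=torch.bool)
key_padding_mask[0, 3:] = True                     # last two steps of sentence 0 are padding

linear1 = nn.Linear(C, C, bias=True)               # toy stand-in for linear1

# Mask BEFORE linear1: the inputs at padded positions are zero,
# but the bias term makes the outputs there nonzero again.
q = x.masked_fill(key_padding_mask.transpose(0, 1).unsqueeze(2), 0)
print(linear1(q)[3:, 0].abs().max())               # > 0: padding is no longer zero

# Mask AFTER linear1: the padded positions are exactly zero going into the conv.
out = linear1(x).masked_fill(key_padding_mask.transpose(0, 1).unsqueeze(2), 0)
print(out[3:, 0].abs().max())                      # tensor(0.)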

Michaelvll commented 4 years ago

Hi Wei,

Sorry for my late reply and thank you very much for asking! For self-attention modules, the placement of the masking may influence performance. However, for dynamic convolution, we did not observe much difference in performance when applying the mask before linear1.

To verify this, I trained the dynamic convolution model (Wu et al., Pay Less Attention with Lightweight and Dynamic Convolutions) in fairseq with the two masking methods. On the WMT En-De dataset, the model achieves a BLEU of 27.7 with the original masking placement and 27.9 with the masking placement used in our code. The training command is shown below:

python train.py \
  data-bin/wmt_en_de \
  --fp16  --log-interval 100 --no-progress-bar \
  --max-update 30000 --share-all-embeddings --optimizer adam \
  --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0 \
  --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
  --min-lr 1e-09 --update-freq 16 --attention-dropout 0.1 --keep-last-epochs 10 \
  --ddp-backend=no_c10d --max-tokens 3584 \
  --lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
  --lr-shrink 1 --max-lr 0.001 --lr 1e-7 \
  --t-mult 1 --lr-period-updates 20000 \
  --arch lightconv_wmt_en_de --save-dir $SAVE \
  --encoder-glu 1 --decoder-glu 1 \
  --distributed-port $PORT --distributed-world-size 8 --num-workers 16

Similar behavior is also observed in our model. Placing the mask before the linear1 layer still ensures that the gradient flowing back into the padded positions of the previous layer's output is zero, which limits the influence of the padding on the training process.
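
The following is a small PyTorch sketch (again with a toy linear1, not the actual module) illustrating this point: masked_fill before linear1 blocks the gradient path, so the previous layer receives zero gradient at the padded positions.

import torch
import torch.nn as nn

torch.manual_seed(0)

T, B, C = 5, 2, 8
key_padding_mask = torch.zeros(B, T, dtype=torch.bool)
key_padding_mask[0, 3:] = True                       # last two steps of sentence 0 are padding

prev_out = torch.randn(T, B, C, requires_grad=True)  # stand-in for the previous layer's output
linear1 = nn.Linear(C, C)

# Mask before linear1 (the placement used in this repo), then backprop.
q = prev_out.masked_fill(key_padding_mask.transpose(0, 1).unsqueeze(2), 0)
linear1(q).sum().backward()

# masked_fill cuts the gradient path at the masked positions, so the previous
# layer receives zero gradient there even though the forward activations after
# linear1 are nonzero because of the bias.
print(prev_out.grad[3:, 0].abs().max())              # tensor(0.)
print(prev_out.grad[:3, 0].abs().max())              # > 0 for non-padded positions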

Thank you again for pointing out the difference! I think it would be a good idea to change our codebase to be consistent with the original one, and I would be glad to revise this in a future update.

godweiyang commented 4 years ago

Thanks for your reply! Do you mean that you re-ran the fairseq code on WMT14 En-De with lightweight convolution and only got a BLEU of 27.9? In the original paper (Wu et al., Pay Less Attention with Lightweight and Dynamic Convolutions), they reported a BLEU of 28.9 with lightweight convolution and 29.7 with dynamic convolution. I also re-implemented the two convolutions in TensorFlow and only got a BLEU of about 27.7~27.9, so I am not sure whether this is a problem with fairseq itself or whether I missed some tricks.

Michaelvll commented 4 years ago

The BLEU score of 27.9 is for a model with an embedding size similar to the Transformer base (512/2048). I believe that in the original paper, 28.9 and 29.7 are reported for models comparable to the Transformer big (1024/4096). In fairseq, you may need to set --arch=lightconv_wmt_en_de_big to reproduce the results of the original paper.