Closed godweiyang closed 3 years ago
Hi Wei,
Sorry for my late reply, and thank you very much for asking! For self-attention modules, the place of the masking may influence the performance. However, for the dynamic convolution, we did not observe much difference in performance when applying the mask before linear1.
To verify this, I trained the dynamic convolution model (Wu et al., Pay Less Attention with Lightweight and Dynamic Convolutions) in fairseq with the two masking methods. On the WMT En-De dataset, the model achieves 27.7 BLEU with the original masking place and 27.9 BLEU with the masking place used in our code. The training command is shown below:
python train.py \
data-bin/wmt_en_de \
--fp16 --log-interval 100 --no-progress-bar \
--max-update 30000 --share-all-embeddings --optimizer adam \
--adam-betas '(0.9, 0.98)' --clip-norm 0.0 --weight-decay 0.0 \
--criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
--min-lr 1e-09 --update-freq 16 --attention-dropout 0.1 --keep-last-epochs 10 \
--ddp-backend=no_c10d --max-tokens 3584 \
--lr-scheduler cosine --warmup-init-lr 1e-7 --warmup-updates 10000 \
--lr-shrink 1 --max-lr 0.001 --lr 1e-7 \
--t-mult 1 --lr-period-updates 20000 \
--arch lightconv_wmt_en_de --save-dir $SAVE \
--encoder-glu 1 --decoder-glu 1 \
--distributed-port $PORT --distributed-world-size 8 --num-workers 16
Similar behavior is also observed in our model. Placing the mask before the linear1 layer still ensures that the gradients at the padded positions in the outputs of the previous layer are zero, which limits the influence of the padding on the training process.
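To illustrate this point, here is a minimal PyTorch sketch (the tensor shapes, names, and the toy linear layer are illustrative only, not taken from the actual model code): it applies the padding mask before a linear layer with a bias and checks that the gradients flowing back to the padded positions of the previous layer's output are still zero:
import torch
import torch.nn as nn

T, B, C = 5, 2, 8                                  # time, batch, channels
x = torch.randn(T, B, C, requires_grad=True)       # output of the previous layer
linear1 = nn.Linear(C, 2 * C)                      # has a bias term

# padding mask with True at padded positions, shape (B, T) as in fairseq
encoder_padding_mask = torch.zeros(B, T, dtype=torch.bool)
encoder_padding_mask[:, -2:] = True                # last two steps are padding

# masking BEFORE linear1 (the placement discussed in this thread)
h = x.masked_fill(encoder_padding_mask.transpose(0, 1).unsqueeze(2), 0)
out = linear1(h)
out.sum().backward()

# gradients at the padded positions of the previous layer's output are zero,
# even though the bias of linear1 makes the padded outputs non-zero
pad = encoder_padding_mask.transpose(0, 1).unsqueeze(2).expand_as(x)
print(x.grad[pad].abs().max())                                # tensor(0.)
print(out.transpose(0, 1)[encoder_padding_mask].abs().max())  # non-zero (bias)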
Thank you again for pointing out the difference! I think it would be a good idea to change our codebase to be consistent with the original one, and I am glad to revise this in a future update.
Thanks for your reply! Did you mean that you re-ran the fairseq code on WMT14 En-De with lightweight convolution and only got a BLEU of 27.9? In the original paper (Wu et al., Pay Less Attention with Lightweight and Dynamic Convolutions), they reported a BLEU of 28.9 with lightweight convolution and 29.7 with dynamic convolution. But I also re-implemented the two convolutions in TensorFlow and only got a BLEU of about 27.7~27.9, so I do not know whether it is a problem with fairseq itself or whether I omitted some tricks.
The BLEU score of 27.9 is for the model with an embedding size similar to the Transformer base (512/2048). I believe that in the original paper, 28.9 and 29.7 are for models similar to the Transformer big (1024/4096). For fairseq, you may need to set --arch=lightconv_wmt_en_de_big to get a result similar to the original paper.
In the original fairseq code, the encoder_padding_mask is applied after linear1. However, in your code, the encoder_padding_mask is applied before linear1 (see the sketch of the two orderings below). I think the bias in linear1 will make the encoder_padding_mask useless in your code. In my experiments, the second approach gets a bad result and the first one gets a normal result.
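Since the code snippets originally quoted in this comment did not survive, here is a minimal, runnable sketch of the two orderings being compared. The (T, B, C) input shape, the (B, T) boolean padding mask, and the masked_fill pattern follow fairseq's conventions, but the exact code in either repository may differ in detail:
import torch
import torch.nn as nn

def mask_after_linear1(x, pad_mask, linear1, conv):
    # original fairseq ordering: linear1 -> mask -> conv
    x = linear1(x)                                  # the bias makes padded rows non-zero
    x = x.masked_fill(pad_mask.transpose(0, 1).unsqueeze(2), 0)  # zero them again
    return conv(x)

def mask_before_linear1(x, pad_mask, linear1, conv):
    # ordering questioned in this issue: mask -> linear1 -> conv
    x = x.masked_fill(pad_mask.transpose(0, 1).unsqueeze(2), 0)
    x = linear1(x)                                  # the bias re-introduces non-zero values at padded positions
    return conv(x)

T, B, C = 6, 2, 8
x = torch.randn(T, B, C)
pad_mask = torch.zeros(B, T, dtype=torch.bool)
pad_mask[:, -2:] = True                             # last two steps are padding
linear1 = nn.Linear(C, C)
conv = nn.Identity()                                # stand-in for the light/dynamic conv

a = mask_after_linear1(x, pad_mask, linear1, conv)
b = mask_before_linear1(x, pad_mask, linear1, conv)
# padded positions are exactly zero only with the first ordering
print(a.transpose(0, 1)[pad_mask].abs().max())      # tensor(0.)
print(b.transpose(0, 1)[pad_mask].abs().max())      # non-zero (the linear1 bias)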