nlp-dke / NMTGMinor


Some scripts are missing for zero-shot #1

Closed · SefaZeng closed this issue 2 years ago

SefaZeng commented 3 years ago

In prepro.sh, some of the scripts used in defaultPreprocessor can't be found in this repo. The preprocessing procedure is very important for reproducing the results in the paper. Could you also release the preprocessed data used in your experiments? That would be an easy way to follow your work.

dannigt commented 3 years ago

Hi, thanks a lot for spotting this!

I've added the missing default scripts here: https://github.com/nlp-dke/NMTGMinor/tree/master/recipes/zero-shot/scripts/defaultPreprocessor. They are based on the scripts here, with additional support for sentencepiece tokenization.
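In case it helps before the data upload, here is a minimal sketch of how sentencepiece tokenization could be plugged into such a preprocessing step (the model prefix, vocabulary size and file names are placeholders, not necessarily what the released scripts use):

# Train a sentencepiece model on the concatenated training text,
# then apply it to each side of the parallel data.
spm_train --input=train.concat.txt \
          --model_prefix=sentencepiece.bpe \
          --vocab_size=40000 \
          --character_coverage=1.0 \
          --model_type=bpe

spm_encode --model=sentencepiece.bpe.model --output_format=piece < train.en > train.sp.en
spm_encode --model=sentencepiece.bpe.model --output_format=piece < train.it > train.sp.it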

As you suggested, the easiest way would be to upload the preprocessed data. I'm tracking the progress in this task. Currently trying to figure out the rights on redistribution.

SefaZeng commented 3 years ago

That's great! Thank you so much. There is still another question about the data: I downloaded the IWSLT training data from https://wit3.fbk.eu/2017-01, but the number of sentences for each direction is about 200k, which is larger than the 145k reported in the paper. Did I miss something?

dannigt commented 3 years ago

@SefaZeng Yes you're right! The original IWSLT 17 set contains 200k+ sentences.

We used the data from the MMCR4NLP repo: https://arxiv.org/pdf/1710.01025.pdf (for the IWSLT and Europarl-multiway experiments). This is a multiway subset of the full dataset. Sorry this wasn't made explicit in the paper! I'll update the arXiv version with a footnote.

I also tried the full IWSLT set and the gain from removing residual connections was similar.

We chose the multiway condition to hold the English data constant, because some initial experiments showed gains in supervised directions just from having more diverse English data (this trend is also observable when comparing rows 2 and 3 in Table 3). Although this isn't really important for zero-shot, we chose to stick to a more constrained setup that factors out these additional influences.

SefaZeng commented 3 years ago

Thank you for your reply. I have downloaded the data from MMCR4NLP and found the subset for IWSLT. But for Europarl, I can only find the full data, which consists of 17M sentences. So what is Europarl (non-overlap / multiway)? As described in your paper and repo, it should contain 2M sentences, with 119k for each direction.

dannigt commented 3 years ago

Thanks for following up! @SefaZeng

For Europarl, we used mmcr4nlp/europarl/10lingual/train.10langmultiway.europarl-v7.*[1] with 1067195 lines each. This corresponds to the Europarl-full case in the paper. The dev and test sets are also from there with 2000 lines for each direction.

Europarl-non-overlap and Europarl-multiway are both subsets of Europarl-full (as explained in subsection 3.1 and the last paragraph of section 4, directly above subsection 4.1). More specifically, we split the full 1067195 lines into 9 subsets of 118577 lines each (by split -l 118577 *).

The non-overlap subset was constructed by using subset{1,2,3,4,5,6,7,8} for {da,de,es,fi,fr,it,nl,pt} respectively, while the multiway subset by using subset 1 for all the languages. (The original data was already deduplicated.)
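For illustration, a rough sketch of this subset construction (file names and the chunk numbering produced by split are assumptions about the layout, not the exact commands we ran):

# Split each 1067195-line multiway file into chunks of 118577 lines;
# with -d the chunks are numbered 00, 01, ... per language.
for lang in en da de es fi fr it nl pt; do
    split -d -l 118577 train.10langmultiway.europarl-v7.$lang subset.$lang.
done

# Europarl-multiway: take the same chunk (e.g. the first one) for every language,
# so all directions share the same English sentences.
# Europarl-non-overlap: take a different chunk per non-English language
# (one each for da, de, es, fi, fr, it, nl, pt), paired with the matching
# chunk of the English file, so no English sentence is reused across directions.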

[1]: Here we did not include Swedish train.10langmultiway.europarl-v7.*sv*, hoping to use it for adaptation experiments like in section 4.2, but eventually did not have time to run those experiments.

SefaZeng commented 3 years ago

Thanks for your reply. I am trying to reproduce the no-residual result in fairseq, and I find it performs worse than the baseline. The only change I made was removing the residual connection in one encoder layer. I also find that the BLEU scores of the nl->it, ro->it, and ro->nl directions are much lower than the other three directions. Is this normal? And are there any tips on the code for the residual connection? Any response is appreciated. Thank you!

dannigt commented 3 years ago

Thanks for the updates! @SefaZeng

I was able to reproduce the results in fairseq using: https://github.com/dannigt/fairseq/tree/master/examples/residual_drop

The training script is below. It drops the residual connection after self-attention in the 3rd encoder layer as described in our paper.

fairseq-train \
    $DATA_BIN \
    --user-dir $FAIRSEQ_DIR/examples/residual_drop/residual_drop_src \
    --arch residual_drop_transformer --share-all-embeddings \
    --encoder-layers 5 --decoder-layers 5 \
    --encoder-embed-dim 512 --decoder-embed-dim 512 \
    --encoder-ffn-embed-dim 2048 --decoder-ffn-embed-dim 2048 \
    --encoder-attention-heads 8 --decoder-attention-heads 8 \
    --encoder-normalize-before --decoder-normalize-before \
    --dropout 0.2 --attention-dropout 0.2 --relu-dropout 0.2 \
    --weight-decay 0.0001 \
    --label-smoothing 0.1 --criterion label_smoothed_cross_entropy \
    --optimizer adam --clip-norm 0 \
    --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-7 \
    --lr 1e-3 \
    --max-tokens 4000 \
    --update-freq 4 \
    --save-interval 1 --save-interval-updates 5000 --keep-interval-updates 5 --no-epoch-checkpoints \
    --fp16 \
    --task translation_multi_simple_epoch \
    --decoder-langtok \
    --encoder-langtok "tgt" \
    --lang-pairs "$lang_pairs" \
    --encoder-drop-residual 2 \
    --save-dir ./r3.example

The resulting BLEU scores were 31.9, 28.0, 23.3, 35.2, 30.4, 29.9 for en-it, en-nl, en-ro, it-en, nl-en, ro-en and 18.0, 17.6, 18.7, 14.8, 19.8, 16.3 for it-nl, it-ro, nl-it, nl-ro, ro-it, ro-nl.

There I found it was important to prepend the target language tag (--encoder-langtok "tgt") to the source sentences. (In our implementation in this repo, we don't prepend any tokens to the source sentences, but have a target token at every decoding step.) I'm not entirely sure why the token makes so much of a difference. A recent paper from ByteDance discusses this a bit: https://arxiv.org/pdf/2106.07930.pdf. As I'm currently on another project, I'd like to look into this further once I'm back on this one.
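For completeness, generation needs the same langtok settings as training; a hedged sketch (checkpoint path, language pair and post-processing are placeholders that depend on your setup):

fairseq-generate \
    $DATA_BIN \
    --user-dir $FAIRSEQ_DIR/examples/residual_drop/residual_drop_src \
    --path ./r3.example/checkpoint_best.pt \
    --task translation_multi_simple_epoch \
    --lang-pairs "$lang_pairs" \
    --source-lang nl --target-lang it \
    --encoder-langtok "tgt" --decoder-langtok \
    --beam 4 --remove-bpe sentencepiece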

dannigt commented 3 years ago

An additional note on the fairseq model above: for the baseline (w/o --encoder-drop-residual 2), the BLEU scores were 32.2, 27.8, 23.2, 35.5, 31.3, 30.1 for en-it, en-nl, en-ro, it-en, nl-en, ro-en and 8.0, 6.2, 7.3, 5.9, 7.2, 7.6 for it-nl, it-ro, nl-it, nl-ro, ro-it, ro-nl.

SefaZeng commented 3 years ago

Hi @dannigt, hope you are still here. Are there any hyperparameter differences between IWSLT and Europarl? I followed the IWSLT settings above, and I can reproduce the supervised results on Europarl. But for zero-shot performance, dropping the residual only gives about a 1.5 BLEU gain, which is much less than the scores in the paper. Have you ever tried to reproduce the Europarl results in fairseq? Any reply is appreciated. Thanks in advance!

dannigt commented 3 years ago

Hi @SefaZeng, thanks for following up! If it's the Europarl-full case, there is a difference in that the zero-shot dev directions were also included in the dev set for early stopping (described in the 3rd paragraph of section 3.2). Otherwise, I didn't run Europarl experiments with fairseq. The implementation of the target language token does differ a bit between this repo and fairseq, but my gut feeling is that it shouldn't make a big difference. If the difference is still there after confirming the dev set is the same, we could compare our training logs to look deeper.
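If it helps with checking, a rough sketch of how a combined dev set including the zero-shot directions could be assembled (file naming is hypothetical; the point is that the zero-shot pairs contribute to the validation score used for early stopping):

# Concatenate dev data from all directions, supervised and zero-shot,
# into one validation set; extend the list to cover all desired pairs.
for pair in en-da en-de en-es da-de de-es es-da; do
    src=${pair%-*}; tgt=${pair#*-}
    cat dev.$pair.$src >> valid.src
    cat dev.$pair.$tgt >> valid.tgt
done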

dannigt commented 2 years ago

Closing this now - but feel free to reach out in case of further questions!