tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

universal_transformer in machine translation #1021

Closed zherowolf closed 6 years ago

zherowolf commented 6 years ago

Description

Hi guys, has anyone tried the Universal Transformer on machine translation tasks? My experiments with the default settings do not surpass the Transformer on a zh-en MT task.

MostafaDehghani commented 6 years ago

by "default settings", you mean "--hparams_set=universal_transformer_base"? Have you tried "--hparams_set=universal_transformer_fc_base"?

zherowolf commented 6 years ago

Thanks for your reply, @MostafaDehghani. I actually used "--hparams_set=universal_transformer_base" as the default settings and haven't tried "universal_transformer_fc_base"; I will try that. BTW, could you please share the settings or hyperparameters used in your paper, so I can reproduce the Universal Transformer's 28.9 BLEU on the EN-DE translation task reported in your remarkable paper?

MostafaDehghani commented 6 years ago

No problem :) For EN-DE, we used "universal_transformer_fc_base" and trained the model in a multi-GPU setup (8 P100 GPUs, for 500k steps I believe). You should make sure that the capacity of the model (number of trainable parameters) for the Universal Transformer is similar to that of the counterpart Transformer model.
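
A minimal sketch of one way to check this, assuming TF 1.x checkpoints and hypothetical output directories: count the parameters stored in each model's latest checkpoint (skipping optimizer slots) and compare the totals.

```python
import numpy as np
import tensorflow as tf  # TF 1.x, as used elsewhere in this thread


def count_checkpoint_params(checkpoint_dir):
    """Rough parameter count from the latest checkpoint in checkpoint_dir.

    Optimizer slot variables (e.g. Adam moments) and the global step are
    skipped so the total approximates the trainable parameter count.
    """
    ckpt = tf.train.latest_checkpoint(checkpoint_dir)
    reader = tf.train.NewCheckpointReader(ckpt)
    total = 0
    for name, shape in reader.get_variable_to_shape_map().items():
        if "Adam" in name or name == "global_step":
            continue
        total += int(np.prod(shape))
    return total


# Hypothetical output_dir paths; compare the two totals before comparing BLEU.
print("transformer_base:", count_checkpoint_params("/tmp/t2t_train/transformer_base"))
print("universal_transformer:", count_checkpoint_params("/tmp/t2t_train/universal_transformer"))
```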

zherowolf commented 6 years ago

Thanks, that should be helpful! I will try that on EN-DE and report my results later.

phildani7 commented 6 years ago

My settings:

DATA_DIR=/home/phil/t2t_data_big_en_hi
OUTDIR=/home/phil/big_en_hi/trained_model
t2t-trainer \
  --data_dir=$DATA_DIR \
  --t2t_usr_dir=./big_en_hi/trainer \
  --problem=big_en_hi \
  --model=universal_transformer \
  --hparams_set=universal_transformer_base \
  --output_dir=$OUTDIR \
  --worker_gpu=2 \
  --train_steps=10000000

And the error: INFO:tensorflow:Cannot use 'Identity_74' as input to 'Identity_17' because they are in different while loops.

Identity_74 while context: universal_transformer/parallel_1_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context
Identity_17 while context: universal_transformer/parallel_0_5/universal_transformer/universal_transformer/body/encoder/universal_transformer_basic/foldl/while/while_context

Any help?

MostafaDehghani commented 6 years ago

Isn't this the same issue raised in #1006?

phildani7 commented 6 years ago

The error exists even in the latest versions:

tensor2tensor==1.8.0
tensorboard==1.10.0
tensorflow==1.10.0

zherowolf commented 6 years ago

For your information, I have almost reproduced the EN-DE translation results. In the paper, they achieved 28.9 with the Universal Transformer. Here are my results: my baseline with transformer base achieved 28.19 BLEU, and the Universal Transformer has achieved 28.63 BLEU so far (it has not reached convergence yet). Thanks @MostafaDehghani, great work! [image]

zherowolf commented 6 years ago

For your information, the BLEU has reached 28.9 now.

colmantse commented 6 years ago

Hi, is this SOTA minus the preprocessing and the deliberation network?

MostafaDehghani commented 6 years ago

@zherowolf great! Just make sure that you are not looking at the "approximate BLEU" in t2t :) Check out #436

zherowolf commented 6 years ago

I preprocessed my training and test data with the Moses scripts (including segmentation), and computed the BLEU score of each checkpoint's output with mteval-v13a.perl after detokenization. BTW, I used "training-parallel-nc-v11.tgz" from WMT16, which may differ from what you used.
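
For anyone following along, a minimal sketch (not the pipeline used above) of scoring detokenized decoder output against a detokenized reference with sacrebleu, whose default "13a" tokenization mirrors mteval-v13a.perl; the file paths here are hypothetical.

```python
import sacrebleu  # pip install sacrebleu

# Hypothetical files: one sentence per line, already detokenized.
with open("newstest2014.de") as f:
    refs = [line.strip() for line in f]
with open("decoded.universal_transformer.de") as f:
    hyps = [line.strip() for line in f]

# corpus_bleu applies the "13a" tokenizer by default, matching mteval-v13a.perl,
# so the score is comparable to the one reported by the Moses scoring script.
bleu = sacrebleu.corpus_bleu(hyps, [refs])
print(bleu.score)
```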

Bournet commented 6 years ago

@MostafaDehghani Hi, I can't find "universal_transformer_fc_base" in the latest code. Was it replaced?

MostafaDehghani commented 6 years ago

@Bournet, yep! Since people were mostly interested in trying the MT experiments, I changed the default transition function from "sepconv" to "fc" in a PR I sent two days ago: https://github.com/tensorflow/tensor2tensor/pull/1036/commits/e4968979f904a7bcdf3ffe0591781f0efe2dae98

So right now, "universal_transformer_base" (which is equivalent to "universal_transformer_fc_base" in the old code) is the hparams_set you need to use to reproduce the MT results in the paper :)
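
A minimal sketch for double-checking which transition function a given hparams_set uses after that change; the hparam name transformer_ffn_type is my reading of universal_transformer.py at the time and may differ across versions.

```python
# Build the registered hparams set directly and inspect the transition function.
from tensor2tensor.models.research import universal_transformer

hp = universal_transformer.universal_transformer_base()
# Assumed hparam name; expect "fc" after PR #1036 (it was "sepconv" before).
print(hp.transformer_ffn_type)
```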

Bournet commented 6 years ago

@MostafaDehghani Ok, thank you for the reply :)

robotzheng commented 6 years ago

@zherowolf, can you share your logs and configs? I can't reproduce the paper's result; the model does not converge.

INFO:tensorflow:loss = 5.653845, step = 2500 (147.504 sec)
INFO:tensorflow:global_step/sec: 0.692351
INFO:tensorflow:loss = 5.602401, step = 2600 (144.438 sec)
INFO:tensorflow:global_step/sec: 0.692506
INFO:tensorflow:loss = 5.603458, step = 2700 (144.400 sec)
INFO:tensorflow:global_step/sec: 0.695547
INFO:tensorflow:loss = 5.5827146, step = 2800 (143.772 sec)
INFO:tensorflow:global_step/sec: 0.692561
INFO:tensorflow:loss = 5.7178345, step = 2900 (144.391 sec)
INFO:tensorflow:global_step/sec: 0.691745
INFO:tensorflow:loss = 5.53726, step = 3000 (144.562 sec)
INFO:tensorflow:global_step/sec: 0.693078
INFO:tensorflow:loss = 5.4643216, step = 3100 (144.284 sec)
INFO:tensorflow:global_step/sec: 0.691953
INFO:tensorflow:loss = 5.4527507, step = 3200 (144.519 sec)
INFO:tensorflow:global_step/sec: 0.690533
INFO:tensorflow:loss = 5.5876875, step = 3300 (144.816 sec)
INFO:tensorflow:global_step/sec: 0.692915
INFO:tensorflow:loss = 5.5414114, step = 3400 (144.318 sec)

li10141110 commented 6 years ago

@zherowolf Hi, could you please show us your t2t-trainer settings? Thank you in advance! The following are my settings (no convergence):

nohup t2t-trainer \
  --data_dir=$DATA_DIR \
  --problem=translate_ende_wmt32k \
  --model=universal_transformer \
  --hparams_set=universal_transformer_base \
  --hparams='batch_size=5120' \
  --train_steps=7000000 \
  --random_seed=33 \
  --worker_gpu=8 \
  --output_dir=$TRAIN_DIR \
  --eval_steps=10000 &

Any help?

robotzheng commented 5 years ago

@zherowolf, --train_steps=7000000? How many steps did it take for your model to reach 28.9? From your figure above, 69 epochs, which is about 550000 steps, is that correct? Thanks.

zherowolf commented 5 years ago

Sorry for the late reply. My experiment settings are as follows: [data preprocess]

  1. I used "training-parallel-commoncrawl.tgz", "training-parallel-europarl-v7.tgz" and "training-parallel-nc-v12.tgz", which are available on the WMT website.
  2. I preprocessed my data with https://github.com/pytorch/fairseq/blob/master/examples/translation/prepare-wmt14en2de.sh (thanks to @myleott).

[setup] I did not use translate_ende_wmt32k; I defined my own problem and hparams for EN-DE, but I don't think there's much difference. For the transformer base model: [image] For the universal transformer base: [image]
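
In case it helps anyone reproducing this, a minimal sketch of what "defining your own problem" can look like with --t2t_usr_dir; the class name, vocab size, URLs, and file names here are hypothetical, and the exact base-class hooks may differ between t2t versions.

```python
# my_usr_dir/__init__.py  (hypothetical user dir, passed via --t2t_usr_dir)
from tensor2tensor.data_generators import problem
from tensor2tensor.data_generators import translate_ende
from tensor2tensor.utils import registry

# Hypothetical corpus locations (same [url, (src_file, tgt_file)] layout as
# the stock WMT dataset lists in translate_ende.py).
_MY_TRAIN_DATASETS = [[
    "http://example.com/my-ende-train.tgz",
    ("train.en", "train.de"),
]]
_MY_DEV_DATASETS = [[
    "http://example.com/my-ende-dev.tgz",
    ("dev.en", "dev.de"),
]]


@registry.register_problem
class TranslateEndeMyData(translate_ende.TranslateEndeWmt32k):
  """EN-DE translation with a custom corpus, reusing the WMT32k setup."""

  @property
  def approx_vocab_size(self):
    return 2**15  # 32k subwords, same as the stock problem

  def source_data_files(self, dataset_split):
    # Assumed hook: point the generator at the custom corpus instead of
    # the default WMT downloads.
    train = dataset_split == problem.DatasetSplit.TRAIN
    return _MY_TRAIN_DATASETS if train else _MY_DEV_DATASETS
```

The registered problem name would then be translate_ende_my_data, passed to t2t-datagen and t2t-trainer together with --t2t_usr_dir=my_usr_dir.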

[training] I trained both models with 4 V100 GPUs. For transformer base, I trained 361600 steps within about 60 hours. For universal transformer, I trained 262000 steps within about 90 hours. The final results on newstest2014 with each checkpoint are the following: [image]

I also trained a transformer big model (green line), which is better with more parameters.

I hope this helps. @robotzheng @li10141110

520jefferson commented 5 years ago

@zherowolf I trained both universal_transformer_big and transformer_big on one P40 GPU (PROBLEM=translate_enzh_wmt32k), but during training the process stops by itself early, which is weird. Have you run into this?

xuekun90 commented 5 years ago

@MostafaDehghani could you take a look at issue #1006? Several people have hit the same problem: with the universal_transformer_base hparams, after about 100k steps the model does not converge and the loss stays around 4~5. Any solution to this?

INFO:tensorflow:Saving dict for global step 109000: global_step = 109000, loss = 4.23917, metrics-translate_enzh_wmt32k/targets/accuracy = 0.32611865, metrics-translate_enzh_wmt32k/targets/accuracy_per_sequence = 0.0, metrics-translate_enzh_wmt32k/targets/accuracy_top5 = 0.5148043, metrics-translate_enzh_wmt32k/targets/approx_bleu_score = 0.028782098, metrics-translate_enzh_wmt32k/targets/neg_log_perplexity = -4.2202907, metrics-translate_enzh_wmt32k/targets/rouge_2_fscore = 0.08715047, metrics-translate_enzh_wmt32k/targets/rouge_L_fscore = 0.33720428
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 109000: /home/exuekun/AI_Challenger_2018_base/Baselines/english_chinese_machine_translation_baseline/train/universal_train/model.ckpt-109000

kudou1994 commented 5 years ago

@MostafaDehghani (quoting @xuekun90's comment above: with the universal_transformer_base hparams the model does not converge after about 100k steps and the loss stays around 4~5, see issue #1006 and the eval log at step 109000.)

Hi, can I add you on WeChat? I have the same question.

li10141110 commented 5 years ago

@zherowolf Thank you so much. Could you please also tell us your Universal Transformer t2t-trainer settings for the zh-en MT task? My loss fluctuated between 2 and 3 over 400,000 steps in my experiments with the default settings on the zh-en MT task.

h-karami commented 5 years ago

(quoting @zherowolf's settings above)

@zherowolf Will you share your code?

lkluo commented 5 years ago

"universal_transformer_big" does not work, I have to use "universal_transformer_base" with "hparams="hidden_size=2048,filter_size=8196". "universal_transformer_big" requires more GPU RAM than "transformer_big", thus smaller batch size (in my case 2048 vs 3600).