tensorflow / tensor2tensor

Library of deep learning models and datasets designed to make deep learning more accessible and accelerate ML research.
Apache License 2.0

proper size of wmt_ende_tokens_32k-{dev, train}* file? #61

Closed neverdoubt closed 7 years ago

neverdoubt commented 7 years ago

I got quite low performance compared to the paper.

So I did some research and found that the sizes of the wmt_ende_tokens_32k-{dev, train}* files seem too small:

    444K  wmt_ende_tokens_32k-dev-00000-of-00001
    730M  wmt_ende_tokens_32k-train-00000-of-00001

I ran t2t_datagen again (with the 100-shard option) and got the following sizes:

    820K  wmt_ende_tokens_32k-dev-00000-of-00001
    14M   wmt_ende_tokens_32k-train-00000-of-00100
    ...   (total ~1400M)
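For reference, the datagen invocation was roughly the walkthrough command below (a sketch: the directories are placeholders and the flag names are taken from the walkthrough, so treat them as assumptions):

    # Sketch of the data generation call (paths are placeholders;
    # flags assumed from the walkthrough: --data_dir, --tmp_dir, --problem, --num_shards).
    t2t-datagen \
      --data_dir=$HOME/t2t_data \
      --tmp_dir=/tmp/t2t_datagen \
      --problem=wmt_ende_tokens_32k \
      --num_shards=100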

What is the proper size of the wmt_ende_tokens_32k-* files?

lukaszkaiser commented 7 years ago

The first size looks correct: 444K for the dev set (it's only a few thousand sentence pairs, each sentence is ~20 ints, 2-4 bytes/int gives ~160 bytes/sentence pair, so ~400KB looks ok). My -train is sharded 100x and I have 7MB in each file (the dataset is 4M pairs, so again, it makes sense).

lukaszkaiser commented 7 years ago

Since a few people are complaining, could you post the details of your training and results? Did you train on 1 GPU or many? For how many steps? What is the eval printing out? If it's 1 GPU, you should use the transformer_base_single_gpu hparams config; we should make this clearer in the README.
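For a single GPU, the training call would look roughly like this (a sketch: the paths and step count are placeholders, and the flag names are assumed from the walkthrough README):

    # Rough single-GPU training call (sketch; paths and steps are placeholders,
    # flag names assumed from the walkthrough).
    t2t-trainer \
      --data_dir=$HOME/t2t_data \
      --problems=wmt_ende_tokens_32k \
      --model=transformer \
      --hparams_set=transformer_base_single_gpu \
      --output_dir=$HOME/t2t_train/ende \
      --train_steps=250000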

neverdoubt commented 7 years ago

[3x Titan Black, base model]

global_step = 331361, loss = 1.5812, metrics-wmt_ende_tokens_32k/accuracy = 0.663404, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.000824742, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.84237, metrics-wmt_ende_tokens_32k/bleu_score = 0.325035, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.79698, metrics/accuracy = 0.663404, metrics/accuracy_per_sequence = 0.000824742, metrics/accuracy_top5 = 0.84237, metrics/bleu_score = 0.325035, metrics/neg_log_perplexity = -1.79698
=> actual BLEU: 21.x on newstest2013

[4x Titan X, small model: d_model = 256, d_k = 32, d_v = 32, i.e. the 4th (C) model in Table 3 of the paper]

global_step = 167127, loss = 1.48184, metrics-wmt_ende_tokens_32k/accuracy = 0.675814, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.00234962, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.852813, metrics-wmt_ende_tokens_32k/bleu_score = 0.340511, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -1.6753, metrics/accuracy = 0.675814, metrics/accuracy_per_sequence = 0.00234962, metrics/accuracy_top5 = 0.852813, metrics/bleu_score = 0.340511, metrics/neg_log_perplexity = -1.6753
=> actual BLEU: 23.85 on newstest2013

However, I got "out of range" errors when the number of eval steps was larger than 23. If I understood correctly, the dev dataset has many more pairs, so more than 100 eval steps should not run out of range. This is why I suspect the dataset is responsible for the low performance.

zxw866 commented 7 years ago

I trained on 8 NVIDIA Titan Xp GPUs with the "transformer_base" parameters:

    @registry.register_hparams
    def transformer_base():
      """Set of hyperparameters."""
      hparams = common_hparams.basic_params1()
      hparams.hidden_size = 512
      hparams.batch_size = 4096
      hparams.max_length = 256
      hparams.dropout = 0.0
      hparams.clip_grad_norm = 0.  # i.e. no gradient clipping
      hparams.optimizer_adam_epsilon = 1e-9
      hparams.learning_rate_decay_scheme = "noam"
      hparams.learning_rate = 0.1
      hparams.learning_rate_warmup_steps = 4000
      hparams.initializer_gain = 1.0
      hparams.num_hidden_layers = 6
      hparams.initializer = "uniform_unit_scaling"
      hparams.weight_decay = 0.0
      hparams.optimizer_adam_beta1 = 0.9
      hparams.optimizer_adam_beta2 = 0.98
      hparams.num_sampled_classes = 0
      hparams.label_smoothing = 0.1
      hparams.shared_embedding_and_softmax_weights = int(True)

      hparams.add_hparam("filter_size", 2048)  # Add new ones like this.
      # attention-related flags
      hparams.add_hparam("num_heads", 8)
      hparams.add_hparam("attention_key_channels", 0)
      hparams.add_hparam("attention_value_channels", 0)
      hparams.add_hparam("ffn_layer", "conv_hidden_relu")
      hparams.add_hparam("parameter_attention_key_channels", 0)
      hparams.add_hparam("parameter_attention_value_channels", 0)
      # All hyperparameters ending in "dropout" are automatically set to 0.0
      # when not in training mode.
      hparams.add_hparam("attention_dropout", 0.0)
      hparams.add_hparam("relu_dropout", 0.0)
      hparams.add_hparam("residual_dropout", 0.1)
      hparams.add_hparam("pos", "timing")  # timing, none
      hparams.add_hparam("nbr_decoder_problems", 1)
      return hparams

[screenshot: training loss curve]

The data is split into 100 parts. The loss seems too low, yet the BLEU on newstest2013 is only 20.37 (multi-bleu.perl).
Can this configuration achieve the performance reported in the paper? Are 140K steps on 8 GPUs enough? Why are attention_dropout and relu_dropout set to 0? Does this hurt BLEU?

INFO:tensorflow:Evaluation [1/20]
INFO:tensorflow:Evaluation [2/20]
INFO:tensorflow:Evaluation [3/20]
INFO:tensorflow:Evaluation [4/20]
INFO:tensorflow:Evaluation [5/20]
INFO:tensorflow:Evaluation [6/20]
INFO:tensorflow:Evaluation [7/20]
INFO:tensorflow:Evaluation [8/20]
INFO:tensorflow:Evaluation [9/20]
INFO:tensorflow:Evaluation [10/20]
INFO:tensorflow:Evaluation [11/20]
INFO:tensorflow:Evaluation [12/20]
INFO:tensorflow:Evaluation [13/20]
INFO:tensorflow:Evaluation [14/20]
INFO:tensorflow:Evaluation [15/20]
INFO:tensorflow:Evaluation [16/20]
INFO:tensorflow:Evaluation [17/20]
INFO:tensorflow:Evaluation [18/20]
INFO:tensorflow:Evaluation [19/20]
INFO:tensorflow:Evaluation [20/20]
INFO:tensorflow:Finished evaluation at 2017-06-28-04:44:23
INFO:tensorflow:Saving dict for global step 145673: global_step = 145673, loss = 0.787518, metrics-wmt_ende_tokens_32k/accuracy = 0.8182, metrics-wmt_ende_tokens_32k/accuracy_per_sequence = 0.00633413, metrics-wmt_ende_tokens_32k/accuracy_top5 = 0.925367, metrics-wmt_ende_tokens_32k/bleu_score = 0.496107, metrics-wmt_ende_tokens_32k/neg_log_perplexity = -0.944883, metrics/accuracy = 0.8182, metrics/accuracy_per_sequence = 0.00633413, metrics/accuracy_top5 = 0.925367, metrics/bleu_score = 0.496107, metrics/neg_log_perplexity = -0.944883

lukaszkaiser commented 7 years ago

@zxw866 : that looks like a very strong model!

@neverdoubt : when you say "=> actual bleu : 23.85 on newstest2013", how do you measure that exactly? Do you use the MOSES scripts, a recent version? Remember that newstest2014 is often 0.5 BLEU or more higher than '13; could you run on that? There is also the hyphenation-split issue, which can account for around a 0.2 difference. We should probably also document the BLEU calculation we use somewhere.

Ah, also, we average the last 20 checkpoints with this script: https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/avg_checkpoints.py

Did you try that? Let's get your results to the same level as ours!

But with your hardware guys, you should try transformer_big too!

neverdoubt commented 7 years ago

I used the recent MOSES multi-bleu.perl (https://raw.githubusercontent.com/moses-smt/mosesdecoder/master/scripts/generic/multi-bleu.perl).
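The scoring call is the standard one (a sketch; the file names are placeholders, and both hypothesis and reference use the same tokenization):

    # Standard multi-bleu scoring (sketch; file names are placeholders).
    # Reference and hypothesis must be tokenized the same way.
    perl multi-bleu.perl newstest2013.tok.de < decoded.tok.de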

The actual BLEU scores are computed using the newstest2013.en file (which is the dev set). I expect 25.8 (as in Table 3) from my base trained model. Anyway, I'll try the big model.

zxw866 commented 7 years ago

Although "metrics-wmt_ende_tokens_32k/bleu_score = 0.496107" is high, the BLEU on newstest2013 is only 20.37 (multi-bleu.perl). Can this configuration achieve the 25.8 reported in the paper? Are 140K steps on 8 GPUs enough?

mehmedes commented 7 years ago

@neverdoubt Did you use newstest2013.en without preprocessing or did you postprocess the Tensor2Tensor output before BLEU scoring? I think multi-bleu needs the source and reference to be tokenized...

zxw866 commented 7 years ago

The result of data generation in the Walkthrough is about 1400M, which is double the size of the BPE training sets. I'm guessing '_' is used as an independent token in the sentences, which leads to the very low loss. As shown in the data generation process:

[screenshot: generated training data with separate '_' tokens]

I wonder if this is a BUG?

lukaszkaiser commented 7 years ago

It turns out that the separate "_" was a bug introduced inadvertently in a recent PR by Villi (see the chat on Gitter). We didn't have it before, so it might be responsible for some of the lower BLEU, but maybe not that much -- we should correct it in any case.

Another point is that all results in the paper are obtained with checkpoint averaging. Use the avg_checkpoints script from utils on the last 20 checkpoints saved in your $TRAIN_DIR. It's a poor man's version of Polyak averaging, but it's needed to reproduce our results (we're planning to add true Polyak averaging to the trainer at a later point).
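A rough sketch of the averaging call (the exact flag names may vary by version; --checkpoints and --output_path here are assumptions, so check the script's own flags):

    # Sketch: average the last saved checkpoints (flag names are assumptions;
    # the checkpoint names are placeholders taken from $TRAIN_DIR).
    python tensor2tensor/utils/avg_checkpoints.py \
      --checkpoints=$TRAIN_DIR/model.ckpt-248000,$TRAIN_DIR/model.ckpt-249000,$TRAIN_DIR/model.ckpt-250000 \
      --output_path=$TRAIN_DIR/averaged.ckpt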

And then you need to:

(1) Tokenize the newstest reference and the (separated) decodes:

    perl ~/mosesdecoder/scripts/tokenizer/tokenizer.perl -l de < $decodes > $decodes_file.tok

(2) Split on hyphens to be compatible with BLEU scores from other papers, i.e. put compounds in ATAT format (comparable to GNMT, ConvS2S):

    perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $tok_gold_targets > $tok_gold_targets.atat
    perl -ple 's{(\S)-(\S)}{$1 ##AT##-##AT## $2}g' < $decodes_file.target > $decodes_file.atat

(3) Finally, run multi-bleu:

    perl ~/mosesdecoder/scripts/generic/multi-bleu.perl $tok_gold_targets.atat < $decodes_file.atat

The averaging and the tokenization in (1) are especially important; detokenized BLEU is often quite a bit lower than tokenized BLEU.

neverdoubt commented 7 years ago

With a model trained for 300k steps on 4 Titan X GPUs, using the 4th row of the (C) models (d_model = 256, d_k = 32, d_v = 32), I get BLEU 24.2 on newstest2013.{en,de}. The paper reports 24.5, without averaging. Now we are at the same level.

By the way, after averaging the last 20 checkpoints, I got 24.78.

vthorsteinsson commented 7 years ago

Was this with the newest version of T2T, i.e. 1.0.8? The one with the separate underscores? It would be nice to get confirmation that those don't necessarily hurt model performance (and may even make it better ;-) ).

neverdoubt commented 7 years ago

@vthorsteinsson I used 1.0.7 for training (which has the separate '_' issue), but my training data was created with an earlier version (maybe 1.0.2 or 1.0.4).

lukaszkaiser commented 7 years ago

I added utils/get_ende_bleu.sh in 1.0.9. It's a script that includes the commands we used to get BLEU in the paper. You might need to fix the paths to MOSES and to the tokenized newstest2013 there.

Please: could you average your checkpoints with utils/avg_checkpoints.py, then run utils/get_ende_bleu.sh and report back the results? Just to make sure where your models really stand compared to our results, even despite possible tokenization differences. Thanks!

zxw866 commented 7 years ago

When using the BPE training set, I got 24.64 on newstest2013. It's close to the results in the paper. Next I will try utils/get_ende_bleu.sh. Thanks!

lukaszkaiser commented 7 years ago

Just remember that wmt_ende_bpe32k is already tokenized, so instead of the tokenizer call in the get_ende_bleu.sh script, do this:

    perl -ple 's{@@ }{}g' > $decodes_file.target

Also, did you average checkpoints? Let us know what numbers you get!

zxw866 commented 7 years ago

transformer_base hparams: 110k steps on 1 Titan Xp, then 140k steps on 8 Titan Xp. I averaged 7 checkpoints and removed '@@' using "sed -r 's/(@@ )|(@@ ?$)//g'". Then I got 24.64 on newstest2013. I'm guessing the learning rate decay was affected in my experiment. Next I plan to run the big model. I really appreciate your help!

lukaszkaiser commented 7 years ago

That looks reasonable. Did you run get_ende_bleu.sh, especially the "atat" part? That can be worth 0.2 or 0.3 BLEU if you forget it.

tobyyouup commented 7 years ago

Hi @lukaszkaiser, I have read the discussion above and found that the BLEU-calculation details in https://github.com/tensorflow/tensor2tensor/issues/44 are different. So I want to confirm a few things:

What format is needed for the decoding file (--decode_from_file=$DECODE_FILE, e.g. newstest2013 or newstest2014)? Do I need to tokenize and put compounds in ATAT format before feeding it into the decoding process? Do I also need to apply BPE? Or should I just use the raw text sentences?

lukaszkaiser commented 7 years ago

The input file should be detokenized, pure text. (Except if you do BPE, but I suggest trying without it.)
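For example, decoding from a plain-text newstest file would look roughly like this (a sketch; apart from --decode_from_file, which you already use, the other flag names are assumptions based on the walkthrough):

    # Decoding from a plain, detokenized text file (sketch; flag names other
    # than --decode_from_file are assumptions, and paths are placeholders).
    t2t-trainer \
      --data_dir=$HOME/t2t_data \
      --problems=wmt_ende_tokens_32k \
      --model=transformer \
      --hparams_set=transformer_base \
      --output_dir=$TRAIN_DIR \
      --train_steps=0 \
      --eval_steps=0 \
      --decode_from_file=newstest2013.en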

tobyyouup commented 7 years ago

@lukaszkaiser When I feed pure text for decoding and do checkpoint averaging, I get a BLEU score of 26 with the base configuration model.

lukaszkaiser commented 7 years ago

That sounds reasonable. I'm closing this issue for now as it's gotten long and tokenization changed in 1.0.11. I hope things are ok now, but please either re-open or make a new issue if you see the problems again!