sIncerass / powernorm

[ICML 2020] code for "PowerNorm: Rethinking Batch Normalization in Transformers" https://arxiv.org/abs/2003.07845
GNU General Public License v3.0

Cannot reproduce the results on IWSLT14. #7

Closed: ghost closed this issue 3 years ago

ghost commented 3 years ago

Hi, I ran your code with different settings but got unexpected results: the model with PN performs worse than the model with LN (35.27 vs. 35.44 BLEU on the IWSLT14 de-en test set). The full results are shown below.

Transformer with LN:

Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt
| Translated 6750 sentences (148676 tokens) in 105.6s (63.91 sentences/s, 1407.62 tokens/s)
| Generate test with beam=5: BLEU4 = 35.44, 69.6/44.1/30.0/20.7 (BP=0.954, ratio=0.955, syslen=125196, reflen=131156)

Transformer with PN:

Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt
| Translated 6750 sentences (148074 tokens) in 122.0s (55.34 sentences/s, 1214.00 tokens/s)
| Generate test with beam=5: BLEU4 = 35.27, 69.6/44.0/29.8/20.6 (BP=0.953, ratio=0.954, syslen=125107, reflen=131156)
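For reference, both runs were decoded with essentially the same command, reconstructed here from the Namespace dumps above (a sketch of my invocation; only the checkpoint passed to --path differs between the LN and PN runs):

```bash
# Decoding command implied by the Namespace dump above;
# swap --path to the *_layer_layer_layer_layer_warm checkpoint for the LN run.
python generate.py data-bin/iwslt14.tokenized.de-en.joined/ \
    --path log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt \
    --beam 5 --max-sentences 128 --remove-bpe --quiet
```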

Looking forward to your reply.

sIncerass commented 3 years ago

Hi there, sorry for the late reply. Can you elaborate on the specific environment you are running the scripts in, and on the number of training epochs?
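For example, the output of something along these lines would help pin down any version mismatch (a minimal sketch, assuming a CUDA setup and the fairseq bundled with this repo):

```bash
# Report Python-side versions (torch, its CUDA build, fairseq) plus GPU and driver info.
python -c "import torch, fairseq; print('torch', torch.__version__, 'cuda', torch.version.cuda, 'fairseq', fairseq.__version__)"
nvidia-smi --query-gpu=name,driver_version --format=csv
```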