Hi, I ran your code with different settings but got an unexpected result: the model with PN performs worse than the model with LN. The results are as follows:
Transformer with LN:
Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt
| Translated 6750 sentences (148676 tokens) in 105.6s (63.91 sentences/s, 1407.62 tokens/s)
| Generate test with beam=5: BLEU4 = 35.44, 69.6/44.1/30.0/20.7 (BP=0.954, ratio=0.955, syslen=125196, reflen=131156)
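For reference, both results come from the same decoding setup (only the checkpoint path differs between the two runs); the invocation matching the flags in the Namespace above would be roughly:

    python generate.py data-bin/iwslt14.tokenized.de-en.joined/ \
        --path log/iwslt14_de_en/transformer_iwslt_de_en_v2_layer_layer_layer_layer_warm/averaged_model.pt \
        --batch-size 128 --beam 5 --remove-bpe --quiet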
Transformer with PN:
Namespace(beam=5, bpe=None, cpu=False, criterion='cross_entropy', data='data-bin/iwslt14.tokenized.de-en.joined/', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format='simple', log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=128, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=1, optimizer='nag', path='log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=True, raw_text=False, remove_bpe='@@ ', replace_unk=None, required_batch_size_multiple=8, results_path=None, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang='de', target_lang='en', task='translation', tbmf_wrapper=False, temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, unkpen=0, unnormalized=False, upsample_primary=1, user_dir=None, warmup_updates=0, weight_decay=0.0)
| [de] dictionary: 10152 types
| [en] dictionary: 10152 types
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.de
| loaded 6750 examples from: data-bin/iwslt14.tokenized.de-en.joined/test.de-en.en
| data-bin/iwslt14.tokenized.de-en.joined/ test de-en 6750 examples
| loading model(s) from log/iwslt14_de_en/transformer_iwslt_de_en_v2_power_power_power_power_warm/averaged_model.pt
| Translated 6750 sentences (148074 tokens) in 122.0s (55.34 sentences/s, 1214.00 tokens/s)
| Generate test with beam=5: BLEU4 = 35.27, 69.6/44.0/29.8/20.6 (BP=0.953, ratio=0.954, syslen=125107, reflen=131156)
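For context, by LN I mean the standard nn.LayerNorm and by PN your PowerNorm layer. My understanding of the core difference, as a heavily simplified PyTorch sketch (my own paraphrase, not your implementation, which also uses the running statistic in the training forward pass and a custom backward):

    import torch
    import torch.nn as nn

    class PowerNormSketch(nn.Module):
        # LayerNorm normalizes each token by its own mean/variance over the
        # feature dimension; PowerNorm instead rescales by a running quadratic
        # mean computed across the batch, with no mean subtraction.
        def __init__(self, dim, eps=1e-5, alpha=0.9):
            super().__init__()
            self.weight = nn.Parameter(torch.ones(dim))  # gain, as in LayerNorm
            self.bias = nn.Parameter(torch.zeros(dim))   # shift, as in LayerNorm
            self.register_buffer("running_phi2", torch.ones(dim))
            self.eps = eps
            self.alpha = alpha  # moving-average rate for the running statistic

        def forward(self, x):  # x: (..., dim)
            if self.training:
                # quadratic mean over all non-feature dimensions
                phi2 = x.pow(2).mean(dim=tuple(range(x.dim() - 1)))
                with torch.no_grad():
                    self.running_phi2.mul_(self.alpha).add_((1.0 - self.alpha) * phi2)
            else:
                phi2 = self.running_phi2
            return x / torch.sqrt(phi2 + self.eps) * self.weight + self.bias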
Looking forward to your reply.
Hi there, sorry for the late reply. Could you elaborate on the specific environment you are running the scripts in, and how many epochs you trained for?
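For example, the output of something like the snippet below, plus the number of epochs/updates you trained for and the exact training command, would help us reproduce this:

    import torch, fairseq
    print("torch:", torch.__version__, "| cuda:", torch.version.cuda)
    print("fairseq:", fairseq.__version__)
    if torch.cuda.is_available():
        print("gpu:", torch.cuda.get_device_name(0))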