Open Imposingapple opened 3 years ago
Have you initialized the model with pre-trained parameters from MASS?
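I saw the call in s2s_model.py, and the model has indeed been initialized with the downloaded weights. Are there any other possible reasons why my results are not as good as those in your paper? Please advise, thanks!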
I downloaded the MASS parameters from your link, and at the beginning of training the perplexity is on the order of 10^6. Does this mean that the initialization function in s2s_model.py fails to initialize the parameters? I did see the code in s2s_model.py that initializes the model parameters.
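As a rough sanity check on that number (a sketch, not code from this repo): a model that predicts uniformly over a vocabulary of V types has perplexity exactly V. With the 30522-type dictionary reported in the training log, a perplexity near 10^6 is well above the uniform baseline, so it would be worse than random guessing over the vocabulary. The helper names below are hypothetical, and the sketch does not check whether the logged perplexity is computed from the label-smoothed or the unsmoothed loss.

```python
import math


def uniform_perplexity(vocab_size: int) -> float:
    """Perplexity of a model assigning probability 1/vocab_size to every token."""
    loss_nats = -math.log(1.0 / vocab_size)  # per-token cross-entropy in nats
    return math.exp(loss_nats)


def is_worse_than_uniform(observed_ppl: float, vocab_size: int) -> bool:
    """True if the observed perplexity exceeds the uniform-guessing baseline."""
    return observed_ppl > uniform_perplexity(vocab_size)


VOCAB_SIZE = 30522  # dictionary size taken from the training log
print(round(uniform_perplexity(VOCAB_SIZE)))
print(is_worse_than_uniform(1e6, VOCAB_SIZE))
```

If the second line prints `True`, the starting point is worse than an untrained uniform predictor, which is consistent with the suspicion that something other than a plain cold start is involved.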
Yes, the initial perplexity shouldn't be that high. Have you noticed any warning messages about model initialization, something like "modules that are not initialized"?
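One way to look for uninitialized modules without relying on a warning is to compare parameter names directly. The following is a minimal sketch in plain Python; `report_uninitialized` and the example key names are hypothetical and only illustrate the idea, they are not taken from s2s_model.py or from the MASS checkpoint.

```python
def report_uninitialized(model_keys, checkpoint_keys):
    """Compare parameter names and return (missing, unexpected).

    missing:    names the model has but the checkpoint does not provide,
                i.e. parameters that would keep their random initialization.
    unexpected: names in the checkpoint that the model never consumes.
    """
    model_keys = set(model_keys)
    checkpoint_keys = set(checkpoint_keys)
    missing = sorted(model_keys - checkpoint_keys)
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# Illustrative names only (assumption): a per-style LayerNorm list in the
# model versus a single LayerNorm in the pretrained checkpoint.
model_names = [
    "decoder.layers.0.self_attn_layer_norm.0.weight",
    "decoder.layers.0.self_attn_layer_norm.1.weight",
    "decoder.layers.0.fc1.weight",
]
checkpoint_names = [
    "decoder.layers.0.self_attn_layer_norm.weight",
    "decoder.layers.0.fc1.weight",
]

missing, unexpected = report_uninitialized(model_names, checkpoint_names)
print("missing:", missing)
print("unexpected:", unexpected)
```

With PyTorch, the two name lists would typically come from `model.state_dict().keys()` and from the state dict stored in the loaded checkpoint. Names that differ only by an index, as in the toy data above, are expected when the loading code maps one pretrained module onto several style-specific copies, so a non-empty result needs to be read against that mapping.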
Here is all the terminal output after running the script 'CUDA_VISIBLE_DEVICES=0 ./train_mix_CNN_NYT_X.sh --style humor'. I haven't seen anything that looks wrong:
args.distributed_init_method: None args: activation_dropout = 0.1 activation_fn = gelu adam_betas = (0.9, 0.98) adam_eps = 1e-08 adaptive_input = False adaptive_input_cutoff = None adaptive_input_factor = 4 adaptive_softmax_cutoff = None adaptive_softmax_dropout = 0 adaptive_softmax_factor = 4 arch = transformer_mix_base attention_dropout = 0.1 best_checkpoint_metric = loss bpe = None bucket_cap_mb = 25 char_embedder_highway_layers = 2 character_embedding_dim = 4 character_filters = [(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)] clip_norm = 0.0 cpu = False criterion = label_smoothed_cross_entropy curriculum = 0 dae_styles = humor data = data/CNN_NYT/processed:data/humor/processed dataset_impl = None ddp_backend = no_c10d decoder_attention_heads = 12 decoder_embed_dim = 768 decoder_ffn_embed_dim = 3072 decoder_langtok = False decoder_layerdrop = 0 decoder_layers = 6 decoder_layers_to_keep = None decoder_normalize_before = False decoder_output_dim = 768 device_id = 0 disable_validation = False distributed_backend = nccl distributed_init_method = None distributed_no_spawn = False distributed_port = -1 distributed_rank = 0 distributed_world_size = 1 divide_decoder_embed_norm = False divide_decoder_encoder_attn_norm = False divide_decoder_encoder_attn_query = True divide_decoder_final_norm = True divide_decoder_self_attn_norm = True divide_decoder_self_attn_query = False dropout = 0.2 empty_cache_freq = 0 encoder_attention_heads = 12 encoder_embed_dim = 768 encoder_ffn_embed_dim = 3072 encoder_langtok = None encoder_layers = 6 fast_stat_sync = False find_unused_parameters = False fix_batches_to_gpus = False fixed_validation_seed = None fp16 = True fp16_init_scale = 128 fp16_scale_tolerance = 0.0 fp16_scale_window = None keep_interval_updates = -1 keep_last_epochs = -1 label_smoothing = 0.1 lambda_denoising_config = 0.5 lambda_parallel_config = 0.5 lang_pairs = src-tgt layernorm_embedding = True lazy_load = False left_pad_source = True 
left_pad_target = False load_alignments = False load_from_pretrained_model = pretrained_model/MASS/mass-base-uncased.pt log_format = None log_interval = 1000 lr = [0.0005] lr_scheduler = inverse_sqrt max_epoch = 6 max_sentences = None max_sentences_valid = None max_source_positions = 512 max_target_positions = 512 max_tokens = 3072 max_tokens_valid = 3072 max_update = 0 max_word_shuffle_distance = 5.0 maximize_best_checkpoint_metric = False memory_efficient_fp16 = False min_loss_scale = 0.0001 min_lr = 1e-09 model_lang_pairs = ['src-tgt', 'humor-humor'] no_decoder_final_norm = True no_epoch_checkpoints = False no_last_checkpoints = False no_progress_bar = False no_save = False no_save_optimizer_state = False no_scale_embedding = False num_workers = 1 optimizer = adam optimizer_overrides = {} raw_text = False required_batch_size_multiple = 8 reset_dataloader = False reset_lr_scheduler = False reset_meters = False reset_optimizer = False restore_file = checkpoint_last.pt save_dir = tmp/humor save_interval = 1 save_interval_updates = 0 seed = 1 sentence_avg = False share_all_embeddings = True share_decoder_input_output_embed = True skip_invalid_size_inputs_valid_test = True source_lang = None target_lang = None task = translation_mix tensorboard_logdir = threshold_loss_scale = None tie_adaptive_proj = False tie_adaptive_weights = False tokenizer = None train_subset = train truncate_source = False update_freq = [4] upsample_primary = 1 use_bmuf = False user_dir = mass valid_subset = valid validate_interval = 1 warmup_init_lr = 1e-07 warmup_updates = 4000 weight_decay = 0.0 word_blanking_prob = 0.2 word_dropout_prob = 0.2 | [src] dictionary: 30522 types | [tgt] dictionary: 30522 types | [humor] dictionary: 30522 types | loaded 3000 examples from: data/CNN_NYT/processed/valid.src-tgt.src | loaded 3000 examples from: data/CNN_NYT/processed/valid.src-tgt.tgt | data/CNN_NYT/processed valid src-tgt 3000 examples | loaded 9651 examples from: 
data/humor/processed/valid.humor-None.humor | loaded 9651 examples from: data/humor/processed/valid.humor-None.humor | denoising-humor: data/humor/processed valid 9651 examples loading pretrained model from: pretrained_model/MASS/mass-base-uncased.pt TransformerMixModel( (encoder): TransformerEncoder( (embed_tokens): Embedding(30522, 768, padding_idx=0) (embed_positions): LearnedPositionalEmbedding(513, 768, padding_idx=0) (layers): ModuleList( (0): TransformerEncoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (1): TransformerEncoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (2): TransformerEncoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): LayerNorm((768,), 
eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (3): TransformerEncoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (4): TransformerEncoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (5): TransformerEncoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): LayerNorm((768,), eps=1e-05, 
elementwise_affine=True) ) ) (emb_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (decoder): TransformerMixDecoder( (embed_tokens): Embedding(30522, 768, padding_idx=0) (embed_positions): LearnedPositionalEmbedding(513, 768, padding_idx=0) (layers): ModuleList( (0): TransformerMixDecoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (encoder_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): ModuleList( (0): Linear(in_features=768, out_features=768, bias=True) (1): Linear(in_features=768, out_features=768, bias=True) ) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) (1): TransformerMixDecoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (encoder_attn): 
MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): ModuleList( (0): Linear(in_features=768, out_features=768, bias=True) (1): Linear(in_features=768, out_features=768, bias=True) ) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) (2): TransformerMixDecoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (encoder_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): ModuleList( (0): Linear(in_features=768, out_features=768, bias=True) (1): Linear(in_features=768, out_features=768, bias=True) ) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) (3): TransformerMixDecoderLayer( (self_attn): MultiheadAttention( (k_proj): 
Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (encoder_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): ModuleList( (0): Linear(in_features=768, out_features=768, bias=True) (1): Linear(in_features=768, out_features=768, bias=True) ) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) (4): TransformerMixDecoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (encoder_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): ModuleList( (0): Linear(in_features=768, out_features=768, bias=True) (1): Linear(in_features=768, out_features=768, bias=True) ) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) 
(encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) (5): TransformerMixDecoderLayer( (self_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): Linear(in_features=768, out_features=768, bias=True) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (self_attn_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) (encoder_attn): MultiheadAttention( (k_proj): Linear(in_features=768, out_features=768, bias=True) (v_proj): Linear(in_features=768, out_features=768, bias=True) (q_proj): ModuleList( (0): Linear(in_features=768, out_features=768, bias=True) (1): Linear(in_features=768, out_features=768, bias=True) ) (out_proj): Linear(in_features=768, out_features=768, bias=True) ) (encoder_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (fc1): Linear(in_features=768, out_features=3072, bias=True) (fc2): Linear(in_features=3072, out_features=768, bias=True) (final_layer_norm): ModuleList( (0): LayerNorm((768,), eps=1e-05, elementwise_affine=True) (1): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) ) (emb_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True) ) ) | model transformer_mix_base, criterion LabelSmoothedCrossEntropyCriterion | num. model params: 127031808 (num. 
trained: 127031808) | training on 1 GPUs | max tokens per GPU = 3072 and max sentences per GPU = None | no existing checkpoint found tmp/humor/checkpoint_last.pt | loading train data for epoch 0 | loaded 141135 examples from: data/CNN_NYT/processed/train.src-tgt.src | loaded 141135 examples from: data/CNN_NYT/processed/train.src-tgt.tgt | data/CNN_NYT/processed train src-tgt 141135 examples | loaded 480505 examples from: data/humor/processed/train.humor-None.humor | loaded 480505 examples from: data/humor/processed/train.humor-None.humor | denoising-humor: data/humor/processed train 480505 examples | WARNING: 10 samples have invalid sizes and will be skipped, max_positions=OrderedDict([('src-tgt', (512, 512)), ('humor-humor', (512, 512))]), first few sample ids=[141134, 282269, 423404, 480498, 480499, 480500, 480501, 480502, 480503, 480504] | epoch 001: 0%| | 0/3933 [00:00<?, ?it/s]| WARNING: overflow detected, setting loss scale to: 64.0 | epoch 001: 0%| | 1/3933 [00:01<1:46:55, 1.63s/it]| WARNING: overflow detected, setting loss scale to: 32.0 | epoch 001: 0%| | 2/3933 [00:02<1:06:20, 1.01s/it]| WARNING: overflow detected, setting loss scale to: 16.0 | epoch 001: 0%| | 3/3933 [00:02<56:02, 1.17it/s]| WARNING: overflow detected, setting loss scale to: 8.0 | epoch 001: 0%| | 4/3933 [00:03<51:01, 1.28it/s]
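The overflow warnings at the end of the log come from dynamic loss scaling in fp16 training: the scale starts at fp16_init_scale = 128 and is halved after each overflowing step (64, 32, 16, 8). The toy class below reproduces only that halving behaviour as a sketch. It is not fairseq's implementation, the factor of 2 is inferred from the log, and `ToyLossScaler` is a hypothetical name.

```python
class ToyLossScaler:
    """Toy dynamic loss scaler: halve the scale whenever a step overflows."""

    def __init__(self, init_scale: float = 128.0, min_scale: float = 1e-4):
        self.scale = init_scale      # corresponds to fp16_init_scale in the log
        self.min_scale = min_scale   # corresponds to min_loss_scale in the log

    def on_overflow(self) -> float:
        """Halve the scale after an overflow and return the new value."""
        self.scale /= 2.0
        if self.scale < self.min_scale:
            raise FloatingPointError("loss scale fell below the minimum")
        return self.scale


scaler = ToyLossScaler()
scales = [scaler.on_overflow() for _ in range(4)]
print(scales)
```

A few overflows in the first updates are common in fp16 training and probably do not explain the score gap on their own. A scale that keeps shrinking toward min_loss_scale would be a stronger hint that the gradients or the loss are abnormally large.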
Have you run "evaluate_mix_CNN_NYT_X.sh"? You should use this file for final evaluation.
Yes, of course I did. The first screenshot in this issue is the result of running 'evaluate_mix_CNN_NYT_X.sh' for humor. The hypothesis file being evaluated is already detokenized (sentences of plain English words, with no BPE marks).
Dear author, I'm sorry to bother you again. I still cannot figure out why there is a discrepancy between my result and the paper's. I am sure I ran the exact scripts and used the exact pretrained parameters you provide. Have you found anything wrong with the training or evaluation scripts? Have you tried running this version of the code, or noticed any mistakes in my settings in the screenshot above? Thank you again for answering!
After running the data preprocessing, training, and evaluation for humorous headline generation provided by this repo, I get the following result. The BLEU score is 9.0, much lower than the 13.3 reported in your paper. Is there anything wrong?
The BLEU score in my result is 9.79. Have you ever resolved this discrepancy?