teslacool / SCA

Soft Contextual Data Augmentation

*inf* value for ppl #4

Closed: nicolabertoldi closed this issue 5 years ago

nicolabertoldi commented 5 years ago

Below is the last part of the log from one of my language model training runs.

Why does the ppl report an inf value?

| epoch 043 | loss 2837.172 | ppl inf | wps 22040 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 39302 | lr 2.5e-05 | gnorm 17885.347 | clip 1.000 | oom 0.000 | wall 9326 | train_wall 8854
| epoch 043 | valid on 'valid' subset | loss 2322.051 | ppl inf | num_updates 39302 | best_loss 2322.05
| epoch 044 | loss 2819.246 | ppl inf | wps 22042 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 40216 | lr 2.5e-05 | gnorm 19552.845 | clip 1.000 | oom 0.000 | wall 9543 | train_wall 9060
| epoch 044 | valid on 'valid' subset | loss 2272.617 | ppl inf | num_updates 40216 | best_loss 2272.62
| epoch 045 | loss 2802.761 | ppl inf | wps 22039 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 41130 | lr 2.5e-05 | gnorm 354250.108 | clip 1.000 | oom 0.000 | wall 9761 | train_wall 9266
| epoch 045 | valid on 'valid' subset | loss 2269.807 | ppl inf | num_updates 41130 | best_loss 2269.81
| epoch 046 | loss 2782.943 | ppl inf | wps 22041 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 42044 | lr 2.5e-05 | gnorm 30559.840 | clip 1.000 | oom 0.000 | wall 9978 | train_wall 9472
| epoch 046 | valid on 'valid' subset | loss 2250.028 | ppl inf | num_updates 42044 | best_loss 2250.03
| epoch 047 | loss 2769.120 | ppl inf | wps 22042 | ups 4 | wpb 5116.162 | bsz 4.997 | num_updates 42958 | lr 2.5e-05 | gnorm 18006.640 | clip 1.000 | oom 0.000 | wall 10196 | train_wall 9678
| epoch 047 | valid on 'valid' subset | loss 2268.014 | ppl inf | num_updates 42958 | best_loss 2250.03
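The inf itself is most likely just floating-point overflow rather than a separate bug. A minimal sketch, assuming fairseq derives the logged ppl as 2 raised to the base-2 cross-entropy loss shown in the loss column; that power exceeds the largest 64-bit float once the loss goes much past ~1024:

```python
import math

# Hedged sketch (assumption, not taken from the thread): the reported ppl is
# 2 ** loss, where loss is the base-2 cross-entropy from the log columns above.
def perplexity(loss_base2: float) -> float:
    try:
        return math.pow(2, loss_base2)
    except OverflowError:
        # 2 ** x overflows a 64-bit float once x goes past ~1024
        return float("inf")

print(perplexity(2.0))       # 4.0  -- a typical, healthy per-token loss
print(perplexity(2837.172))  # inf  -- the epoch 043 training loss above
```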
teslacool commented 5 years ago

I suspect there is a wrong hyper-parameter in your LM training setup.

nicolabertoldi commented 5 years ago

@teslacool

Do you have any idea whether my settings (see below) are somehow wrong?

Namespace(adaptive_input=False, adaptive_input_cutoff=None, adaptive_input_factor=4, adaptive_softmax_cutoff=None, adaptive_softmax_dropout=0, adaptive_softmax_factor=4, arch='transformer_lm', attention_dropout=0.0, bucket_cap_mb=25, char_embedder_highway_layers=2, character_embedding_dim=4, character_embeddings=False, character_filters='[(1, 64), (2, 128), (3, 192), (4, 256), (5, 256), (6, 256), (7, 256)]', clip_norm=25, cpu=False, criterion='cross_entropy', curriculum=0, data='/data/workspace/SoftContextualDataAugmentation/experiments/data_generated_sl', ddp_backend='c10d', decoder_attention_heads=8, decoder_embed_dim=512, decoder_ffn_embed_dim=2048, decoder_input_dim=512, decoder_layers=6, decoder_learned_pos=False, decoder_normalize_before=True, decoder_output_dim=512, device_id=0, distributed_backend='nccl', distributed_init_method=None, distributed_port=-1, distributed_rank=0, distributed_world_size=1, dropout=0.1, fix_batches_to_gpus=False, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, future_target=False, keep_interval_updates=-1, keep_last_epochs=-1, lazy_load=False, log_format=None, log_interval=1000, lr=[0.25], lr_scheduler='reduce_lr_on_plateau', lr_shrink=0.1, max_epoch=0, max_sentences=None, max_sentences_valid=None, max_tokens=6000, max_update=0, memory_efficient_fp16=False, min_loss_scale=0.0001, min_lr=1e-05, momentum=0.99, no_epoch_checkpoints=False, no_progress_bar=False, no_save=False, no_token_positional_embeddings=False, num_workers=0, optimizer='nag', optimizer_overrides='{}', output_dictionary_size=-1, past_target=False, raw_text=False, relu_dropout=0.0, required_batch_size_multiple=8, reset_lr_scheduler=False, reset_optimizer=False, restore_file='checkpoint_last.pt', sample_break_mode=None, save_dir='/data/workspace/SoftContextualDataAugmentation/experiments/lm_sl', save_interval=1, save_interval_updates=0, seed=1, self_target=False, sentence_avg=False, share_decoder_input_output_embed=False, skip_invalid_size_inputs_valid_test=False, task='language_modeling', tensorboard_logdir='', threshold_loss_scale=None, tie_adaptive_proj=False, tie_adaptive_weights=False, tokens_per_sample=1024, train_subset='train', update_freq=[1], user_dir=None, valid_subset='valid', validate_interval=1, weight_decay=0.0)
| dictionary: 32456 types
| /data/workspace/SoftContextualDataAugmentation/experiments/data_generated_sl train 4567 examples
| /data/workspace/SoftContextualDataAugmentation/experiments/data_generated_sl valid 47 examples
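As a rough sanity check on those numbers (a sketch under my own assumptions, assuming the logged loss is base-2 cross-entropy per token): with a 32,456-type dictionary, even a uniform model would only have a loss of log2(32456) ≈ 15 bits, so per-token losses in the thousands together with gnorm values in the tens of thousands look like the optimizer diverging (e.g. lr=0.25 with nag being too aggressive for this transformer_lm setup) rather than a perplexity-reporting problem.

```python
import math

# Hedged back-of-the-envelope check, assuming the logged loss is base-2
# cross-entropy per token over the dictionary reported above.
vocab_size = 32456                    # "| dictionary: 32456 types"
uniform_loss = math.log2(vocab_size)  # ~14.99 bits: loss of a uniform model
print(uniform_loss)

# The logged training loss (~2837 bits/token) is roughly 190x worse than
# guessing uniformly, which points to a diverged run (consistent with the
# gnorm values of ~1.8e4 to 3.5e5), not to a logging bug.
print(2837.172 / uniform_loss)
```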
teslacool commented 5 years ago

Sorry, I do not have much experience with LM training. Using the same hyper-parameters as for NMT caused no problems in my experiments.