[Open] vakadanaveen opened this issue 2 years ago
Hi, can anyone help me figure out why the loss is NaN when I run the run_captioning.py script?
python oscar/run_captioning.py \
    --model_name_or_path pretrained_models/base-vg-labels/ep_67_588997/ \
    --do_train \
    --do_lower_case \
    --evaluate_during_training \
    --add_od_labels \
    --learning_rate 0.00001 \
    --per_gpu_train_batch_size 32 \
    --num_train_epochs 30 \
    --save_steps 5000 \
    --output_dir output/
2021-12-02 15:26:48,909 vlpretrain WARNING: Device: cuda, n_gpu: 1
2021-12-02 15:27:07,873 vlpretrain INFO: Training/evaluation parameters Namespace(adam_epsilon=1e-08, add_od_labels=True, cider_cached_tokens='coco-train-words.p', config_name='', data_dir='/lfs/usrhome/ms/cs20s012/scratch/dataset/coco_caption', device=device(type='cuda'), distributed=False, do_eval=False, do_lower_case=True, do_test=False, do_train=True, drop_out=0.1, drop_worst_after=0, drop_worst_ratio=0, eval_model_dir='', evaluate_during_training=True, freeze_embedding=True, gradient_accumulation_steps=1, img_feature_dim=2054, img_feature_type='frcnn', label_smoothing=0, learning_rate=3e-05, length_penalty=1, local_rank=0, logging_steps=20, loss_type='sfmx', mask_prob=0.15, max_gen_length=20, max_grad_norm=1.0, max_img_seq_length=50, max_masked_tokens=3, max_seq_a_length=40, max_seq_length=70, max_steps=-1, min_constraints_to_satisfy=2, model_name_or_path='/lfs/usrhome/ms/cs20s012/scratch/models/base-vg-labels/ep_67_588997/', no_cuda=False, num_beams=5, num_gpus=1, num_keep_best=1, num_labels=2, num_return_sequences=1, num_train_epochs=1, num_workers=4, output_dir='output/', output_hidden_states=False, output_mode='classification', per_gpu_eval_batch_size=64, per_gpu_train_batch_size=64, repetition_penalty=1, save_steps=-1, sc_baseline_type='greedy', sc_beam_size=1, sc_train_sample_n=5, scheduler='linear', scst=False, seed=88, temperature=1, test_yaml='test.yaml', tie_weights=True, tokenizer_name='', top_k=0, top_p=1, train_yaml='train.yaml', use_cbs=False, val_yaml='val.yaml', warmup_steps=0, weight_decay=0.05)
2021-12-02 15:27:12,156 vlpretrain INFO: Train with 64 images per GPU.
2021-12-02 15:27:12,157 vlpretrain INFO: Total batch size 64
2021-12-02 15:27:12,158 vlpretrain INFO: Total training steps 8855
2021-12-02 15:27:12,744 vlpretrain INFO: Running training
2021-12-02 15:27:12,745 vlpretrain INFO: Num Epochs = 1
2021-12-02 15:27:12,746 vlpretrain INFO: Batch size per GPU = 64
2021-12-02 15:27:12,746 vlpretrain INFO: Total train batch size (w. parallel, & accumulation) = 64
2021-12-02 15:27:12,747 vlpretrain INFO: Gradient Accumulation steps = 1
2021-12-02 15:27:12,747 vlpretrain INFO: Total optimization steps = 8856
2021-12-02 15:27:25,829 vlpretrain INFO: Epoch: 0, global_step: 20, lr: 0.000030, loss: nan (nan), score: 0.0000 (0.0051)
2021-12-02 15:27:35,468 vlpretrain INFO: Epoch: 0, global_step: 40, lr: 0.000030, loss: nan (nan), score: 0.0000 (0.0026)
2021-12-02 15:27:45,042 vlpretrain INFO: Epoch: 0, global_step: 60, lr: 0.000030, loss: nan (nan), score: 0.0000 (0.0017)
2021-12-02 15:27:54,374 vlpretrain INFO: Epoch: 0, global_step: 80, lr: 0.000030, loss: nan (nan), score: 0.0000 (0.0013)
2021-12-02 15:28:03,717 vlpretrain INFO: Epoch: 0, global_step: 100, lr: 0.000030, loss: nan (nan), score: 0.0000 (0.0010)
2021-12-02 15:28:13,088 vlpretrain INFO: Epoch: 0, global_step: 120, lr: 0.000030, loss: nan (nan), score: 0.0000 (0.0009)
2021-12-02 15:28:22,432 vlpretrain INFO: Epoch: 0, global_step: 140, lr: 0.000030, loss: nan (nan), score: 0.0000 (0.0007)
2021-12-02 15:28:31,760 vlpretrain INFO: Epoch: 0, global_step: 160, lr: 0.000029, loss: nan (nan), score: 0.0000 (0.0006)
Please help me.
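In case it helps narrow this down, here is a minimal sketch of the kind of check one could add to find where the NaNs first appear. This is not part of run_captioning.py; the `check_finite` helper and the commented call sites are illustrative, assuming the standard PyTorch training loop used by the repo.

```python
import torch

# Anomaly detection makes the backward pass report the first op that
# produced NaN/Inf (slows training, so use only while debugging).
torch.autograd.set_detect_anomaly(True)

def check_finite(name, tensor):
    """Print a warning if a tensor contains NaN or Inf values."""
    if not torch.isfinite(tensor).all():
        print(f"[nan-check] {name}: "
              f"nan={torch.isnan(tensor).any().item()}, "
              f"inf={torch.isinf(tensor).any().item()}")

# Inside the training loop (names are illustrative), check the image
# features and the loss before calling backward():
# check_finite("img_feats", batch["img_feats"])
# check_finite("loss", loss)
```

If the image features already contain NaN/Inf before the forward pass, the problem is in the feature files or preprocessing rather than in training itself.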