🐛 Bug

Hey, the checkpoint's suffix is the last optimization step rather than the last global step (I'm training with gradient accumulation steps).
Information
The problem arises when using:
[x] the official example scripts: language_modeling.py
The task I am working on is:
[x] an official GLUE/SQuAD task: language_modeling.py
To reproduce
Steps to reproduce the behavior:
1. Run the language_modeling.py script with a gradient accumulation steps value greater than 1 (an example command is sketched below).
2. Save a checkpoint after some number of steps and exit.
3. Try to continue training; it resumes from the last optimization step rather than the last global step.
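For reference, something along these lines reproduces it (a sketch only: the starting model, data paths, and output directory are placeholders; the flags mirror the Namespace logged below):

```bash
# first run: train with gradient accumulation, let it save a checkpoint, then stop
python language_modeling.py \
    --model_type roberta --model_name_or_path roberta-base \
    --do_train --mlm --train_data_file /path/to/train.raw \
    --output_dir tmlm_roberta_output \
    --per_gpu_train_batch_size 1 --gradient_accumulation_steps 64 --save_steps 80

# second run: resume from the saved checkpoint (e.g. checkpoint-480)
python language_modeling.py \
    --model_type roberta --model_name_or_path tmlm_roberta_output/checkpoint-480 \
    --do_train --mlm --train_data_file /path/to/train.raw \
    --output_dir tmlm_roberta_output --overwrite_output_dir --should_continue \
    --per_gpu_train_batch_size 1 --gradient_accumulation_steps 64 --save_steps 80
```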
roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'tmlm_roberta_output/checkpoint-480' is a path, a model identifier, or url to a directory containing tokenizer files.
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils - Didn't find file tmlm_roberta_output/checkpoint-480/added_tokens.json. We won't load it.
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils - loading file tmlm_roberta_output/checkpoint-480/vocab.json
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils - loading file tmlm_roberta_output/checkpoint-480/merges.txt
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils - loading file None
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils - loading file tmlm_roberta_output/checkpoint-480/special_tokens_map.json
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils - loading file tmlm_roberta_output/checkpoint-480/tokenizer_config.json
02/26/2020 07:39:49 - INFO - transformers.modeling_utils - loading weights file tmlm_roberta_output/checkpoint-480/pytorch_model.bin
init tud head
02/26/2020 07:40:08 - INFO - __main__ - Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=512, cache_dir=None, config_name=None, device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='/specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/wiki.test.raw.time_filter.normalized', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=64, learning_rate=5e-05, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path='tmlm_roberta_output/checkpoint-480', model_type='roberta', n_gpu=4, no_cuda=False, num_train_epochs=1.0, output_dir='tmlm_roberta_output', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=1, save_steps=80, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=True, tokenizer_name=None, train_data_file='/specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/wiki.train.raw.time_filter.normalized', warmup_steps=0, weight_decay=0.0)
02/26/2020 07:40:08 - INFO - __main__ - Loading features from cached file /specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/roberta_cached_lm_510_wiki.train.raw.time_filter.normalized
02/26/2020 07:40:16 - INFO - __main__ - ***** Running training *****
02/26/2020 07:40:16 - INFO - __main__ - Num examples = 163046
02/26/2020 07:40:16 - INFO - __main__ - Num Epochs = 1
02/26/2020 07:40:16 - INFO - __main__ - Instantaneous batch size per GPU = 1
02/26/2020 07:40:16 - INFO - __main__ - Total train batch size (w. parallel, distributed & accumulation) = 256
02/26/2020 07:40:16 - INFO - __main__ - Gradient Accumulation steps = 64
02/26/2020 07:40:16 - INFO - __main__ - Total optimization steps = 636
02/26/2020 07:40:16 - INFO - __main__ - Continuing training from checkpoint, will skip to saved global_step
02/26/2020 07:40:16 - INFO - __main__ - Continuing training from epoch 0
02/26/2020 07:40:16 - INFO - __main__ - Continuing training from global step 480
02/26/2020 07:40:16 - INFO - __main__ - Will skip the first 480 steps in the first epoch
Expected behavior
I expect it to resume from the last global step, i.e. optimization steps * gradient accumulation steps. Note that the checkpoint suffix equals the number of optimization steps.
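Concretely, with the run above: checkpoint-480 means 480 optimization steps, i.e. 480 * 64 = 30,720 batches already trained on, yet on resume the script only skips the first 480 batches of the epoch.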
I made the following changes and it seems to work ok:
Former code:
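Roughly, the resume logic looks like this (a reconstructed sketch of the relevant lines in train(), not an exact copy, with variable names as I recall them from the script; I plugged in the numbers from the run above so it can be run standalone):

```python
model_name_or_path = "tmlm_roberta_output/checkpoint-480"
gradient_accumulation_steps = 64
# len(train_dataloader): 163046 examples / effective batch size 4 (1 per GPU x 4 GPUs) ~= 40762,
# which matches the logged "Total optimization steps = 636" (40762 // 64 == 636)
dataloader_len = 40762

# global_step is read from the checkpoint suffix, i.e. the number of *optimization* steps
checkpoint_suffix = model_name_or_path.split("-")[-1].split("/")[0]
global_step = int(checkpoint_suffix)
epochs_trained = global_step // (dataloader_len // gradient_accumulation_steps)
steps_trained_in_current_epoch = global_step % (dataloader_len // gradient_accumulation_steps)

# The training loop then skips `steps_trained_in_current_epoch` *batches*, so only 480 batches
# get skipped, although 480 * 64 = 30720 batches were consumed before the checkpoint was saved.
print(epochs_trained, steps_trained_in_current_epoch)  # 0 480
print(global_step * gradient_accumulation_steps)       # 30720
```

The change, roughly, is to multiply steps_trained_in_current_epoch by the gradient accumulation steps before the skip loop, so that batches rather than optimization steps get skipped.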