🐛 Bug

Hey, the checkpoint suffix is the number of the last optimization step, not the last global step (I'm training with gradient accumulation steps).

Information

The problem arises when using:

[ ] the official example scripts: language_modeling.py

The task I am working on is:

[ ] an official GLUE/SQuAD task: language_modeling.py

To reproduce

Steps to reproduce the behavior:

1. Run a model with the language_modeling.py script and gradient accumulation steps > 1.
2. Save a checkpoint after x > 0 steps and exit.
3. Try to continue training. It continues from the last optimization step rather than the last global step.

roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'tmlm_roberta_output/checkpoint-480' is a path, a model identifier, or url to a directory containing tokenizer files.
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   Didn't find file tmlm_roberta_output/checkpoint-480/added_tokens.json. We won't load it.
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file tmlm_roberta_output/checkpoint-480/vocab.json
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file tmlm_roberta_output/checkpoint-480/merges.txt
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file None
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file tmlm_roberta_output/checkpoint-480/special_tokens_map.json
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file tmlm_roberta_output/checkpoint-480/tokenizer_config.json
02/26/2020 07:39:49 - INFO - transformers.modeling_utils -   loading weights file tmlm_roberta_output/checkpoint-480/pytorch_model.bin
init tud head
02/26/2020 07:40:08 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=512, cache_dir=None, config_name=None, device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='/specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/wiki.test.raw.time_filter.normalized', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=64, learning_rate=5e-05, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path='tmlm_roberta_output/checkpoint-480', model_type='roberta', n_gpu=4, no_cuda=False, num_train_epochs=1.0, output_dir='tmlm_roberta_output', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=1, save_steps=80, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=True, tokenizer_name=None, train_data_file='/specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/wiki.train.raw.time_filter.normalized', warmup_steps=0, weight_decay=0.0)
02/26/2020 07:40:08 - INFO - __main__ -   Loading features from cached file /specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/roberta_cached_lm_510_wiki.train.raw.time_filter.normalized
02/26/2020 07:40:16 - INFO - __main__ -   ***** Running training *****
02/26/2020 07:40:16 - INFO - __main__ -     Num examples = 163046
02/26/2020 07:40:16 - INFO - __main__ -     Num Epochs = 1
02/26/2020 07:40:16 - INFO - __main__ -     Instantaneous batch size per GPU = 1
02/26/2020 07:40:16 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 256
02/26/2020 07:40:16 - INFO - __main__ -     Gradient Accumulation steps = 64
02/26/2020 07:40:16 - INFO - __main__ -     Total optimization steps = 636
02/26/2020 07:40:16 - INFO - __main__ -     Continuing training from checkpoint, will skip to saved global_step
02/26/2020 07:40:16 - INFO - __main__ -     Continuing training from epoch 0
02/26/2020 07:40:16 - INFO - __main__ -     Continuing training from global step 480
02/26/2020 07:40:16 - INFO - __main__ -     Will skip the first 480 steps in the first epoch

Expected behavior

I expect training to resume from the last global step, i.e. optimization steps * gradient accumulation steps. Note that the number of optimization steps == the checkpoint suffix.
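As a sanity check on the numbers in the log above, here is a minimal sketch of the arithmetic. The values are copied from the log; the variable names `batches_per_epoch` and `batches_consumed` are my own, and I am assuming the dataloader keeps the final partial batch.

```python
import math

# Values copied from the training log above.
num_examples = 163046
per_gpu_train_batch_size = 1
n_gpu = 4
gradient_accumulation_steps = 64
checkpoint_suffix = 480  # from 'checkpoint-480'

# Assumption: the final partial batch is kept, so the length rounds up.
batch_size = per_gpu_train_batch_size * n_gpu
batches_per_epoch = math.ceil(num_examples / batch_size)

# One optimization step is taken every `gradient_accumulation_steps` batches.
optimization_steps_per_epoch = batches_per_epoch // gradient_accumulation_steps

# Batches already consumed when checkpoint-480 was written.
batches_consumed = checkpoint_suffix * gradient_accumulation_steps

print(batches_per_epoch)             # 40762
print(optimization_steps_per_epoch)  # 636, matches "Total optimization steps = 636"
print(batches_consumed)              # 30720
```

So the log line "Will skip the first 480 steps in the first epoch" skips 480 batches, while 30720 batches had already been processed.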

I made the following changes and it seems to work ok:

Former code:

```
epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
```

New code:

```
global_step = int(checkpoint_suffix) * args.gradient_accumulation_steps
epochs_trained = global_step // len(train_dataloader)
steps_trained_in_current_epoch = global_step % (len(train_dataloader))
```
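To make the difference concrete, here is a standalone sketch that applies both versions to the run from the log. The function names are hypothetical, the former version assumes `global_step` was taken directly from the checkpoint suffix, and `BATCHES_PER_EPOCH` stands in for `len(train_dataloader)` under the assumption that the last partial batch is kept. I only checked the single-epoch case from the log.

```python
def resume_position_former(checkpoint_suffix, batches_per_epoch, accumulation_steps):
    """Former logic: the result is counted in optimization steps."""
    global_step = int(checkpoint_suffix)
    steps_per_epoch = batches_per_epoch // accumulation_steps
    return global_step // steps_per_epoch, global_step % steps_per_epoch


def resume_position_new(checkpoint_suffix, batches_per_epoch, accumulation_steps):
    """Proposed logic: the result is counted in batches."""
    global_step = int(checkpoint_suffix) * accumulation_steps
    return global_step // batches_per_epoch, global_step % batches_per_epoch


# Assumption: ceil(163046 examples / 4 examples per batch) with the last batch kept.
BATCHES_PER_EPOCH = 40762

former = resume_position_former("480", BATCHES_PER_EPOCH, 64)
new = resume_position_new("480", BATCHES_PER_EPOCH, 64)

print(former)  # (0, 480)   -> skips 480 batches
print(new)     # (0, 30720) -> skips 480 * 64 batches
```

Each tuple is `(epochs_trained, steps_trained_in_current_epoch)`. Since the training loop skips one batch per remaining step, only the second value lines up with the data position at which the checkpoint was saved.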

- `transformers` version: latest
- Platform:
- Python version: 3.7
- PyTorch version (GPU?): GPU
- Using GPU in script?: yes

huggingface / transformers

language_modeling.py doesn't continue from last global step #3026
