huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

language_modeling.py doesn't continue from last global step #3026

Closed yuvalkirstain closed 4 years ago

yuvalkirstain commented 4 years ago

🐛 Bug

Hey, the checkpoint's suffix is the last optimization step rather than the last global step, i.e. the number of batches processed (I'm training with gradient accumulation steps).

Information

The problem arises when using:

The tasks I am working on is:

To reproduce

Steps to reproduce the behavior:

  1. Run a model with the language_modeling.py script, with gradient accumulation steps > 1
  2. Save a checkpoint after x > 0 steps and exit
  3. Try to continue training: it continues from the last optimization step rather than the global step
roberta-base-openai-detector, roberta-large-openai-detector). Assuming 'tmlm_roberta_output/checkpoint-480' is a path, a model identifier, or url to a directory containing tokenizer files.
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   Didn't find file tmlm_roberta_output/checkpoint-480/added_tokens.json. We won't load it.
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file tmlm_roberta_output/checkpoint-480/vocab.json
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file tmlm_roberta_output/checkpoint-480/merges.txt
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file None
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file tmlm_roberta_output/checkpoint-480/special_tokens_map.json
02/26/2020 07:39:49 - INFO - transformers.tokenization_utils -   loading file tmlm_roberta_output/checkpoint-480/tokenizer_config.json
02/26/2020 07:39:49 - INFO - transformers.modeling_utils -   loading weights file tmlm_roberta_output/checkpoint-480/pytorch_model.bin
init tud head
02/26/2020 07:40:08 - INFO - __main__ -   Training/evaluation parameters Namespace(adam_epsilon=1e-08, block_size=512, cache_dir=None, config_name=None, device=device(type='cuda'), do_eval=True, do_train=True, eval_all_checkpoints=False, eval_data_file='/specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/wiki.test.raw.time_filter.normalized', evaluate_during_training=False, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=64, learning_rate=5e-05, line_by_line=False, local_rank=-1, logging_steps=500, max_grad_norm=1.0, max_steps=-1, mlm=True, mlm_probability=0.15, model_name_or_path='tmlm_roberta_output/checkpoint-480', model_type='roberta', n_gpu=4, no_cuda=False, num_train_epochs=1.0, output_dir='tmlm_roberta_output', overwrite_cache=False, overwrite_output_dir=True, per_gpu_eval_batch_size=4, per_gpu_train_batch_size=1, save_steps=80, save_total_limit=None, seed=42, server_ip='', server_port='', should_continue=True, tokenizer_name=None, train_data_file='/specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/wiki.train.raw.time_filter.normalized', warmup_steps=0, weight_decay=0.0)
02/26/2020 07:40:08 - INFO - __main__ -   Loading features from cached file /specific/netapp5_2/gamir/advml19/yuvalk/project/transformers/examples/lm_data/roberta_cached_lm_510_wiki.train.raw.time_filter.normalized
02/26/2020 07:40:16 - INFO - __main__ -   ***** Running training *****
02/26/2020 07:40:16 - INFO - __main__ -     Num examples = 163046
02/26/2020 07:40:16 - INFO - __main__ -     Num Epochs = 1
02/26/2020 07:40:16 - INFO - __main__ -     Instantaneous batch size per GPU = 1
02/26/2020 07:40:16 - INFO - __main__ -     Total train batch size (w. parallel, distributed & accumulation) = 256
02/26/2020 07:40:16 - INFO - __main__ -     Gradient Accumulation steps = 64
02/26/2020 07:40:16 - INFO - __main__ -     Total optimization steps = 636
02/26/2020 07:40:16 - INFO - __main__ -     Continuing training from checkpoint, will skip to saved global_step
02/26/2020 07:40:16 - INFO - __main__ -     Continuing training from epoch 0
02/26/2020 07:40:16 - INFO - __main__ -     Continuing training from global step 480
02/26/2020 07:40:16 - INFO - __main__ -     Will skip the first 480 steps in the first epoch
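
The numbers in this log already show the mismatch. Here is a rough sketch of the arithmetic, assuming each dataloader batch holds `per_gpu_train_batch_size * n_gpu` examples and the last partial batch is kept (the variable names are illustrative, not necessarily the script's):

```python
import math

# Values read from the log above.
num_examples = 163046
per_gpu_train_batch_size = 1
n_gpu = 4
gradient_accumulation_steps = 64
checkpoint_suffix = 480   # from 'checkpoint-480'
batches_skipped = 480     # from "Will skip the first 480 steps"

# Assumption: one dataloader batch feeds all GPUs at once.
batches_per_epoch = math.ceil(num_examples / (per_gpu_train_batch_size * n_gpu))

# One optimization step consumes `gradient_accumulation_steps` batches.
total_optimization_steps = batches_per_epoch // gradient_accumulation_steps

# Batches actually consumed before the checkpoint was written.
batches_consumed = checkpoint_suffix * gradient_accumulation_steps

# Batches that get trained on a second time after resuming.
shortfall = batches_consumed - batches_skipped

print(batches_per_epoch, total_optimization_steps, batches_consumed, shortfall)
# prints: 40762 636 30720 30240
```

The 636 matches the "Total optimization steps" line in the log, which suggests the batch-size assumption holds for this run.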

Expected behavior

I expect it to resume from the last global step, i.e. optimization steps * gradient accumulation steps. Note that optimization steps == checkpoint suffix.

I made the following changes and it seems to work ok:

Former code:

```
epochs_trained = global_step // (len(train_dataloader) // args.gradient_accumulation_steps)
steps_trained_in_current_epoch = global_step % (len(train_dataloader) // args.gradient_accumulation_steps)
```

New code:

```
global_step = int(checkpoint_suffix) * args.gradient_accumulation_steps
epochs_trained = global_step // len(train_dataloader)
steps_trained_in_current_epoch = global_step % len(train_dataloader)
```
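
To see the effect of the two computations, here is a minimal sketch that models the resume loop. It assumes the loop skips one batch per remaining `steps_trained_in_current_epoch` and takes an optimizer step whenever `(step + 1)` is a multiple of the accumulation steps. The helper `remaining_optimizer_steps` and the value 40762 (163046 examples at 4 per batch, rounded up) are my own assumptions, not taken from the script:

```python
def remaining_optimizer_steps(batches_per_epoch, accumulation_steps, batches_to_skip):
    """Count optimizer steps taken in one epoch after skipping the first batches."""
    optimizer_steps = 0
    for step in range(batches_per_epoch):
        if batches_to_skip > 0:
            # Batch was already seen before the checkpoint: skip it.
            batches_to_skip -= 1
            continue
        if (step + 1) % accumulation_steps == 0:
            optimizer_steps += 1
    return optimizer_steps


batches_per_epoch = 40762   # assumed length of the train dataloader
accumulation_steps = 64
checkpoint_suffix = 480

# Former computation: the skip count is in optimization steps.
former_skip = checkpoint_suffix % (batches_per_epoch // accumulation_steps)
# Proposed computation: the skip count is in batches.
proposed_skip = (checkpoint_suffix * accumulation_steps) % batches_per_epoch

former_total = checkpoint_suffix + remaining_optimizer_steps(
    batches_per_epoch, accumulation_steps, former_skip
)
proposed_total = checkpoint_suffix + remaining_optimizer_steps(
    batches_per_epoch, accumulation_steps, proposed_skip
)

print(former_skip, proposed_skip)      # prints: 480 30720
print(former_total, proposed_total)    # prints: 1109 636
```

Under these assumptions the former code ends the epoch at 1109 optimization steps instead of the expected 636, while the proposed change lands exactly on 636.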

- `transformers` version: latest
- Platform:
- Python version: 3.7
- PyTorch version (GPU?): GPU
- Using GPU in script?: yes

stale[bot] commented 4 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.