huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer train continues after resume_from_checkpoint on a checkpoint with early stop #10290

Closed tanmay17061 closed 3 years ago

tanmay17061 commented 3 years ago

Environment info

When continuing training from a checkpoint, Trainer does not check whether the checkpoint terminated with self.control.should_training_stop == True.

self.control.should_training_stop == True holds when:

  1. state.global_step >= state.max_steps
    • training does not resume on resume_from_checkpoint, because the step information (state.global_step) is recovered from the checkpoint state 👍
  2. the early stopping condition is met
    • training resumes, because there is no mechanism to recover the previous early stopping state 👎
    • even early_stopping_patience_counter is reset to 0 when EarlyStoppingCallback is initialized, irrespective of resume_from_checkpoint 👎
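The failure mode in case 2 can be sketched without the library. This is a minimal, self-contained illustration; the EarlyStopper class below is hypothetical and only mirrors EarlyStoppingCallback's patience-counter logic:

```python
# The patience counter lives only in memory, so a freshly constructed
# callback starts again from zero after a resume.

class EarlyStopper:
    def __init__(self, patience=2):
        self.patience = patience
        self.counter = 0          # reset to 0 on every init, as in the issue
        self.best = None

    def check(self, metric):
        """Return True when training should stop (no improvement for `patience` evals)."""
        if self.best is None or metric < self.best:
            self.best = metric
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

# First run: the eval metric stops improving, so early stopping fires.
stopper = EarlyStopper(patience=2)
fired = [stopper.check(m) for m in [0.9, 0.8, 0.8, 0.8]]
assert fired[-1] is True  # analogous to should_training_stop == True

# "Resume": a new callback instance is built, its counter is 0 again,
# so nothing tells the Trainer that training had already stopped.
resumed = EarlyStopper(patience=2)
assert resumed.counter == 0 and resumed.check(0.8) is False
```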

Who can help

@sgugger as issue in Trainer.


To reproduce

Steps to reproduce the behavior:

  1. Call Trainer.train with resume_from_checkpoint pointing to a checkpoint that stopped due to early stopping
  2. Observe that training resumes instead of stopping immediately

Expected behavior

Training should not happen as the checkpoint loaded had stopped due to early stopping.

sgugger commented 3 years ago

Indeed, I can see the problem. I'm not sure there is an easy fix however and I don't have time right now to build a proper callback checkpointing system. Will have to wait a little bit to be fixed!
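Until a proper callback checkpointing system exists, one possible workaround is to persist the early-stopping state next to the checkpoint and check it before resuming. This is a hedged sketch, not part of the transformers API; the file name "early_stopping_state.json" and both helper functions are hypothetical:

```python
import json
import os
import tempfile

def save_early_stopping_state(checkpoint_dir, patience_counter, should_training_stop):
    """Write the early-stopping state to a JSON file inside the checkpoint dir."""
    path = os.path.join(checkpoint_dir, "early_stopping_state.json")
    with open(path, "w") as f:
        json.dump({"patience_counter": patience_counter,
                   "should_training_stop": should_training_stop}, f)

def load_early_stopping_state(checkpoint_dir):
    """Read the state back; default to 'not stopped' for old checkpoints."""
    path = os.path.join(checkpoint_dir, "early_stopping_state.json")
    if not os.path.exists(path):
        return {"patience_counter": 0, "should_training_stop": False}
    with open(path) as f:
        return json.load(f)

# Round-trip: a checkpoint written after early stopping fired is detected on resume,
# so the caller can skip trainer.train(resume_from_checkpoint=...) entirely.
with tempfile.TemporaryDirectory() as ckpt:
    save_early_stopping_state(ckpt, patience_counter=3, should_training_stop=True)
    state = load_early_stopping_state(ckpt)
    assert state["should_training_stop"] is True
```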

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Ubadub commented 8 months ago

@sgugger Could you please reopen this issue? The issue persists despite being automatically closed.