huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for Pytorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

Trainer train continues after resume_from_checkpoint on a checkpoint with early stop #10290

Closed tanmay17061 closed 3 years ago

tanmay17061 commented 3 years ago

Environment info

When continuing training from a checkpoint, Trainer does not check whether the checkpoint terminated with self.control.should_training_stop == True.

self.control.should_training_stop == True holds when:

  1. state.global_step >= state.max_steps
    • training does not resume on resume_from_checkpoint, because the step information (state.global_step) is recovered from the checkpoint state 👍
  2. the early stopping condition is met
    • training resumes, because there is no mechanism to recover the previous early stopping state 👎
    • even early_stopping_patience_counter is reset to 0 when EarlyStoppingCallback is initialized, irrespective of resume_from_checkpoint 👎
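The failure mode in case 2 can be sketched without the library. This is a minimal, self-contained illustration; the EarlyStopper class below is hypothetical and only mirrors EarlyStoppingCallback's patience-counter logic:

```python
# The patience counter lives only in memory, so a freshly constructed
# callback starts again from zero after a resume.

class EarlyStopper:
    def __init__(self, patience=2):
        self.patience = patience
        self.counter = 0          # reset to 0 on every init, as in the issue
        self.best = None

    def check(self, metric):
        """Return True when training should stop (no improvement for `patience` evals)."""
        if self.best is None or metric < self.best:
            self.best = metric
            self.counter = 0
        else:
            self.counter += 1
        return self.counter >= self.patience

# First run: the eval metric stops improving, so early stopping fires.
stopper = EarlyStopper(patience=2)
fired = [stopper.check(m) for m in [0.9, 0.8, 0.8, 0.8]]
assert fired[-1] is True  # analogous to should_training_stop == True

# "Resume": a new callback instance is built, its counter is 0 again,
# so nothing tells the Trainer that training had already stopped.
resumed = EarlyStopper(patience=2)
assert resumed.counter == 0 and resumed.check(0.8) is False
```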

Who can help

@sgugger as issue in Trainer.


To reproduce

Steps to reproduce the behavior:

  1. Call Trainer.train with resume_from_checkpoint pointing to a checkpoint that stopped due to early stopping
  2. Observe that training resumes instead of stopping immediately

Expected behavior

Training should not happen as the checkpoint loaded had stopped due to early stopping.

sgugger commented 3 years ago

Indeed, I can see the problem. I'm not sure there is an easy fix however and I don't have time right now to build a proper callback checkpointing system. Will have to wait a little bit to be fixed!
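Until a proper callback checkpointing system exists, one possible workaround is to persist the early-stopping state next to the checkpoint and check it before resuming. This is a hedged sketch, not part of the transformers API; the file name "early_stopping_state.json" and both helper functions are hypothetical:

```python
import json
import os
import tempfile

def save_early_stopping_state(checkpoint_dir, patience_counter, should_training_stop):
    """Write the early-stopping state to a JSON file inside the checkpoint dir."""
    path = os.path.join(checkpoint_dir, "early_stopping_state.json")
    with open(path, "w") as f:
        json.dump({"patience_counter": patience_counter,
                   "should_training_stop": should_training_stop}, f)

def load_early_stopping_state(checkpoint_dir):
    """Read the state back; default to 'not stopped' for old checkpoints."""
    path = os.path.join(checkpoint_dir, "early_stopping_state.json")
    if not os.path.exists(path):
        return {"patience_counter": 0, "should_training_stop": False}
    with open(path) as f:
        return json.load(f)

# Round-trip: a checkpoint written after early stopping fired is detected on resume,
# so the caller can skip trainer.train(resume_from_checkpoint=...) entirely.
with tempfile.TemporaryDirectory() as ckpt:
    save_early_stopping_state(ckpt, patience_counter=3, should_training_stop=True)
    state = load_early_stopping_state(ckpt)
    assert state["should_training_stop"] is True
```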

github-actions[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Ubadub commented 8 months ago

@sgugger Could you please reopen this issue? The issue persists despite being automatically closed.