Closed tanmay17061 closed 3 years ago
Indeed, I can see the problem. I'm not sure there is an easy fix however and I don't have time right now to build a proper callback checkpointing system. Will have to wait a little bit to be fixed!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
@sgugger Could you please reopen this issue? The issue persists despite being automatically closed.
Environment info
When continuing training from a checkpoint, `Trainer` does not check whether the checkpoint terminated with `self.control.should_training_stop == True`.

`self.control.should_training_stop == True` holds when:

1. `state.global_step >= state.max_steps`. This case is handled on `resume_from_checkpoint`, since the step information (`state.global_step`) is recovered from the checkpoint state 👍
2. Early stopping triggers. This case is not handled: `early_stopping_patience_counter` is restarted from 0 on `EarlyStoppingCallback` init, irrespective of `resume_from_checkpoint` 👎

Who can help
@sgugger, as the issue is in `Trainer`.
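A minimal, self-contained sketch of the two conditions described above. It does not import `transformers`; the classes below are stand-ins that only mimic the relevant behavior (restoring `global_step` from the checkpoint, but re-initializing the patience counter to 0):

```python
from dataclasses import dataclass

# Stand-in for the step-related fields of transformers' TrainerState.
@dataclass
class State:
    global_step: int = 0
    max_steps: int = 100

# Stand-in for EarlyStoppingCallback: the counter is set to 0 in __init__,
# so it is NOT recovered when resuming from a checkpoint (the reported bug).
@dataclass
class EarlyStopping:
    patience: int = 3
    patience_counter: int = 0

# Case 1 (works on resume): global_step is restored from the checkpoint,
# so the max-steps stopping condition re-triggers immediately.
resumed_state = State(global_step=100, max_steps=100)
print(resumed_state.global_step >= resumed_state.max_steps)  # True

# Case 2 (broken on resume): a freshly constructed callback has lost the
# patience counter the run had accumulated before it stopped early.
resumed_callback = EarlyStopping(patience=3)
print(resumed_callback.patience_counter)  # 0, even if it was 3 at save time
```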
Information
Model I am using (Bert, XLNet ...):
The problem arises when using:
The tasks I am working on is:
To reproduce
Steps to reproduce the behavior:
1. Call `Trainer.train` with `resume_from_checkpoint` pointing to a checkpoint that stopped due to early stopping

Expected behavior
Training should not proceed, as the loaded checkpoint had stopped due to early stopping.
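Until a proper callback checkpointing system exists, one possible workaround is to persist the early-stopping state next to the checkpoint and consult it before resuming. This is only a sketch: the file name, JSON structure, and helper functions below are assumptions, not part of the actual transformers checkpoint format.

```python
import json
import os
import tempfile

# Hypothetical side file written alongside a checkpoint (assumption).
STATE_FILE = "early_stopping_state.json"

def save_early_stopping_state(checkpoint_dir, patience_counter, should_stop):
    # Record the callback state the Trainer currently discards.
    with open(os.path.join(checkpoint_dir, STATE_FILE), "w") as f:
        json.dump(
            {"patience_counter": patience_counter,
             "should_training_stop": should_stop},
            f,
        )

def load_early_stopping_state(checkpoint_dir):
    # Fall back to a fresh state when no side file exists.
    path = os.path.join(checkpoint_dir, STATE_FILE)
    if not os.path.exists(path):
        return {"patience_counter": 0, "should_training_stop": False}
    with open(path) as f:
        return json.load(f)

# A run that stopped early records its state at save time...
ckpt = tempfile.mkdtemp()
save_early_stopping_state(ckpt, patience_counter=3, should_stop=True)

# ...and a resumed run can refuse to train instead of starting over.
state = load_early_stopping_state(ckpt)
if state["should_training_stop"]:
    print("checkpoint stopped early; skipping training")
```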