Open kinggongzilla opened 5 months ago
Hi @kinggongzilla, thanks for filing this issue!
Currently we only support early stopping based on the number of steps taken in an epoch, i.e. you can set the `max_steps_per_epoch` flag in the configuration to stop training early after a fixed number of steps.
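For illustration, a minimal sketch of what step-count-based epoch truncation looks like in a generic training loop. This is not torchtune's actual recipe code; the `train_one_epoch` and `train_step` names are hypothetical.

```python
# Hypothetical sketch (not torchtune's recipe code) of truncating an epoch
# after a fixed number of steps, in the spirit of max_steps_per_epoch.
def train_one_epoch(dataloader, train_step, max_steps_per_epoch=None):
    losses = []
    for step, batch in enumerate(dataloader):
        # Stop the epoch early based purely on the step count, independent
        # of any validation metric.
        if max_steps_per_epoch is not None and step >= max_steps_per_epoch:
            break
        losses.append(train_step(batch))
    return losses
```

Note that this truncates each epoch by step count only; it has no notion of model quality, which is exactly the gap raised in this issue.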
However, this doesn't satisfy your use case of early stopping or saving a checkpoint based on validation results.
In-training evaluation and evaluation-based stopping criteria are a large design space we haven't looked deeply into yet; what do you folks think @ebsmothers @RdoubleA? I could see a future in which we allow users to specify a validation dataset or validation split, and feed validation metrics into our checkpointer to decide whether or not to save a checkpoint. This is definitely something we could look at enabling in the future if there's sufficient interest.
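A rough sketch of what such validation-gated checkpointing could look like. This is purely illustrative of the idea discussed above, not an existing torchtune API; the `BestCheckpointSaver` class and `save_fn` callback are hypothetical names.

```python
# Hypothetical sketch of validation-gated checkpointing; torchtune does not
# expose this today. The class and callback names are illustrative only.
class BestCheckpointSaver:
    """Invoke a save callback only when validation loss improves."""

    def __init__(self, save_fn):
        self.best_loss = float("inf")
        self.save_fn = save_fn  # e.g. a function wrapping the checkpointer

    def maybe_save(self, model, val_loss):
        # Save only on strict improvement over the best loss seen so far.
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.save_fn(model)
            return True
        return False
```

In a recipe, `maybe_save` would be called after each validation pass instead of unconditionally saving at every epoch boundary.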
Thanks for the quick reply! Being able to define a validation dataset and do early stopping based on the validation loss would definitely be super helpful.
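For concreteness, early stopping on validation loss is commonly implemented with a patience counter, along these lines. This is a generic sketch, not a torchtune feature; the `EarlyStopper` name and its parameters are hypothetical.

```python
# Hypothetical patience-based early stopping on validation loss;
# not part of torchtune today.
class EarlyStopper:
    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience    # epochs to tolerate without improvement
        self.min_delta = min_delta  # minimum change that counts as improvement
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

The training loop would call `should_stop` once per validation pass and break out when it returns True.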
+1 this would be super useful.
+1 Would be super useful!
Thanks all for the comments. This feature (along with general validation loops) is fairly high on our wishlist right now. We still need to do a bit of design work to make sure it's not too intrusive in our recipes, but we definitely hear you on the need for this feature. We'll keep you posted here!
Is there a way to evaluate the model performance during training on a validation dataset and only save a new checkpoint if it achieves lower validation loss?