Closed: WilliamAntoniocrayon closed this issue 2 years ago.
The program was killed at epoch 39 while num_train_epochs is 50.0. Then, when I set num_train_epochs to 30, the program was killed at epoch 26.
Hi there, thank you for your interest. I have encountered this bug elsewhere, and it may be due to a GPU memory leak in PyTorch. For this stage I should have set num_train_epochs to 20, because the max F1 score in this stage (the training corpus is over 100k) typically occurs at earlier epochs, usually below 20, so you may check your best checkpoints from the earlier epochs. I also should have added an early-stopping mechanism to save computing time; I will try to update this part in a few days. Thanks again for bringing this up.
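For anyone reading this before that update lands, here is a minimal sketch of what such an early-stopping mechanism could look like, assuming the training script uses the Hugging Face `Trainer` (the `num_train_epochs` argument suggests `TrainingArguments`); the metric name, patience value, and placeholder objects below are illustrative, not this repo's actual settings:

```python
# Sketch only: assumes the Hugging Face Trainer API. The model, datasets,
# and compute_metrics function are placeholders defined elsewhere in the
# training script; adjust the metric name and patience to match this repo.
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="out",
    num_train_epochs=50,              # upper bound; early stopping ends sooner
    evaluation_strategy="epoch",      # evaluate once per epoch
    save_strategy="epoch",            # checkpoint every epoch
    load_best_model_at_end=True,      # required by EarlyStoppingCallback
    metric_for_best_model="f1",       # Trainer looks up "eval_f1" internally
    greater_is_better=True,
)

trainer = Trainer(
    model=model,                      # defined elsewhere
    args=training_args,
    train_dataset=train_dataset,      # defined elsewhere
    eval_dataset=eval_dataset,        # defined elsewhere
    compute_metrics=compute_metrics,  # must return {"f1": ...}
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()  # stops once eval F1 fails to improve for 3 evaluations
```

This also keeps a checkpoint from every epoch, so the best earlier-epoch checkpoint mentioned above is preserved even if training is cut short.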
Thank you for your advice. I look forward to your revised code and your future work.
I'm also curious why, whenever I run step 3, the program is always killed without any hint.
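A kill with no Python traceback is often the Linux OOM killer terminating the process when host RAM runs out (the system log usually records a line like "Out of memory: Killed process ..."). One way to narrow it down is to log memory use each epoch and watch whether it grows steadily; here is a minimal sketch, assuming `psutil` is installed (the `log_memory` helper is hypothetical and would be called from the training loop):

```python
# Hypothetical helper: print host RSS and GPU memory once per epoch;
# a steadily growing number points to the leak that precedes the kill.
import os

import psutil
import torch

def log_memory(epoch: int) -> None:
    rss_gb = psutil.Process(os.getpid()).memory_info().rss / 1e9
    gpu_gb = torch.cuda.memory_allocated() / 1e9 if torch.cuda.is_available() else 0.0
    print(f"epoch {epoch}: host RSS {rss_gb:.2f} GB, GPU allocated {gpu_gb:.2f} GB")
```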