Closed: Backpackerice closed this issue 3 years ago.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Hi @Backpackerice, did you solve the issue? I also encountered the same issue while training with the Trainer API.
I have found the issue. I was creating the model before constructing the Trainer object, so it wasn't initialised with a seed. I solved the issue by setting the seed before initialising the model. It can also be solved by providing the model_init arg to Trainer.
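To illustrate the fix described above, here is a minimal sketch (using a plain torch.nn.Linear as a stand-in for the actual model, and a simplified seeding helper, since neither is shown in this thread) of why seeding before model creation makes the random initial weights identical across runs:

```python
import torch

def set_seed(seed: int):
    # Minimal seeding helper; transformers' set_seed additionally
    # seeds Python's random module and NumPy.
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op on CPU-only machines

def make_model():
    # Stand-in for the real model; any randomly initialised module
    # (e.g. a fresh classification head on top of BERT) behaves the same.
    return torch.nn.Linear(4, 2)

# Seeding *before* each model creation gives identical initial weights:
set_seed(42)
m1 = make_model()
set_seed(42)
m2 = make_model()
same = all(torch.equal(p, q) for p, q in zip(m1.parameters(), m2.parameters()))
```

Passing model_init=make_model to Trainer achieves the same effect, because the Trainer seeds the RNGs itself and then calls model_init to build the model.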
Hi there,
I am using my customized BERT script to train a model. However, even though I keep the same settings for lr, AdamW weight decay and epochs, and run on the same platform (CUDA on SageMaker) with the same torch (1.5.0) and transformers (2.11.0) versions, the loss still changes a lot between runs. This makes my different experiments not comparable.
Can someone who has experienced this before, or has any ideas, please advise me on what I should do? I really want to solve this reproducibility issue so that I can continue my experiments. I'd greatly appreciate your help!
Details as below:
For example, I set epochs = 4, lr = 1e-5, and AdamW weight decay = 0.01. For one run I got this result for the first epoch (showing only the last complete 100 batches):
And for the second attempt I got this for the first epoch:
Note that the last-used learning rate per 100 batches is the same, while the average loss per 100 batches is slightly different. But this results in very different predictions on the validation and test data sets.
I already set the seed in my model with the function below:
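(The original function did not survive in this thread; the sketch below is a common equivalent of such a seed-everything helper, including the cuDNN flags, whose non-deterministic kernels are a frequent source of run-to-run variation on CUDA even when all seeds match.)

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    # Seed every RNG that can affect training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # no-op without CUDA
    os.environ["PYTHONHASHSEED"] = str(seed)
    # Force cuDNN into deterministic mode; otherwise convolution and
    # attention kernels may be chosen non-deterministically per run.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```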
And my model script is like this:
The loss is calculated using BCEWithLogitsLoss() from torch.nn.
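BCEWithLogitsLoss fuses the sigmoid and binary cross-entropy into one numerically stable operation, so it expects raw logits, not probabilities. A minimal example (with made-up logits and multi-label targets):

```python
import torch
from torch import nn

loss_fn = nn.BCEWithLogitsLoss()

# Raw logits (no sigmoid applied) and float targets, one column per label.
logits = torch.tensor([[1.2, -0.8], [0.3, 2.1]])
targets = torch.tensor([[1.0, 0.0], [0.0, 1.0]])

loss = loss_fn(logits, targets)

# Mathematically equivalent to sigmoid + BCELoss, but more stable:
manual = nn.BCELoss()(torch.sigmoid(logits), targets)
```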
The train, validation and test scripts are as below:
The optimizer and scheduler are defined as below:
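(The actual definitions are not shown in this thread; the sketch below is a typical setup matching the hyperparameters mentioned in the post, lr = 1e-5 and weight decay = 0.01, using torch.optim.AdamW with a hand-rolled linear warmup/decay via LambdaLR. The warmup and total step counts are assumptions.)

```python
import torch

model = torch.nn.Linear(8, 2)  # stand-in for the BERT model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)

num_training_steps = 1000  # assumed: epochs * batches_per_epoch
num_warmup_steps = 100     # assumed warmup length

def lr_lambda(step: int) -> float:
    # Linear warmup to the base lr, then linear decay to zero.
    if step < num_warmup_steps:
        return step / max(1, num_warmup_steps)
    return max(
        0.0,
        (num_training_steps - step) / max(1, num_training_steps - num_warmup_steps),
    )

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```

transformers 2.11.0 ships get_linear_schedule_with_warmup, which implements the same schedule.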
And to run the model, I use the script below: