BangLiu opened this issue 6 years ago
After fixing some data processing issues, my current performance is:

Epoch 1: train accuracy 0.8593, dev accuracy 0.7004, train_loss 13.87, dev_loss 20.79
Epoch 2: train accuracy 0.9206, dev accuracy 0.7157, train_loss 10.63, dev_loss 22.20

With more epochs, the dev_loss keeps growing while the accuracy stays similar.
I found that the model has 116,531,713 trainable parameters, so I thought the network might simply be big enough to memorize even 120,000 training examples. However, ROCStories has fewer than 2,000 examples and the model doesn't overfit there, so I don't know why it overfits on my own data.
Had the same issue with imdb sentiment analysis. Would appreciate some pointers here...
@teucer @BangLiu Have you tried an even higher lm_coef?
If you want to reduce overfitting, you may also want to give the network an additional task to complete (multi-task learning). This will give it something else to do with its parameters.
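This is essentially what lm_coef already does in this repo: the auxiliary language-modeling loss is added to the classification loss. A minimal sketch of that weighted multi-task loss (tensor names and shapes here are illustrative, not the repo's exact API):

```python
import torch
import torch.nn.functional as F

def combined_loss(clf_logits, clf_labels, lm_logits, lm_labels, lm_coef=0.5):
    """Multi-task loss: classification plus an auxiliary LM objective.

    clf_logits: (batch, n_classes); lm_logits: (batch*seq, vocab);
    lm_labels: (batch*seq,), with -1 marking positions to ignore.
    """
    clf_loss = F.cross_entropy(clf_logits, clf_labels)
    lm_loss = F.cross_entropy(lm_logits, lm_labels, ignore_index=-1)
    # Raising lm_coef shifts capacity toward the LM task, which can
    # regularize the classifier.
    return clf_loss + lm_coef * lm_loss
```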
@rodgzilla ok will do that. What about increasing the dropout probability in the classification head? Would it help to increase it?
@teucer @BangLiu Did you guys solve the problem?
Have you found any good way to regularize the network?
I have a similar issue. I am training a DistilBERT model after cleaning the ISOT fake news dataset and I am getting 99% validation accuracy after 1 epoch, but it predicts wrong labels on unseen data. I guess the model is just memorizing the input sequences and it's clearly overfitting. So how can I regularize it?
Add label smoothing and dropout.
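Label smoothing is built into PyTorch's cross-entropy loss (since 1.10): it softens the one-hot targets so the model is penalized for over-confident predictions. A minimal sketch with dummy data:

```python
import torch
import torch.nn as nn

# label_smoothing=0.1 mixes 10% of the probability mass uniformly
# across classes instead of putting it all on the true label.
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

logits = torch.randn(4, 2)           # dummy batch of binary logits
labels = torch.tensor([0, 1, 1, 0])
loss = criterion(logits, labels)
```

Dropout, by contrast, is set on the model itself (e.g. the dropout probabilities in the DistilBERT config).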
Any answer on this? How do you avoid overfitting on smaller datasets? Is dropout the only option?
I adapted this model to a text classification problem, where my text is concatenated as: [start] text1 [delimiter] text2 [delimiter] text3 [classify], and it is just a binary classification problem. So I use F.softmax on the model output and BCE loss. I have 120,000 training examples and 10,000 evaluation examples. n_ctx is set to 500. One epoch takes about 7 hours (1 GPU). With lm_coef = 0.5, the accuracy on my training dataset is 0.9, but dev accuracy is just 0.66, and more epochs don't improve the evaluation accuracy. So this is exactly overfitting. I am looking for suggestions about what I can tune to stop it overfitting, in either the model or the training settings.
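For concreteness, the input format described above can be sketched as follows (token ids and the helper name are illustrative placeholders, not the repo's code):

```python
def build_input(text_ids, start_id, delim_id, clf_id, n_ctx=500):
    """Concatenate pre-tokenized segments as
    [start] text1 [delimiter] text2 [delimiter] text3 [classify],
    truncating so the sequence fits in n_ctx tokens."""
    seq = [start_id]
    for i, ids in enumerate(text_ids):
        if i > 0:
            seq.append(delim_id)
        seq.extend(ids)
    # Reserve the last position for the [classify] token, whose hidden
    # state feeds the classification head.
    return seq[: n_ctx - 1] + [clf_id]
```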