huggingface / pytorch-openai-transformer-lm

🐥A PyTorch implementation of OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI

Avoid model overfitting #31

Open BangLiu opened 6 years ago

BangLiu commented 6 years ago

I adapted this model to a text classification problem, where my text is concatenated as: [start] text1 [delimiter] text2 [delimiter] text3 [classify], and it is just a binary classification problem. So I use F.softmax for the model output and a BCE loss. I have 120,000 training examples and 10,000 evaluation examples, and n_ctx is set to 500. One epoch takes about 7 hours on a single GPU. When I use lm_coef = 0.5, the training accuracy reaches 0.9, but dev accuracy is only 0.66, and more epochs don't improve the accuracy on the evaluation dataset. So this is clearly overfitting. I am looking for suggestions about what I can tune to stop it from overfitting, in either the model or the training settings.
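For reference, this is roughly the setup I mean, as a minimal sketch: the special-token ids (START, DELIM, CLF) and helper names are placeholders, and the loss is written in the equivalent single-logit / BCEWithLogits form rather than softmax + BCE.

```python
import torch
import torch.nn.functional as F

# Hypothetical ids for the special tokens; the real ids come from the text encoder's vocabulary.
START, DELIM, CLF = 40478, 40479, 40480

def build_input(text1_ids, text2_ids, text3_ids, n_ctx=500):
    # [start] text1 [delimiter] text2 [delimiter] text3 [classify], truncated and zero-padded to n_ctx
    ids = [START] + text1_ids + [DELIM] + text2_ids + [DELIM] + text3_ids + [CLF]
    ids = ids[:n_ctx]
    return torch.tensor(ids + [0] * (n_ctx - len(ids)), dtype=torch.long)

def clf_loss(clf_logits, labels):
    # clf_logits: (batch, 1) output of the classification head at the [classify] position
    # binary cross-entropy on the single logit (sigmoid form of the softmax + BCE setup above)
    return F.binary_cross_entropy_with_logits(clf_logits.view(-1), labels.float())
```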

BangLiu commented 6 years ago

After fixing some data processing issues, my current performance is:
Epoch 1: train accuracy 0.8593, dev accuracy 0.7004, train_loss 13.87, dev_loss 20.79
Epoch 2: train accuracy 0.9206, dev accuracy 0.7157, train_loss 10.63, dev_loss 22.20
With more epochs, the dev_loss keeps growing while the accuracy stays about the same.

I found that the model has 116,531,713 trainable parameters, so I thought maybe the network is big enough to memorize even 120,000 training examples. However, ROCStories has fewer than 2,000 examples and doesn't overfit, so I don't understand why my own data does.
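(For reference, a parameter count like that can be reproduced with a one-liner, assuming `model` is the instantiated network, i.e. any `nn.Module`:)

```python
# Count only the parameters that will actually be updated by the optimizer.
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"{n_params:,} trainable parameters")  # ~116.5M in this case
```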

teucer commented 6 years ago

Had the same issue with IMDB sentiment analysis. Would appreciate some pointers here...

rodgzilla commented 6 years ago

@teucer @BangLiu Have you tried an even higher lm_coef?

If you want to reduce overfitting, you may also want to give the network an additional task to complete (multi-task learning). This gives its extra parameters something useful to do; see the sketch below.
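A minimal sketch of what I mean, assuming the classification and language-modeling losses are already computed per batch (variable names are illustrative, not the repo's exact ones):

```python
# The auxiliary LM loss is weighted by lm_coef, so a larger coefficient forces the
# parameters to also model the text itself, which can act as a regularizer for the
# classification task.
def combined_loss(clf_loss, lm_loss, lm_coef=1.0):  # lm_coef=1.0 vs. the 0.5 used above
    return clf_loss + lm_coef * lm_loss
```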

teucer commented 6 years ago

@rodgzilla OK, will do that. What about the dropout probability in the classification head? Would it help to increase it? Something like the sketch below is what I have in mind.
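(A simplified stand-in for the classification head, not the repo's exact ClfHead; the dropout value is illustrative.)

```python
import torch.nn as nn

class ClfHead(nn.Module):
    def __init__(self, n_embd, n_class, clf_pdrop=0.3):  # clf_pdrop raised here as an experiment
        super().__init__()
        self.dropout = nn.Dropout(clf_pdrop)
        self.linear = nn.Linear(n_embd, n_class)

    def forward(self, h_clf):
        # h_clf: hidden state at the [classify] token, shape (batch, n_embd)
        return self.linear(self.dropout(h_clf))
```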

dchatterjee172 commented 5 years ago

@teucer @BangLiu Did you guys solve the problem?

MrRobot2211 commented 4 years ago

Have you found any good way to regularize the network?

Chinmay-Vadgama commented 4 years ago

I have a similar issue. I am training a DistilBERT model after cleaning the ISOT fake news dataset, and I get 99% validation accuracy after 1 epoch, yet it predicts wrong labels on unseen data. I guess the model is just memorizing the input sequences and clearly overfitting. So, how can I regularize it?
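For context, the dropout knobs I could tune are exposed in the config; something like this (field names are from transformers' DistilBertConfig, the values are just illustrative):

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Raise the dropout probabilities in the config before loading the pretrained weights.
config = DistilBertConfig.from_pretrained(
    "distilbert-base-uncased",
    dropout=0.2,               # hidden-layer dropout
    attention_dropout=0.2,     # attention dropout
    seq_classif_dropout=0.3,   # dropout in the classification head
    num_labels=2,
)
model = DistilBertForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", config=config
)
```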

shreeyashyende commented 3 years ago

Add label smoothing and dropout.
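For example, in PyTorch (label smoothing is built into cross-entropy from 1.10 onward; the values here are illustrative):

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # softens overconfident targets
dropout = nn.Dropout(p=0.2)                            # applied to hidden/pooled representations
```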

pratikchhapolika commented 2 years ago

Any answer on this? How do you avoid overfitting on smaller datasets? Is dropout the only option?