namisan / mt-dnn

Multi-Task Deep Neural Networks for Natural Language Understanding

RoBERTa results are much worse than the BERTs' #218

Closed. DiamondRock closed this issue 3 years ago.

DiamondRock commented 3 years ago

I have run train.py with three models (bert-large, bert-base, and roberta-base) on two multi-class classification datasets, one with 3 labels and one with 5 labels. The results for the BERT models are pretty good, but the results for RoBERTa are pretty bad. Printing the confusion matrices at each epoch makes the difference clear. For instance, at epoch 2, the results for BERT base uncased on the 5-class task are:

07/05/2021 08:47:44 Task fiveClass -- epoch 2 -- Dev ACC: 84.121
07/05/2021 08:47:44 Task fiveClass -- epoch 2 -- Dev F1MAC: 81.951
07/05/2021 08:47:44 Task fiveClass -- epoch 2 -- Dev F1MIC: 84.121
07/05/2021 08:47:44 Task fiveClass -- epoch 2 -- Dev CMAT:
[[328  37   7  39  13]
 [ 20 545  26  20  12]
 [  4  16 118  14   5]
 [ 26  15  21 405  33]
 [  9   5   9  15 437]]

And the results for RoBERTa-base (at epoch 3) are:

07/05/2021 08:56:05 Task fiveClass -- epoch 3 -- Dev ACC: 47.958
07/05/2021 08:56:05 Task fiveClass -- epoch 3 -- Dev F1MAC: 35.243
07/05/2021 08:56:05 Task fiveClass -- epoch 3 -- Dev F1MIC: 47.958
07/05/2021 08:56:05 Task fiveClass -- epoch 3 -- Dev CMAT:
[[ 65 242   0  56  61]
 [ 44 512   0  43  24]
 [ 29  89   0  16  23]
 [ 62 226   0 111 101]
 [ 21  39   0  58 357]]

I have trained up to epoch 20; the RoBERTa results improve a little but then start to degrade after a few epochs. Note that RoBERTa has absolutely no ability to detect label 2: the third column of its confusion matrix is all zeros, meaning it never predicts that label. I ran the same experiment on the 3-class dataset and found that there, too, RoBERTa cannot detect label 2 at all.
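For reference, the metrics printed above (Dev ACC, F1MAC, F1MIC, CMAT) can be recomputed from raw dev-set predictions with scikit-learn. This is only a minimal sketch with made-up `y_true`/`y_pred` lists, not mt-dnn's own evaluation code:

```python
# Minimal sketch: recompute ACC / F1MAC / F1MIC / CMAT from dev-set predictions.
# y_true and y_pred are hypothetical label lists, not produced by mt-dnn itself.
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix

y_true = [0, 1, 2, 3, 4, 1, 3]   # gold labels for the 5-class task
y_pred = [0, 1, 1, 3, 4, 1, 0]   # model predictions

print("Dev ACC:   %.3f" % (100 * accuracy_score(y_true, y_pred)))
print("Dev F1MAC: %.3f" % (100 * f1_score(y_true, y_pred, average="macro")))
print("Dev F1MIC: %.3f" % (100 * f1_score(y_true, y_pred, average="micro")))
print("Dev CMAT:")
print(confusion_matrix(y_true, y_pred))
```

An all-zero column in the confusion matrix means the model never predicts that label, which is exactly what the third column of the RoBERTa matrix above shows.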

Does anybody have any solution for this?

namisan commented 3 years ago

You may need to change the learning rate; RoBERTa typically requires a smaller LR than BERT.
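For example, a smaller learning rate with linear warmup could be wired up like this. This is a rough sketch using standard PyTorch/transformers utilities rather than mt-dnn's own argument parsing, and the concrete values (1e-5, 10% warmup, 1000 steps) are purely illustrative:

```python
# Rough sketch: smaller LR plus linear warmup, as commonly used when fine-tuning RoBERTa.
# The model choice, step count, and hyperparameter values are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=5)
num_training_steps = 1000  # e.g. steps_per_epoch * num_epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # ~10% linear warmup
    num_training_steps=num_training_steps,
)
# In the training loop, call optimizer.step() followed by scheduler.step() each batch.
```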

DiamondRock commented 3 years ago

Thanks, but even after reducing the learning rate I still get pretty bad results. I changed the learning rate from the default value (5e-5) to 1e-5 and 1e-6, and both yield the same low performance. Increasing it to 5e-4 and 5e-3 gives equally poor results.

chawannut157 commented 3 years ago

I observed the same thing when comparing RoBERTa vs. BERT. I also tried fine-tuning RoBERTa directly via Hugging Face (HF), and the performance there is better.
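For comparison, a bare-bones Hugging Face fine-tuning setup for roberta-base on a 5-class task might look like the sketch below. The tiny in-memory dataset and the hyperparameters are placeholders for illustration only, not the settings used in the experiments above:

```python
# Bare-bones HF fine-tuning sketch for roberta-base on a 5-class task.
# The texts, labels, and hyperparameters are hypothetical placeholders.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

train_texts, train_labels = ["an example sentence"] * 8, [0, 1, 2, 3, 4, 0, 1, 2]
dev_texts, dev_labels = ["a held-out sentence"] * 5, [0, 1, 2, 3, 4]

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=5)

def encode(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train_ds = Dataset.from_dict({"text": train_texts, "labels": train_labels}).map(encode, batched=True)
dev_ds = Dataset.from_dict({"text": dev_texts, "labels": dev_labels}).map(encode, batched=True)

args = TrainingArguments(
    output_dir="roberta-fiveclass",
    learning_rate=1e-5,              # illustrative; RoBERTa is usually fine-tuned at 1e-5..3e-5
    num_train_epochs=3,
    per_device_train_batch_size=16,
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=dev_ds)
trainer.train()
print(trainer.evaluate())
```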