microsoft / GLUECoS

A benchmark for code-switched NLP, ACL 2020
https://microsoft.github.io/GLUECoS
MIT License

Performance on NLI task #47

Closed · PC09 closed this issue 3 years ago

PC09 commented 3 years ago

Hi!

I was trying to replicate the results for the NLI task with the multilingual BERT model. The GLUECoS paper reports 61.09 for mBERT (57.74 on the leaderboard), but when I run the sample NLI script here with the default parameters, my test accuracy comes out very low, around 33. Could anyone confirm whether the numbers in the paper are for this baseline? Also, I see the data was updated; are those numbers for an older version of the data?
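For reference, this is roughly the kind of run I mean: a minimal Hugging Face Trainer sketch for fine-tuning mBERT on a sentence-pair NLI dataset. This is not the repo's own NLI script; the TSV paths, the premise/hypothesis/label column names, and `num_labels` are placeholders.

```python
# Minimal sketch of fine-tuning mBERT on a sentence-pair NLI dataset with the
# Hugging Face Trainer. Not the GLUECoS script: file paths, column names and
# num_labels are placeholders and should be adapted to the actual data.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=2)  # set num_labels to the task's label count

# Hypothetical tab-separated files with premise, hypothesis and integer label columns.
data = load_dataset("csv",
                    data_files={"train": "train.tsv", "validation": "dev.tsv"},
                    delimiter="\t")

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

data = data.map(tokenize, batched=True)

def compute_accuracy(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

args = TrainingArguments(output_dir="nli_mbert", num_train_epochs=3,
                         per_device_train_batch_size=16, learning_rate=2e-5,
                         seed=42)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["validation"],
                  tokenizer=tokenizer, compute_metrics=compute_accuracy)
trainer.train()
print(trainer.evaluate())
```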

Thanks!

BarahFazili commented 3 years ago

I am experiencing the same issue. The same low scores show up for both NLI and QA.

Genius1237 commented 3 years ago

The runs where training diverges are the ones where you get 33% accuracy. Try running with different hyperparameters (seeds, batch_size, learning rate).
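One quick way to tell that a run has diverged, rather than just scored a bit low, is to compare its accuracy against the chance or majority-class rate: a collapsed model that predicts a single label for everything will sit right at that level. A small sketch, assuming the gold labels are available as a list (the label distribution below is purely illustrative):

```python
# Sketch: compare a run's accuracy against the majority-class rate. A model stuck
# at (or below) this rate has likely collapsed to predicting a single label,
# i.e. training diverged. Assumes gold labels are available as a Python list.
from collections import Counter

def majority_class_rate(gold_labels):
    counts = Counter(gold_labels)
    return counts.most_common(1)[0][1] / len(gold_labels)

# Hypothetical, roughly balanced label distribution for illustration:
gold = [0] * 35 + [1] * 34 + [2] * 31
print(majority_class_rate(gold))  # 0.35 -> a run stuck near this level is suspect
```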

BarahFazili commented 3 years ago

For QA, there's no test set, right? So the evaluation is done on the dev set itself. I'm able to get the training to converge, but the F1 score I print locally (over ~70) is far from the score reported via the pull request submission (around 23). I can't figure out the inconsistency here. Could you please help clarify?
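For context, here is a minimal sketch of the standard SQuAD-style token-overlap F1 between a predicted and a gold answer string, which is what I'm comparing against locally. This is an illustration only, not the GLUECoS evaluation code, so it may not match the server-side scoring exactly.

```python
# Sketch of SQuAD-style token-overlap F1 between a predicted and a gold answer
# string. The official SQuAD metric also strips punctuation and articles during
# normalization, which is omitted here for brevity.
import collections

def squad_f1(prediction: str, ground_truth: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("sachin tendulkar", "Sachin Tendulkar"))  # 1.0
```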

Genius1237 commented 3 years ago

The test sets for both QA and NLI are small, so fluctuations in the score can easily occur. Getting good numbers on the train set alone won't be enough. I suggest trying a smaller batch size (2 or 4) and a smaller learning rate (1e-5 or 5e-6) and seeing what you get. We had to follow a similar procedure to get the results we reported.
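For illustration, a sweep over those settings might look like the sketch below. The `TrainingArguments` values (output directories, epoch count, seeds) are hypothetical; plug them into whatever model/Trainer setup you are using rather than treating this as the actual GLUECoS run script.

```python
# Hypothetical sweep over the smaller batch sizes / learning rates suggested above,
# plus a few seeds. TrainingArguments here is illustrative; combine it with your
# own model and Trainer construction for the task.
import itertools
from transformers import TrainingArguments

for batch_size, lr, seed in itertools.product([2, 4], [1e-5, 5e-6], [17, 42, 2020]):
    args = TrainingArguments(
        output_dir=f"nli_bs{batch_size}_lr{lr}_seed{seed}",
        per_device_train_batch_size=batch_size,
        learning_rate=lr,
        seed=seed,
        num_train_epochs=5,  # illustrative
    )
    print(f"run: bs={batch_size} lr={lr} seed={seed} -> {args.output_dir}")
    # Build the model and Trainer with these args, train, and keep the checkpoint
    # whose dev score is clearly above the chance / majority-class level.
```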

PC09 commented 3 years ago

Thanks @Genius1237, I was trying with the default hyperparameters. Will try tuning them as suggested above.