microsoft / GLUECoS

A benchmark for code-switched NLP, ACL 2020
https://microsoft.github.io/GLUECoS
MIT License

The NLI training dataset has two labels while the test dataset has three labels? #63

Closed lizhiustc closed 2 years ago

lizhiustc commented 2 years ago

For the NLI datasets, the training data has two labels: entailment and contradictory. But the gold labels of the test set seem to have three! I submitted two results.zip files, one predicting all entailment and one predicting all contradictory, and both scored 33.3%. So the test set must have another label (maybe "neutral"). Are your NLI tasks trained on two labels but tested on three? Please check your datasets carefully!

Genius1237 commented 2 years ago

We're using macro F1 as the evaluation metric, so getting 0.33 as the score for the result you submitted is expected. We are not using accuracy.
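With two roughly balanced gold labels, a constant prediction scores about 0.67 F1 on one class and 0 on the other, so the macro average lands near 0.33 whichever label you submit. A toy sketch of this (not the official evaluation script; the balanced 50/50 split is an assumption):

```python
# Toy illustration: with two roughly balanced gold labels, submitting a
# single label for every example gives ~0.33 macro F1, even though the
# accuracy of the same submission would be 0.50.
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical balanced gold labels -- the 50/50 split is an assumption.
gold = ["entailment"] * 50 + ["contradictory"] * 50
pred = ["entailment"] * 100  # constant "all entailment" submission

print("accuracy:", accuracy_score(gold, pred))             # 0.50
print("macro F1:", f1_score(gold, pred, average="macro"))  # ~0.33
# F1 is 0.67 for "entailment" and 0 for "contradictory", so the macro
# average is ~0.33; predicting all "contradictory" scores the same.
```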

zhilizju commented 2 years ago

> We're using macro F1 as the evaluation metric, so getting 0.33 as the score for the result you submitted is expected. We are not using accuracy.

Thanks a lot. So the results in the paper (GLUECoS: An Evaluation Benchmark for Code-Switched NLP) and the leaderboard use different metrics? You report accuracy in the paper and macro-averaged F1 on the leaderboard? Another question: are all the leaderboard results (SA, NLI, QA, NER, POS, LID) evaluated with macro-averaged F1? It seems that all the results in the paper differ from the leaderboard.

Genius1237 commented 2 years ago

There were a few small differences in the splits used in the paper, so use the results on the leaderboard for any comparisons you want to make. POS, NER, Sentiment and NLI use F1. QA uses F1 as defined by SQuAD. MT uses BLEU.
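For reference, the SQuAD-style QA metric is a token-overlap F1 between the predicted and gold answer strings rather than a label-level macro F1. A simplified sketch (the real SQuAD scorer also lowercases, strips punctuation and articles, and takes the max over multiple gold answers):

```python
# Simplified SQuAD-style token-overlap F1 between two answer strings.
from collections import Counter

def squad_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.split()
    gold_tokens = gold.split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # shared tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("the red car", "a red car"))  # 0.666...
```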