microsoft / GLUECoS

A benchmark for code-switched NLP, ACL 2020
https://microsoft.github.io/GLUECoS
MIT License

QA task scores #52

Closed. BarahFazili closed this issue 3 years ago.

BarahFazili commented 3 years ago

There's no test set for QA, so I believe the scores shown after the git PR are computed on the same dev set. Since the dev set does have labels, we should have been able to use the F1 scores printed locally (which look okay, ~72 for lr=5e-6, bs=2, epochs=16, max_seq=512, seed=32). I don't understand why the scores retrieved via the pull request are so different and extremely poor (~25.3). Please let me know if there's anything I could be missing here, or what explains this inconsistency.

PS: the model is bert-base-multilingual-cased (model type bert).

Genius1237 commented 3 years ago

Hi. Could you post a complete training log for QA somewhere and link it here? I need to have a look at it before I can say anything.

BarahFazili commented 3 years ago

For QA there's a train set and a dev set, and evaluation is done on the dev set. The dev set seems to be provided with correct labels (not just placeholders), so the F1 score printed locally on this set should have been the same as the one rendered through the PR. Below is the log from a run with default params, after uncommenting the parts of run_squad.py that print the results after evaluation.

bash train.sh bert-base-multilingual-cased bert QA_EN_HI

Fine-tuning bert-base-multilingual-cased on QA_EN_HI
06/25/2021 18:23:58 - WARNING - main - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False
Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']

The score reported through the PR was QA_EN_HI: 19.444444444444443, while the value printed locally is around 66.
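For reference, the locally printed number presumably comes from the standard SQuAD-style F1. A minimal sketch of that metric is below; the file names predictions.json and dev-v2.0.json and the SQuAD v2.0 layout (data -> paragraphs -> qas) are assumptions, and the real run_squad.py evaluation also reports exact match and handles further details.

# Minimal sketch of the SQuAD-style token-overlap F1 (assumed metric).
# predictions.json is assumed to map question id -> predicted answer string;
# dev-v2.0.json is assumed to follow the SQuAD v2.0 layout.
import collections
import json
import re
import string

def normalize(text):
    # SQuAD convention: lowercase, drop punctuation and articles, squash whitespace
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # empty answers only count on exact match (no-answer case in SQuAD v2)
        return float(pred_tokens == gold_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(dev_path="dev-v2.0.json", pred_path="predictions.json"):
    dataset = json.load(open(dev_path))["data"]
    predictions = json.load(open(pred_path))
    scores = []
    for article in dataset:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                golds = [a["text"] for a in qa["answers"]] or [""]
                pred = predictions.get(qa["id"], "")
                scores.append(max(f1(pred, g) for g in golds))
    return 100.0 * sum(scores) / len(scores)

print("F1: %.2f" % evaluate())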

Genius1237 commented 3 years ago

Could you share the file Data/Processed_Data/QA_EN_HI/dev-v2.0.json?

Since not all the questions in the original dataset have contexts, DrQA is used to retrieve contexts for these questions from Wikipedia. It looks like when you ran DrQA, it generated contexts for more examples, as the predictions.json that you have uploaded has more entries in it.

BarahFazili commented 3 years ago

Please find attached the dev file.


Genius1237 commented 3 years ago

Could you upload it to a file-sharing site or somewhere like Pastebin?

Genius1237 commented 3 years ago

The file that I have on my end does not have questions with ID 235 onwards in the dev set. Could you make a backup of the dev file, delete the keys with ID 235 onwards, and try running training again? It looks like the higher score you are getting is due to these extra questions being considered part of the dev set.
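One rough way to do that pruning is sketched below; it assumes the numeric IDs sit on the qas entries of the SQuAD-format file, so adjust the filter if they are attached at the title level instead.

# Sketch: drop questions with ID 235 onwards from the SQuAD-format dev file,
# keeping a backup of the original. The location of the IDs is an assumption.
import json
import shutil

DEV_PATH = "Data/Processed_Data/QA_EN_HI/dev-v2.0.json"
shutil.copy(DEV_PATH, DEV_PATH + ".bak")  # backup before editing

with open(DEV_PATH) as f:
    dev = json.load(f)

for article in dev["data"]:
    for paragraph in article["paragraphs"]:
        paragraph["qas"] = [qa for qa in paragraph["qas"] if int(qa["id"]) < 235]
    # drop paragraphs that are left with no questions
    article["paragraphs"] = [p for p in article["paragraphs"] if p["qas"]]

with open(DEV_PATH, "w") as f:
    json.dump(dev, f, ensure_ascii=False)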

BarahFazili commented 3 years ago

Here's the dev set I had been using: Data/Processed_Data/QA_EN_HI/dev-v2.0.json. Even after removing the titles with IDs 235 onwards, the inconsistency persists: the local dev set F1 score is ~67 while the PR gives around 24!

Genius1237 commented 3 years ago

I will check and get back. When you updated the dev file, did you delete the cache file in the same directory? If not, please delete that file and try re-running.
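If it helps, a small sketch for clearing the cached features is below. The cached_* naming follows the usual run_squad.py convention, but the exact file name on disk may differ, so check the directory listing first.

# Sketch: remove the cached feature files so the edited dev set is re-tokenized.
import glob
import os

for path in glob.glob("Data/Processed_Data/QA_EN_HI/cached_*"):
    print("removing", path)
    os.remove(path)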

BarahFazili commented 3 years ago

I also tried after deleting the cached dev file; that didn't help either.

Genius1237 commented 3 years ago

It seems that the QA dataset processing scripts are returning more data points in your case than what is actually expected.

I would suggest that you rerun the QA preprocessing alone in a new python:3.6 docker container and check the dataset that you obtain. Please check whether the train set has 259 entries and the dev set has 54 entries.
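A quick way to check those counts is sketched below; the train file name train-v2.0.json is an assumption based on the dev-v2.0.json naming used elsewhere in this thread.

# Sketch: count QA examples in the processed files and compare against the
# expected sizes (259 train / 54 dev).
import json

def count_examples(path):
    with open(path) as f:
        data = json.load(f)["data"]
    return sum(len(p["qas"]) for article in data for p in article["paragraphs"])

for name in ("train-v2.0.json", "dev-v2.0.json"):
    path = "Data/Processed_Data/QA_EN_HI/" + name
    print(name, count_examples(path))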

BarahFazili commented 3 years ago

I've been getting 313 entries in the train set and 70 in the dev set.

Genius1237 commented 3 years ago

I seem to have figured out what the issue is. Would you be able to try running the code again with a few changes?

When you're running the docker container, use this exact image: python:3.6.10. Also, apply this patch to the GLUECoS repo. It changes 2 lines in the Data/Preprocess_Scripts/preprocess_qa.sh file.

index 1fdaf95..3f57130 100644
--- a/Data/Preprocess_Scripts/preprocess_qa.sh
+++ b/Data/Preprocess_Scripts/preprocess_qa.sh
@@ -15,9 +15,9 @@ python $PREPROCESS_DIR/preprocess_drqa.py --data_dir $ORIGINAL_DATA_DIR
 git clone https://github.com/facebookresearch/DrQA.git
 cd DrQA
 git checkout 96f343c
-pip install -r requirements.txt
+pip install elasticsearch==7.8.0 nltk==3.5 scipy==1.5.0 prettytable==0.7.2 tqdm==4.46.1 regex==2020.6.8 termcolor==1.1.0 scikit-learn==0.23.1 numpy==1.18.5 torch==1.4.0
 python setup.py develop
-pip install spacy
+pip install spacy==2.3.0
 python -m spacy download xx_ent_wiki_sm
 python -c "import nltk;nltk.download(['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'])"
 ./download.sh

The preprocess_qa.sh script makes a few modifications to the DrQA repo; these are done in lines 24-32. Could you also please manually verify that these changes take effect (check after running it)?

If DrQA runs properly, the penultimate line of output from the preprocess_qa.sh script should be "Finished. Total = 215".

BarahFazili commented 3 years ago

Yes, that solved it. Thanks a lot !

Genius1237 commented 3 years ago

Sorry about the issues. We rely on DrQA running in a "deterministic" manner; due to updates to either the Python version or some of the packages, this wasn't happening.

I will update the scripts and the readme with these additional instructions. Were you able to submit and run evaluation properly?

BarahFazili commented 3 years ago

Yes, the score on submission is consistent now. Thanks again.