microsoft / GLUECoS

A benchmark for code-switched NLP, ACL 2020
https://microsoft.github.io/GLUECoS
MIT License

QA task scores #52

Closed. BarahFazili closed this issue 3 years ago.

BarahFazili commented 3 years ago

There's no test set for QA, so I believe the scores shown after the git PR are computed on the same dev set. Since the dev set does have labels, we should have been able to use the F1 scores printed locally (which look okay, ~72 for lr=5e-6, bs=2, epochs=16, max_seq=512, seed=32). I don't understand why the scores retrieved via the pull request are so different and extremely poor (~25.3). Please let me know if there's anything I could be missing here, or what explains this inconsistency.

PS: the model is bert-base-multilingual-cased (model type bert).

Genius1237 commented 3 years ago

Hi. Could you post a complete training log for QA somewhere and link it here? I need to have a look at it before I can say anything.

BarahFazili commented 3 years ago

For QA there's a train set and a dev set, and evaluation is done on the dev set. The dev set seems to be provided with correct labels (not just placeholders), so the F1 score printed locally on this set should have been the same as the one rendered through the PR. Below is the log from a run with default params, after uncommenting the parts of run_squad.py that print the results after evaluation.

bash train.sh bert-base-multilingual-cased bert QA_EN_HI

Fine-tuning bert-base-multilingual-cased on QA_EN_HI
06/25/2021 18:23:58 - WARNING - main - Process rank: -1, device: cuda, n_gpu: 1, distributed training: False
Some weights of the model checkpoint at bert-base-multilingual-cased were not used when initializing BertForQuestionAnswering: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias']

The score reported through the PR was QA_EN_HI: 19.444444444444443, while the value printed locally is around 66.
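For reference, the locally printed number presumably comes from the standard SQuAD-style F1. A minimal sketch of that metric is below; the file names predictions.json and dev-v2.0.json and the SQuAD v2.0 layout (data -> paragraphs -> qas) are assumptions, and the real run_squad.py evaluation also reports exact match and handles further details.

# Minimal sketch of the SQuAD-style token-overlap F1 (assumed metric).
# predictions.json is assumed to map question id -> predicted answer string;
# dev-v2.0.json is assumed to follow the SQuAD v2.0 layout.
import collections
import json
import re
import string

def normalize(text):
    # SQuAD convention: lowercase, drop punctuation and articles, squash whitespace
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def f1(prediction, gold):
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # empty answers only count on exact match (no-answer case in SQuAD v2)
        return float(pred_tokens == gold_tokens)
    common = collections.Counter(pred_tokens) & collections.Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def evaluate(dev_path="dev-v2.0.json", pred_path="predictions.json"):
    dataset = json.load(open(dev_path))["data"]
    predictions = json.load(open(pred_path))
    scores = []
    for article in dataset:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                golds = [a["text"] for a in qa["answers"]] or [""]
                pred = predictions.get(qa["id"], "")
                scores.append(max(f1(pred, g) for g in golds))
    return 100.0 * sum(scores) / len(scores)

print("F1: %.2f" % evaluate())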

Genius1237 commented 3 years ago

Could you share the file Data/Processed_Data/QA_EN_HI/dev-v2.0.json?

Since not all the questions in the original dataset have contexts, DrQA is used to retrieve contexts for these questions from Wikipedia. It looks like when you ran DrQA, it generated contexts for more examples, as the predictions.json that you have uploaded has more entries in it.

BarahFazili commented 3 years ago

Please find attached the dev file.


Genius1237 commented 3 years ago

Could you upload it to a file-sharing site or somewhere like Pastebin?

Genius1237 commented 3 years ago

The file that I have on my end does not have questions with ID 235 onwards in the dev set. Could you make a backup of the dev file, delete the keys with ID 235 onwards, and try running training again? It looks like the higher score you are getting is due to these extra questions being considered part of the dev set.
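One rough way to do that pruning is sketched below; it assumes the numeric IDs sit on the qas entries of the SQuAD-format file, so adjust the filter if they are attached at the title level instead.

# Sketch: drop questions with ID 235 onwards from the SQuAD-format dev file,
# keeping a backup of the original. The location of the IDs is an assumption.
import json
import shutil

DEV_PATH = "Data/Processed_Data/QA_EN_HI/dev-v2.0.json"
shutil.copy(DEV_PATH, DEV_PATH + ".bak")  # backup before editing

with open(DEV_PATH) as f:
    dev = json.load(f)

for article in dev["data"]:
    for paragraph in article["paragraphs"]:
        paragraph["qas"] = [qa for qa in paragraph["qas"] if int(qa["id"]) < 235]
    # drop paragraphs that are left with no questions
    article["paragraphs"] = [p for p in article["paragraphs"] if p["qas"]]

with open(DEV_PATH, "w") as f:
    json.dump(dev, f, ensure_ascii=False)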

BarahFazili commented 3 years ago

Here's the dev set I had been using: Data/Processed_Data/QA_EN_HI/dev-v2.0.json. Even after removing the titles with IDs 235 onwards, the inconsistency persists: the local dev set F1 score is ~67 while the PR gives around 24!

Genius1237 commented 3 years ago

I will check and get back. When you updated the dev file, did you delete the cache file in the same directory? If not, please delete that file and try re-running.
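If it helps, a small sketch for clearing the cached features is below. The cached_* naming follows the usual run_squad.py convention, but the exact file name on disk may differ, so check the directory listing first.

# Sketch: remove the cached feature files so the edited dev set is re-tokenized.
import glob
import os

for path in glob.glob("Data/Processed_Data/QA_EN_HI/cached_*"):
    print("removing", path)
    os.remove(path)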

BarahFazili commented 3 years ago

I also tried after deleting the cached dev file; that didn't help either.

Genius1237 commented 3 years ago

It seems that the QA dataset processing scripts are returning more data points in your case than what is actually expected.

I would suggest that you rerun the QA preprocessing alone in a new python:3.6 docker container and check the dataset that you obtain. Please check whether the train set has 259 entries and the dev set has 54 entries.
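A quick way to check those counts is sketched below; the train file name train-v2.0.json is an assumption based on the dev-v2.0.json naming used elsewhere in this thread.

# Sketch: count QA examples in the processed files and compare against the
# expected sizes (259 train / 54 dev).
import json

def count_examples(path):
    with open(path) as f:
        data = json.load(f)["data"]
    return sum(len(p["qas"]) for article in data for p in article["paragraphs"])

for name in ("train-v2.0.json", "dev-v2.0.json"):
    path = "Data/Processed_Data/QA_EN_HI/" + name
    print(name, count_examples(path))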

BarahFazili commented 3 years ago

I've been getting 313 entries in the train set and 70 in the dev set.

Genius1237 commented 3 years ago

I seem to have figured out what the issue is. Would you be able to try running the code again with a few changes?

When you're running the docker container, use this exact image: python:3.6.10. Also, apply this patch to the GLUECoS repo. It changes 2 lines in the Data/Preprocess_Scripts/preprocess_qa.sh file.

index 1fdaf95..3f57130 100644
--- a/Data/Preprocess_Scripts/preprocess_qa.sh
+++ b/Data/Preprocess_Scripts/preprocess_qa.sh
@@ -15,9 +15,9 @@ python $PREPROCESS_DIR/preprocess_drqa.py --data_dir $ORIGINAL_DATA_DIR
 git clone https://github.com/facebookresearch/DrQA.git
 cd DrQA
 git checkout 96f343c
-pip install -r requirements.txt
+pip install elasticsearch==7.8.0 nltk==3.5 scipy==1.5.0 prettytable==0.7.2 tqdm==4.46.1 regex==2020.6.8 termcolor==1.1.0 scikit-learn==0.23.1 numpy==1.18.5 torch==1.4.0
 python setup.py develop
-pip install spacy
+pip install spacy==2.3.0
 python -m spacy download xx_ent_wiki_sm
 python -c "import nltk;nltk.download(['punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', 'words'])"
 ./download.sh

The preprocess_qa.sh script makes a few modifications to the DrQA repo; these are done in lines 24-32. Could you also please manually verify that these changes take effect (check after running it)?

If DrQA runs properly, the penultimate line of output from the preprocess_qa.sh script should be "Finished. Total = 215".

BarahFazili commented 3 years ago

Yes, that solved it. Thanks a lot !

Genius1237 commented 3 years ago

Sorry about the issues. We rely on DrQA running in a "deterministic" manner; due to updates to either the Python version or some of the packages, this wasn't happening.

I will update the scripts and the readme with these additional instructions. Were you able to submit and run evaluation properly?

BarahFazili commented 3 years ago

Yes, the score on submission is consistent now. Thanks again.