tommccoy1 / hans

Heuristic Analysis for NLI Systems
MIT License

Different results #6

Open seetagopal opened 3 years ago

seetagopal commented 3 years ago

I trained BERT on MNLI and evaluated it on the HANS data, but I got results that differ from those reported in the paper: all of the predictions come out as non-entailment. I tried both fine-tuning BERT and pre-training BERT myself. (Two screenshots of the results are attached.)

Why are my results so drastically different? Could you please help me with this? Thank you.

tommccoy1 commented 3 years ago

Interesting!

Are you using HuggingFace? I believe that some earlier versions of HuggingFace had a bug with HANS evaluation. See these GitHub issues: https://github.com/huggingface/transformers/issues/4766 https://github.com/huggingface/transformers/issues/6179

Therefore, if you're using HuggingFace, are you using the latest version?

Other things to look at: how exactly you fine-tuned BERT (pre-trained model, hyperparameters, number of epochs), how you substituted the HANS data for the MNLI test set, which label order you assumed when reading the predictions, and whether you have tried multiple random seeds.

seetagopal commented 3 years ago

No, I am not using HuggingFace. I cloned the BERT GitHub repository and followed its instructions for fine-tuning and pre-training.

  1. For fine-tuning, I used BERT-Base Uncased as the pre-trained model and ran the classifier on the MNLI dataset, replacing the test data with the HANS evaluation data. I didn't change any other parameters given on the BERT GitHub page. For pre-training, I pre-trained BERT myself without using any of the released pre-trained weights.
  2. The first screenshot is the output of fine-tuning BERT (BERT-Base Uncased), and the second is the output of pre-training BERT from scratch (without any pre-trained weights).
  3. I fine-tuned on the MNLI training data and replaced the MNLI test data with the HANS evaluation data (the first screenshot above is the result).
  4. In run_classifier.py (BERT), the labels are ordered as contradiction, entailment, neutral, so I map my results in that order (the second column of my predictions is entailment). In my evaluation script, I map entailment to entailment and the other two labels to non-entailment (a rough sketch of this mapping is below, after this list).
  5. I am not changing any of the hyperparameters given on the BERT GitHub page (num_train_epochs=3.0).
  6. I haven't tried multiple seeds yet. I read the paper that re-runs BERT 100 times, but since my results differ from what was reported there, I first want to make sure I am not making a mistake. I will definitely try multiple seeds. Thank you so much.
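
Roughly, the mapping in point 4 looks like this (a minimal sketch; the test_results.tsv path and the contradiction/entailment/neutral column order are assumptions based on the BERT run_classifier.py setup):

```python
# Minimal sketch of the mapping in point 4 above.
# Assumptions: predictions are in BERT's test_results.tsv (one row of three
# probabilities per example) and the columns follow run_classifier.py's MNLI
# label order: contradiction, entailment, neutral.

LABELS = ["contradiction", "entailment", "neutral"]  # assumed column order

def to_hans_label(prob_row):
    """Collapse a three-way MNLI prediction into HANS's binary label."""
    three_way = LABELS[max(range(3), key=lambda i: prob_row[i])]
    return "entailment" if three_way == "entailment" else "non-entailment"

with open("test_results.tsv") as f:
    preds = [to_hans_label([float(x) for x in line.split("\t")])
             for line in f if line.strip()]
```
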
tommccoy1 commented 3 years ago

Thanks for the answers! It sounds like you are doing the same thing we did then, which definitely makes the discrepancy puzzling.

Based on this, here are the remaining things I can think of to make sure this is not a bug:

  1. At this link, we describe exactly how we fine-tuned BERT on MNLI. It sounds like this is what you did too. But, just in case there are details I'm forgetting about, you can compare to our description to see if anything is different: https://github.com/tommccoy1/hans/tree/master/berts_of_a_feather#fine-tuning-bert-on-mnli. You could even try following those steps yourself, though I know that would take a while to fine-tune BERT again.
  2. I would recommend evaluating your model on MNLI (you can use the MNLI validation set, since its labels are publicly available, unlike the test set's). You should be able to do this by replacing the MNLI test set with the MNLI validation set and then using run_classifier.py to get your model's predictions on it. The command would be something like `python run_classifier.py --task_name=MNLI --do_predict=true --data_dir=$GLUE_DIR/MNLI --vocab_file=$BERT_BASE_DIR/vocab.txt --bert_config_file=$BERT_BASE_DIR/bert_config.json --init_checkpoint=$TRAINED_CLASSIFIER --max_seq_length=128 --output_dir=$TRAINED_CLASSIFIER/MNLI`, where you would replace $GLUE_DIR, $BERT_BASE_DIR, and $TRAINED_CLASSIFIER with the relevant directory names. This lets you be certain that the labels are not getting mixed up. The order you used (contradiction, entailment, neutral) sounds right to me, but it's still possible that the labels are getting crossed somewhere. Evaluating on the MNLI validation set should tell you whether that's happening: you should get an accuracy of around 0.84 if the labels are right, and something much lower if they are wrong. A rough sketch of this check follows.
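
Something like the following (a minimal sketch, not our exact script; it assumes the GLUE dev_matched.tsv file with the gold label in its last column, and run_classifier.py's test_results.tsv output with the contradiction/entailment/neutral column order):

```python
# Minimal sketch of the MNLI validation check in point 2.
# Assumptions: dev_matched.tsv is the GLUE MNLI validation file (header row,
# gold label in the last column) and test_results.tsv is run_classifier.py's
# prediction output, with columns ordered contradiction, entailment, neutral.
import csv

LABELS = ["contradiction", "entailment", "neutral"]  # assumed column order

with open("dev_matched.tsv") as f:
    rows = list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))
gold = [row[-1] for row in rows[1:]]  # skip the header row

with open("test_results.tsv") as f:
    probs = [[float(x) for x in line.split("\t")] for line in f if line.strip()]
pred = [LABELS[max(range(3), key=lambda i: p[i])] for p in probs]

correct = sum(g == p for g, p in zip(gold, pred))
print("MNLI validation accuracy:", correct / len(gold))  # expect roughly 0.84
```
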


seetagopal commented 3 years ago

Thank you so much for your response. I am re-running everything from scratch based on the information you have provided. I will post the results once I have them.

tommccoy1 commented 3 years ago

Sounds good! One more idea that occurred to me: we have released the weights of our 100 fine-tuned models online - description here: https://github.com/tommccoy1/hans/tree/master/berts_of_a_feather#model-weights

All 100 of those fine-tuned models should prefer entailment over non-entailment on HANS. So you could try downloading one of them and seeing whether your evaluation pipeline reproduces the results we reported (to determine whether the difference lies in the fine-tuning process or the evaluation process).

seetagopal commented 3 years ago

Sure I will do that. Thank you

seetagopal commented 3 years ago

I have followed the instructions and tried running everything from scratch. The screenshot below shows the result of fine-tuning BERT on MNLI with the MNLI test file replaced by the HANS evaluation set. I followed the procedure given at https://github.com/tommccoy1/hans/tree/master/berts_of_a_feather#model-weights, except for the git checkout. (Screenshot attached.)

I also fine-tuned on MNLI and replaced the MNLI test data with the MNLI validation set. I evaluated the result by collapsing the (neutral, contradiction) labels into non-entailment. To compute accuracy, I followed the same procedure as the HANS evaluation script (count the correct and incorrect predictions, then correct * 1.0 / total). Below are the results I got: entailment 0.46, non-entailment 0.7485.
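
For reference, the accuracy computation I used is roughly this (a minimal sketch, assuming `gold` and `pred` are lists of the collapsed binary labels in corresponding order):

```python
# Minimal sketch of the per-class accuracy computation described above.
# Assumption: `gold` and `pred` are lists of the collapsed binary labels
# ("entailment" / "non-entailment") in corresponding order.

def per_class_accuracy(gold, pred, target):
    """Accuracy restricted to examples whose gold label is `target`."""
    pairs = [(g, p) for g, p in zip(gold, pred) if g == target]
    correct = sum(g == p for g, p in pairs)
    return correct * 1.0 / len(pairs)

# Example usage (with hypothetical lists):
# print(per_class_accuracy(gold, pred, "entailment"))
# print(per_class_accuracy(gold, pred, "non-entailment"))
```
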

I haven't yet compared my fine-tuned weights with the published fine-tuned weights. I will do that, but the fine-tuning results above look the same as what I got previously.

tommccoy1 commented 3 years ago

Thanks for this info!

Based on the MNLI results you got (0.46 for entailment, 0.7485 for non-entailment), my guess is that the labels are getting mixed up somewhere: BERT fine-tuned on MNLI should get about 0.84 accuracy. And (unlike on HANS) the MNLI accuracy should not vary much based on whether the correct label is "entailment" or "non-entailment".

So, what I would try is this: score your MNLI validation predictions under all six possible orderings of the three labels (contradiction, entailment, neutral) and see whether any ordering gives the expected accuracy of around 0.84.

I don't know where the labels could be getting mixed up, but it really does sound like that's what is happening. A rough sketch of the permutation check follows.
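
Something along these lines (a sketch only; the dev_matched.tsv and test_results.tsv file names and column layout are the same assumptions as in the earlier sketch):

```python
# Rough sketch of the permutation check: score the saved predictions under
# every possible label ordering and see which one gives roughly 0.84.
# Assumptions: dev_matched.tsv is the GLUE MNLI validation file (gold label in
# the last column) and test_results.tsv is run_classifier.py's prediction output.
import csv
from itertools import permutations

with open("dev_matched.tsv") as f:
    rows = list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))
gold = [row[-1] for row in rows[1:]]  # skip the header row

with open("test_results.tsv") as f:
    probs = [[float(x) for x in line.split("\t")] for line in f if line.strip()]

for order in permutations(["contradiction", "entailment", "neutral"]):
    pred = [order[max(range(3), key=lambda i: p[i])] for p in probs]
    acc = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    print(order, round(acc, 4))
# Exactly one ordering should come out near 0.84; that is the order the
# prediction columns actually follow.
```
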

seetagopal commented 3 years ago

Thank you so much. I tried all six combinations as you specified. These are the results I got:

| Assumed label order | Results |
| --- | --- |
| Contradiction, Entailment, Neutral | 0.4831, 0.4568, 0.3160 |
| Contradiction, Neutral, Entailment | 0.4829, 0.1504, 0.2863 |
| Entailment, Contradiction, Neutral | 0.2563, 0.2403, 0.3160 |
| Entailment, Neutral, Contradiction | 0.2550, 0.1505, 0.2753 |
| Neutral, Entailment, Contradiction | 0.1478, 0.4587, 0.2752 |
| Neutral, Contradiction, Entailment | 0.1478, 0.2752, 0.4587 |

But these results seem to be incorrect. I am trying to re-run everything to double-check that I am not making a mistake anywhere.

tommccoy1 commented 3 years ago

Happy to help! And good luck - I agree that something seems wrong in these results, since the overall accuracy should be over 0.80.

gseetha04 commented 3 years ago

Sorry for the late reply. I couldn't work on this experiment for a while, but I have started running everything again from scratch. As you suggested:

  1. I cloned the BERT repository and downloaded MNLI from the GLUE data. I then replaced the MNLI test data with its validation data, after deleting the gold-label column.
  2. I fine-tuned BERT on MNLI and generated predictions. Comparing the predictions with the gold labels gives an overall accuracy of 83% using the (contradiction, entailment, neutral) label order, so I believe that label order is correct.
  3. I substituted the HANS evaluation set for the MNLI test data and ran BERT on it (using the same fine-tuned weights I used for the MNLI validation data). The predictions on the HANS data are all non-entailment, which is the opposite of the results reported in the paper.

I am running everything in Google Colab with a GPU. I haven't yet tried different seed values as in the BERTs of a Feather paper.

I will try the 100 released fine-tuned weights. But fine-tuning BERT on MNLI myself should also give at least broadly similar results, right?

tommccoy1 commented 2 years ago

Hello again! I did not notice your reply until now - my sincere apologies for that!

I agree with you that fine-tuning BERT on MNLI should give similar results, so unfortunately I'm not able to figure out why your results differ from the ones in the paper - there isn't anything else to check that comes to mind right now.

The reason I just returned to this issue was that someone else emailed me about having the same issue (the issue of fine-tuned BERT always outputting non-entailment on HANS, rather than entailment). I have pointed that person to this issue, and I will let you know if we are able to figure out what is going on in this case.

I'm sorry that I couldn't give a clearer answer! Neural networks can be so sensitive to minor details that the cause could be almost any aspect of the setup (potentially even just the specific machines we are using). But I wish we could figure out which factor it is!

gseetha04 commented 2 years ago

Thanks for your reply. I tried the fine-tuned weights published as part of berts_of_a_feather, using predict_47 as given in the instructions, and I can replicate your results. I then added the command to fine-tune BERT on MNLI to your predict_47 script and used those weights for prediction; doing it that way, I can also replicate your results. I am not sure what I missed when I fine-tuned it myself, but if I use your script, I can replicate the results.

tommccoy1 commented 2 years ago

Glad to hear it, and thank you for following up!