seetagopal opened this issue 3 years ago:

I trained BERT on MNLI and evaluated it on the HANS data. The results I got are different from what was given in the paper: I am getting all the predictions as non-entailment. I tried both fine-tuning BERT and pre-training BERT.
Why are my results so drastically different? Could you please help me with this? Thank you.
Interesting!
Are you using HuggingFace? I believe that some earlier versions of HuggingFace had a bug with HANS evaluation. See these GitHub issues: https://github.com/huggingface/transformers/issues/4766 https://github.com/huggingface/transformers/issues/6179
Therefore, if you're using HuggingFace, are you using the latest version?
Other things to look at:
No, I am not using HuggingFace. I have cloned the BERT GitHub repository and just followed the instructions for fine-tuning and pre-training with BERT.
Thanks for the answers! It sounds like you are doing the same thing we did then, which definitely makes the discrepancy puzzling.
Based on this, here are the remaining things I can think of to make sure this is not a bug:
You can use run_classifier.py to get your model's predictions on the MNLI validation set. The command you would use is something like `python run_classifier.py --task_name=MNLI --do_predict=true --data_dir=$GLUE_DIR/MNLI --vocab_file=$BERT_BASE_DIR/vocab.txt --bert_config_file=$BERT_BASE_DIR/bert_config.json --init_checkpoint=$TRAINED_CLASSIFIER --max_seq_length=128 --output_dir=$TRAINED_CLASSIFIER/MNLI`, where you would replace $GLUE_DIR, $BERT_BASE_DIR, and $TRAINED_CLASSIFIER with the relevant directory names. You can use this to be absolutely certain that the labels are not getting mixed up. The order you used (contradiction, entailment, neutral) sounds right to me, but it's still possible that the labels are getting mixed up somewhere. Evaluating on the MNLI validation set should tell you whether that's happening, because you should get an accuracy of around 0.84 if the labels are right, or something a lot lower if the labels are wrong.
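(A minimal sketch of this check is given below. It assumes test_results.tsv holds one row of tab-separated class probabilities per example, in the order of the classifier's label list, and that the gold label is the last column of the GLUE dev_matched.tsv file; the paths and the assumed label order are placeholders, not confirmed details of the setup discussed here.)

```python
# Hypothetical sketch: compute MNLI validation accuracy from run_classifier.py
# predictions under an assumed label order. File names, column positions, and
# the LABELS order below are assumptions, not confirmed details.
import csv

LABELS = ["contradiction", "entailment", "neutral"]  # assumed label order

def load_predicted_labels(path="test_results.tsv"):
    """Each line is assumed to hold tab-separated per-class probabilities."""
    preds = []
    with open(path) as f:
        for line in f:
            probs = [float(x) for x in line.strip().split("\t")]
            preds.append(LABELS[probs.index(max(probs))])
    return preds

def load_gold_labels(path="dev_matched.tsv"):
    """Assumes a header row and the gold label in the last column."""
    with open(path) as f:
        rows = list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))
    return [row[-1] for row in rows[1:]]

preds = load_predicted_labels()
gold = load_gold_labels()
accuracy = sum(p == g for p, g in zip(preds, gold)) / len(gold)
print("MNLI dev (matched) accuracy: %.4f" % accuracy)  # expect roughly 0.84
```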
Thank you so much for your response. I am trying to re-run everything from scratch based on the information you have provided. I will post the results once I have them.
Sounds good! One more idea that occurred to me: We have released our 100 fine-tuned weights online - description here: https://github.com/tommccoy1/hans/tree/master/berts_of_a_feather#model-weights
All 100 of those fine-tuned weights should prefer entailment over non-entailment. So you could try downloading one of them and see if your evaluation pipeline matches the results we reported (to see whether it's a difference in the fine-tuning process or the evaluation process).
Sure, I will do that. Thank you.
I have followed the instructions and tried running everything from scratch. The screenshot below shows the results of fine-tuning BERT on MNLI, in which I replaced the MNLI test file with the HANS evaluation set. I followed the procedure given (https://github.com/tommccoy1/hans/tree/master/berts_of_a_feather#model-weights), except for the git checkout step.
Also, I tried fine-tuning BERT on MNLI and replaced the MNLI test data with the MNLI validation set. I evaluated this result by combining the (neutral, contradiction) labels into non-entailment. To find the accuracy, I followed the same procedure as in the HANS evaluation script (finding the correct and incorrect counts, then computing correct count * 1.0 / total). Below are the results I got:
Entailment results - 0.46
Non-entailment results - 0.7485
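(A rough sketch of the collapsing procedure described above is shown below, assuming preds and gold are plain lists of MNLI label strings, e.g. produced as in the earlier sketch; it is an illustration, not the script that was actually run.)

```python
# Illustrative sketch of the binary evaluation described above: map "neutral"
# and "contradiction" to "non-entailment", then report accuracy separately for
# gold-entailment and gold-non-entailment examples (correct * 1.0 / total).

def collapse(label):
    return "entailment" if label == "entailment" else "non-entailment"

def split_accuracy(preds, gold):
    counts = {"entailment": [0, 0], "non-entailment": [0, 0]}  # [correct, total]
    for p, g in zip(preds, gold):
        p, g = collapse(p), collapse(g)
        counts[g][1] += 1
        if p == g:
            counts[g][0] += 1
    return {label: c * 1.0 / t for label, (c, t) in counts.items() if t}

# Example usage (with preds/gold as lists of label strings):
# print(split_accuracy(preds, gold))
```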
I haven't yet compared my fine-tuned weights with the published fine-tuned weights. I will do that, but looking at the fine-tuning results, they seem to be the same as what I got previously.
Thanks for this info!
Based on the MNLI results you got (0.46 for entailment, 0.7485 for non-entailment), my guess is that the labels are getting mixed up somewhere: BERT fine-tuned on MNLI should get about 0.84 accuracy. And, unlike for HANS, the MNLI accuracy should not vary much based on whether the correct label is "entailment" or "non-entailment."
So, what I would try is this: evaluate your MNLI validation predictions under all 6 possible orderings of the three labels (contradiction, entailment, neutral), and see which ordering gives an accuracy of around 0.84.
I don't know where the labels could be getting mixed up, but it does really sound like that's what is happening.
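(A hedged sketch of this label-permutation check is given below; the file names and the assumption that the gold label is the last column of dev_matched.tsv are placeholders, and this is an illustration rather than the procedure that was actually used.)

```python
# Hypothetical sketch: re-interpret the probability columns of test_results.tsv
# under every permutation of the three MNLI labels and report the resulting
# dev-set accuracy. Whichever ordering gives roughly 0.84 is presumably the
# ordering the fine-tuned classifier actually used. Paths and the assumption
# that the gold label is the last column of dev_matched.tsv are placeholders.
import csv
from itertools import permutations

def load_probability_rows(path="test_results.tsv"):
    with open(path) as f:
        return [[float(x) for x in line.strip().split("\t")] for line in f]

def load_gold_labels(path="dev_matched.tsv"):
    with open(path) as f:
        rows = list(csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE))
    return [row[-1] for row in rows[1:]]  # skip the header row

prob_rows = load_probability_rows()
gold = load_gold_labels()
for order in permutations(["contradiction", "entailment", "neutral"]):
    preds = [order[probs.index(max(probs))] for probs in prob_rows]
    acc = sum(p == g for p, g in zip(preds, gold)) / len(gold)
    print(", ".join(order), "-> %.4f" % acc)
```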
Thank you so much. I tried all 6 combinations as you specified. These are the results I got:
Contradiction Entailment Neutral - 0.4831 0.4568 0.3160
Contradiction Neutral Entailment - 0.4829 0.1504 0.2863
Entailment Contradiction Neutral - 0.2563 0.2403 0.3160
Entailment Neutral Contradiction - 0.2550 0.1505 0.2753
Neutral Entailment Contradiction - 0.1478 0.4587 0.2752
Neutral Contradiction Entailment - 0.1478 0.2752 0.4587
But the results I got seem to be incorrect. I am re-running everything to double-check that I am not making any mistakes.
Happy to help! And good luck - I agree that something seems wrong in these results, since the overall accuracy should be over 0.80.
Sorry for the late reply; I couldn't work on this experiment for a while. But I have started running everything again from scratch, as you suggested.
I am running it in Google Colab with a GPU. I haven't yet tried different seed values, as given in the BERTs of a Feather paper.
I will try the 100 released fine-tuned weights. But fine-tuning BERT on MNLI should also give at least somewhat similar results, right?
Hello again! I did not notice your reply until now - my sincere apologies for that!
I agree with you that fine-tuning BERT on MNLI should give similar results, so unfortunately I'm not able to figure out why the results you are getting are different from the ones in the paper - there aren't any other things to check that come to mind right now.
The reason I just returned to this issue was that someone else emailed me about having the same issue (the issue of fine-tuned BERT always outputting non-entailment on HANS, rather than entailment). I have pointed that person to this issue, and I will let you know if we are able to figure out what is going on in this case.
I'm sorry that I couldn't give a clearer answer! Neural networks can be so sensitive to minor details that it could potentially be any aspect of the setup (potentially even just the specific machines we are using). But I wish we could figure out what factor it is!
Thanks for your reply. I tried using the fine-tuned weights published as part of berts_of_a_feather, with predict_47 as given in the instructions, and I can replicate your results. Then I added the command to fine-tune BERT on MNLI to your predict_47 script and used those weights for prediction; this way too, I can replicate your results. I am not sure what I missed when I fine-tuned it myself, but if I use your script, I can replicate the results.
Glad to hear it, and thank you for following up!