princeton-nlp / MABEL

EMNLP 2022: "MABEL: Attenuating Gender Bias using Textual Entailment Data" https://arxiv.org/abs/2210.14975
MIT License

Different results on Bias-NLI evaluation dataset #3

Open Alex357853 opened 1 year ago

Alex357853 commented 1 year ago

Hi,

For the extrinsic benchmark (Natural Language Inference), I downloaded your provided princeton-nlp/mabel-bert-base-uncased checkpoint (mabel_checkpoint_best.pt) and ran the evaluation script you provided with:

    python eval.py --model_name_or_path bert-base-uncased --load_from_file nli-mabel/mabel_checkpoint_best.pt --eval_data_path bias-nli/nli-dataset.csv

However, I did not get the same high numbers that you report:

total net neutral:        0.9170128866319063
total fraction neutral:   0.9828041166841379
total tau 0.5 neutral:    0.9824839530908697
total tau 0.7 neutral:    0.9681385585408803

My results were as follows:

total net neutral:        0.872817846119612
total fraction neutral:   0.9396406253855012
total tau 0.5 neutral:    0.9385957916322645
total tau 0.7 neutral:    0.9039280814706394

I was wondering whether there might be a mistake in my setup, or whether others have reported similar results. Thank you for taking the time to read my message; any help would be greatly appreciated.

jacqueline-he commented 1 year ago

Hi,

I'm not able to reproduce your results; I just got numbers very close to the ones reported in the GitHub repo (see screenshot with timestamps below).

[Screenshot 2023-05-10 at 9:25:35 PM]

One thing that comes to mind is that it is important to use the exact environment (following the package versions specified in requirements.txt). We have found that different torch or huggingface versions lead to some variation across all experiments. Different hardware configurations could also cause minor discrepancies; this run was on one V100 GPU.
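For reference, a quick way to confirm your environment before re-running eval.py is something like the snippet below (just an illustrative check; the versions to actually match are the ones pinned in requirements.txt):

    # Print the versions of the packages most likely to affect results,
    # plus the GPU being used, so runs can be compared apples-to-apples.
    import torch
    import transformers
    import datasets

    print("torch:", torch.__version__)
    print("transformers:", transformers.__version__)
    print("datasets:", datasets.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("GPU:", torch.cuda.get_device_name(0))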

It is also possible that the uploaded code / model download links are not correct, as I have done some refactoring after the initial upload. I can check this more thoroughly over the weekend.

Alex357853 commented 1 year ago

Hi,

Thank you for your prompt response. I noticed that in your screenshot the number of examples is 193651, while the provided nli-dataset.csv file contains 1936511 examples. When I run the following data-loading code from eval.py,

    eval_dataset = load_dataset(
        "csv",
        data_files=args.eval_data_path,
        split="train[:-10%]",
        cache_dir=args.cache_dir,
    )

it prints 05/11/2023 14:16:12 - INFO - __main__ - Number of examples: 1742861, which is much larger than 193651. I am wondering whether you are using a different dataset; if so, could you please provide a link? I am also wondering why this code evaluates only a portion of the data rather than the entire dataset. Thank you for taking the time to assist me.
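For context, my understanding of the split="train[:-10%]" argument (a small sketch of the standard datasets-library slicing semantics, not the actual eval.py logic) is that it keeps roughly the first 90% of the rows and drops the last 10%:

    # "train[:-10%]" keeps everything except the last ~10% of rows, so for a
    # CSV with 1,936,511 rows it yields roughly 1.74M examples, not all of them.
    from datasets import load_dataset

    full = load_dataset("csv", data_files="bias-nli/nli-dataset.csv", split="train")
    subset = load_dataset("csv", data_files="bias-nli/nli-dataset.csv", split="train[:-10%]")

    print(len(full))    # total number of rows in the CSV
    print(len(subset))  # ~90% of the rows (the last ~10% is dropped)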

Elfsong commented 7 months ago

Hi @Alex357853, could you share the nli-dataset.csv file? Thank you!