Unable to reproduce results

Hi there, I am running the evaluation metrics of Lexical Match and BertScore. But I am getting extremely different results from those found in the paper. I wanted to bring this to your attention to make sure I am not making any mistakes in the code and set up.

I start by loading the EVOUNA benchmark and combining the points into a single dataframe such that each row is an example of its own (with a gold answer and prediction from some model)

def load_evouna():
    # question, prediction, reference
    nq = pd.read_json("data/raw/EVOUNA/NQ.json")
    tq = pd.read_json("data/raw/EVOUNA/TQ.json")
    columns = ["dataset_name", "question_id", "question", "prediction", "reference", "judge", "model"]
    models = ['fid', 'gpt35', 'chatgpt', 'gpt4', 'newbing']
    data = []
    dname = "nq"
    for i in range(len(nq)):
        row = nq.iloc[i]
        id = i
        question = row["question"]
        reference = row["golden_answer"]        
        for model in models:
            data.append([dname, id, question, row[f"answer_{model}"], reference, row[f"judge_{model}"], model])
    dname = "tq"
    for i in range(len(tq)):
        row = tq.iloc[i]
        id = i
        question = row["question"]
        reference = row["golden_answer"]
        for model in models:
            data.append([dname, id, question, row[f"answer_{model}"], reference, row[f"judge_{model}"], model])
    return pd.DataFrame(data, columns=columns)

The above code gives me a dataframe with columns: ["dataset_name", "question_id", "question", "prediction", "reference", "judge", "model"] and has (after dropping NaNs) 25358 rows.

I loop over the dataframe and compute the scores in the following ways:

from evaluate import load

class BertScore:
    def __init__(self) -> None:
        self.bertscorer = load("bertscore")
        self.metric_names = ["bertscore_precision", "bertscore_recall", "bertscore_f1"]

    def perform(self, prediction, reference):
        score = self.bertscorer.compute(predictions=[prediction], references=[reference], lang="en")
        return score['precision'][0], score['recall'][0], score['f1'][0]

class ExactMatch():
    def __init__(self): 
        self.metric_names = ["match"]

    def perform(self, prediction, reference):
        return reference.strip().lower() in prediction.strip().lower()

This tries to follow the lexicalMatch description outlined in the paper. With this, I get very different results from the original paper. I get a bertscore correlation of:

>>> df.groupby(['dataset_name', 'model']).corr()['judge']
dataset_name  model
nq            chatgpt  judge                  1.000000
                       bertscore_precision    0.231207
                       bertscore_recall       0.449337
                       bertscore_f1           0.414990
              fid      judge                  1.000000
                       bertscore_precision    0.567505
                       bertscore_recall       0.373492
                       bertscore_f1           0.477999
              gpt35    judge                  1.000000
                       bertscore_precision    0.237243
                       bertscore_recall       0.471942
                       bertscore_f1           0.415955
              gpt4     judge                  1.000000
                       bertscore_precision    0.235668
                       bertscore_recall       0.453375
                       bertscore_f1           0.434757
              newbing  judge                  1.000000
                       bertscore_precision    0.092893
                       bertscore_recall       0.357710
                       bertscore_f1           0.269635
tq            chatgpt  judge                  1.000000
                       bertscore_precision    0.158161
                       bertscore_recall       0.069338
                       bertscore_f1           0.130286
              fid      judge                  1.000000
                       bertscore_precision    0.347997
                       bertscore_recall       0.007923
                       bertscore_f1           0.148861
              gpt35    judge                  1.000000
                       bertscore_precision    0.287891
                       bertscore_recall       0.095683
                       bertscore_f1           0.220131
              gpt4     judge                  1.000000
                       bertscore_precision    0.165552
                       bertscore_recall       0.066070
                       bertscore_f1           0.122902
              newbing  judge                  1.000000
                       bertscore_precision    0.002551
                       bertscore_recall       0.136704
                       bertscore_f1           0.117788

The paper uses f1 with a threshold of 0.5, but if I use this the accuracy is less than 20%

Similarly the lexical match metric shows an accuracy of 41% which is around half of the amount shown in the paper.

Could the evaluation code be released so that I can understand what my mistake is here?

wangcunxiang / QA-Eval

Unable to reproduce results #4