stanford-futuredata / ARES

https://ares-ai.vercel.app/
Apache License 2.0

Evaluating more than one dataset at a time returns incorrect results #50

Closed: elsatch closed this issue 3 weeks ago

elsatch commented 4 weeks ago

Over the last few weeks I have been evaluating the evaluation features of ARES without getting the expected results. The errors I've found are related to #44, which was marked as closed but never actually solved.

Given that the current state of the code (ares-ai PyPI library 0.6.1) makes it impossible to get a proper ARES Ranking for different datasets in the final results, I decided to explore further.

Baseline

To establish an initial baseline, I executed the reference code from the Quick Start Guide 2. This is the relevant code:

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_unlabeled_output.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

The NQ datasets were downloaded using the wget commands from the setup part of the guide. The checkpoint was not trained locally; it was downloaded from the provided Drive link.

These are the results:

Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300
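
For reference, the block above is just the pretty-printed summary; evaluate_RAG() also returns the same numbers as a Python list with one dict per evaluated (label, dataset) pair, using the keys that appear in the recap further down in this issue. Assuming that structure (observed with 0.6.1, not taken from the docs), the baseline return value is roughly:

# Shape of the baseline return value as observed with ares-ai 0.6.1
# (key names taken from the recap shown later in this issue).
results = [
    {
        "ARES_Prediction": 0.6056978059262574,
        "ARES_Confidence_Interval": [0.547, 0.664],
        "Number_of_Examples_in_Evaluation_Set": 4421,
        "Ground_Truth_Performance": 0.6,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.789,
        "Annotated_Examples_used_for_PPI": 300,
    }
]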

Test - Evaluating more than one dataset at a time

To test this scenario, we download two different evaluation sets derived from the NQ dataset, available in the repository under datasets/eval_datasets/nq, using the following commands:

wget https://github.com/stanford-futuredata/ARES/raw/main/datasets/eval_datasets/nq/nq_ratio_0.65.tsv
wget https://github.com/stanford-futuredata/ARES/raw/main/datasets/eval_datasets/nq/nq_ratio_0.7.tsv

This is the resulting code:

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

And these are the results:

--------------------------------------------------------
Evaluation Sets: ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv']
Checkpoints: ['checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt']
Labels: ['Context_Relevance_Label']
--------------------------------------------------------
[...]
--------------------------------------------------
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6354300416564624]
ARES Confidence Interval: [[0.577, 0.694]]
Number of Examples in Evaluation Set: [4081]
Ground Truth Performance: [0.65]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.792]
Annotated Examples used for PPI: 300
--------------------------------------------------
[...]
--------------------------------------------------
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6354300416564624, 0.6638786279683391]
ARES Confidence Interval: [[0.577, 0.694], [0.605, 0.722]]
Number of Examples in Evaluation Set: [4081, 3790]
Ground Truth Performance: [0.65, 0.7]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.792, 0.798]
Annotated Examples used for PPI: 300
--------------------------------------------------
# Reformatted to make clear that the results are duplicated
[{'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6354300416564624, 'ARES_Confidence_Interval': [0.577, 0.694], 'Number_of_Examples_in_Evaluation_Set': 4081, 'Ground_Truth_Performance': 0.65, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.792, 'Annotated_Examples_used_for_PPI': 300}]

The evaluation first prints the results for the first dataset, then appends the results for the second dataset, and both are correct at that point. In the final recap, however, the first dataset's scores are duplicated into the second entry, so the returned results are incorrect.
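
Since the intermediate printout already contains the correct per-dataset scores, only the final aggregation step appears to be at fault. A minimal, hypothetical sketch of the kind of indexing bug that would produce exactly this output (an illustration only, not the actual ARES source):

# Hypothetical illustration only -- NOT the actual ARES code.
# The per-dataset scores exist and are correct (see the intermediate
# printout above), but the final recap is always built from index 0.
per_dataset_results = [
    {"ARES_Prediction": 0.6354, "Ground_Truth_Performance": 0.65},  # nq_ratio_0.65.tsv
    {"ARES_Prediction": 0.6639, "Ground_Truth_Performance": 0.70},  # nq_ratio_0.7.tsv
]

final_recap = []
for i in range(len(per_dataset_results)):
    final_recap.append(per_dataset_results[0])  # bug: should be per_dataset_results[i]

print(final_recap)
# Both entries are the first dataset's results, matching the duplicated recap above.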

The problem compounds when analyzing several datasets and several labels. In that case the evaluation system still returns incorrect results for any run with more than one dataset: for each label, the results of the second dataset are overwritten with the results of the first dataset, as the output below shows.

[
    # First label - First dataset
    {
        "ARES_Prediction": 0.6354300416564624,
        "ARES_Confidence_Interval": [0.577, 0.694],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # First label - should be second dataset. Duplicated
    {
        "ARES_Prediction": 0.6354300416564624,
        "ARES_Confidence_Interval": [0.577, 0.694],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # Second label - First dataset
    {
        "ARES_Prediction": 0.5664216286857816,
        "ARES_Confidence_Interval": [0.51, 0.622],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65,
        "Annotated_Examples_used_for_PPI": 300,
    },
    # Second label - should be second dataset. Duplicated
    {
        "ARES_Prediction": 0.5664216286857816,
        "ARES_Confidence_Interval": [0.51, 0.622],
        "Number_of_Examples_in_Evaluation_Set": 4081,
        "Ground_Truth_Performance": 0.65,
        "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65,
        "Annotated_Examples_used_for_PPI": 300,
    },
]
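
Until this is fixed, a workaround is to evaluate one dataset at a time and collect the results manually, which sidesteps the faulty aggregation. A rough sketch reusing the config from above (single label shown; the outer loop can be extended to iterate over labels as well):

from ares import ARES

datasets = ["nq_ratio_0.65.tsv", "nq_ratio_0.7.tsv"]
all_results = {}

for dataset in datasets:
    ppi_config = {
        "evaluation_datasets": [dataset],  # one dataset per run
        "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
        "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt"],
        "rag_type": "question_answering",
        "labels": ["Context_Relevance_Label"],
        "gold_label_path": "nq_labeled_output.tsv",
    }
    ares = ARES(ppi=ppi_config)
    # evaluate_RAG() returns a list with one dict per (label, dataset) pair;
    # with a single dataset and a single label this is a one-element list.
    all_results[dataset] = ares.evaluate_RAG()

for dataset, result in all_results.items():
    print(dataset, result)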