stanford-futuredata / ARES

https://ares-ai.vercel.app/
Apache License 2.0

Evaluation process only works with demo datasets, fails with any real dataset (that has only the columns described in the paper) #53

Open elsatch opened 3 weeks ago

elsatch commented 3 weeks ago

According to section 3.3 of the ARES paper:

"Ranking RAG Systems with Confidence Intervals

Once we have prepared our LLM judges, we need to use them to score and rank the competing RAG systems. To do this, ARES samples the in-domain query-document-answer triples produced by each RAG approach, and the judges label each triple, predicting their context relevance, answer faithfulness, and answer relevance. By averaging the individual predicted labels for each in-domain triple, we calculate the RAG system performance across each of the three metrics."
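As a toy illustration of that last step (my own sketch, not ARES code): the judge assigns a binary label to each sampled triple, and the RAG system's score for a metric is simply the mean of those labels.

# Hypothetical Context_Relevance judgments produced by the judge for one RAG system
judge_labels = [1, 1, 0, 1, 0, 1, 1, 1]

# The system's context relevance score is the average of the predicted labels
context_relevance_score = sum(judge_labels) / len(judge_labels)
print(context_relevance_score)  # 0.75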

So, to evaluate a RAG configuration you should provide in-domain query-document-answer triples. The ARES code in the repo doesn't support that claim and only works with the example datasets provided, which carry all kinds of additional columns for benchmarking purposes.

This is a major issue because it makes it impossible to evaluate a real RAG configuration with your own data, which has only the columns indicated in the ARES paper.

Baseline configuration

This is our sample code to evaluate a RAG configuration with the example datasets provided in the repo.

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_ratio_0.65.tsv', 'nq_ratio_0.7.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt",
                    "checkpoints/ares_answer_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label", "Answer_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

This code returns an evaluation of the RAG configuration using the provided datasets and checkpoints.

How to reproduce the issue:

To reproduce the issue, we drop every column except the query-document-answer columns from the example datasets and try to evaluate a RAG configuration with them.

import pandas as pd

df_065 = pd.read_csv("nq_ratio_0.65.tsv", sep="\t")
df_07 = pd.read_csv("nq_ratio_0.7.tsv", sep="\t")

# Keep only the query-document-answer columns described in the paper
df_065 = df_065[["Query", "Document", "Answer"]]
df_07 = df_07[["Query", "Document", "Answer"]]
df_065.to_csv("nq_ratio_0.65_querydocanswer.tsv", sep="\t", index=False)
df_07.to_csv("nq_ratio_0.7_querydocanswer.tsv", sep="\t", index=False)

Note the original columns in the nq datasets:

print(df_065.columns)

# Index(['id', 'input', 'meta', 'output', 'wikipedia_id', 'Document',
#       'paragraph_number', 'Answer', 'Query', 'Context_Relevance_Label',
#       'Answer_Faithfulness_Label', 'Answer_Relevance_Label'],
#      dtype='object')
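For comparison, the stripped files keep only the three columns from the paper:

print(pd.read_csv("nq_ratio_0.65_querydocanswer.tsv", sep="\t").columns)

# Index(['Query', 'Document', 'Answer'], dtype='object')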

The new datasets only have the query-document-answer columns. Now we will try to evaluate those configurations again.

from ares import ARES

ppi_config = { 
    "evaluation_datasets": ['nq_ratio_0.65_querydocanswer.tsv', 'nq_ratio_0.7_querydocanswer.tsv'], 
    "few_shot_examples_filepath": "nq_few_shot_prompt_for_judge_scoring.tsv",
    "checkpoints": ["checkpoints/ares_context_relevance_general_checkpoint_V1.1.pt",
                    "checkpoints/ares_answer_relevance_general_checkpoint_V1.1.pt"], 
    "rag_type": "question_answering", 
    "labels": ["Context_Relevance_Label", "Answer_Relevance_Label"], 
    "gold_label_path": "nq_labeled_output.tsv", 
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

This code will raise an error when accessing the second label:

Traceback (most recent call last):                                                                                                                                                                                             
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Answer_Relevance_Label'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data/work/external/pip_ARES_061/test_only_sample_datasets_bug.py", line 14, in <module>
    results = ares.evaluate_RAG()
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/ares/ares.py", line 144, in evaluate_RAG
    return rag_scoring_config(**self.ppi_config)
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/ares/rag_scoring.py", line 141, in rag_scoring_config
    test_set, Y_labeled_dataset, Y_labeled_dataloader, Y_labeled_predictions, Yhat_unlabeled_dataset, prediction_column = post_process_predictions(post_process_settings)
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py", line 1042, in post_process_predictions
    test_set = test_set[test_set[label] != 0]
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/mnt/data/work/external/pip_ARES_061/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'Answer_Relevance_Label'

This error comes from post_process_predictions(). Right after the evaluation of the first label finishes and post-processing of the predictions starts, the code iterates over the label columns, trying to remove invalid records. Since it tries to access a non-existent column, pandas raises a KeyError and the evaluation process aborts.

if label_column in test_set.columns:
    test_set = test_set[test_set[label_column].notna()]

for label in labels:
    if label != label_column:
        test_set = test_set[test_set[label] != 0]
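A possible guard (just a sketch, not a tested patch) would be to only filter on label columns that actually exist in the evaluation dataset, instead of assuming they are present:

for label in labels:
    # Skip labels that the user's dataset does not contain
    if label != label_column and label in test_set.columns:
        test_set = test_set[test_set[label] != 0]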

Secondary issue

Given the issue, it is tempting to just add the missing columns to the datasets without putting any values in them. This results in an error too, in preprocess_data(), as the whole process is so tied to the example datasets!

When the columns are added with empty content, the error "Insufficient Data: Dataset has fewer than 10 rows after filtering!" is raised.

# All records will be dropped here as the column is full of NaNs
if label_column in test_set.columns:
    test_set = test_set[test_set[label_column].notna()]

# Combine query and document (and answer if applicable) into the text column
# [..] if "Context" in label_column:

# Preprocessing will fail because all rows have been dropped (the label column is full of NaNs)

# Check if the dataset has fewer than 10 rows after filtering
if len(test_set) < 10:
    raise ValueError("Insufficient Data: Dataset has fewer than 10 rows after filtering!")
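For reference, this is roughly how I added the empty columns (a sketch; the output filename is just an example):

import pandas as pd

# Add the label columns ARES expects, but leave them empty
df_065 = pd.read_csv("nq_ratio_0.65_querydocanswer.tsv", sep="\t")
df_065["Context_Relevance_Label"] = pd.NA
df_065["Answer_Relevance_Label"] = pd.NA
df_065.to_csv("nq_ratio_0.65_empty_labels.tsv", sep="\t", index=False)

# Evaluating this file fails with:
# ValueError: Insufficient Data: Dataset has fewer than 10 rows after filtering!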

Filling the columns with random data makes the evaluation process run, but this behavior is completely counter to what a robust RAG evaluation framework should be, as these columns might confuse end users, induce fake results, etc.

import pandas as pd
import random

df_065 = pd.read_csv("nq_ratio_0.65_querydocanswer.tsv", sep="\t")
df_07 = pd.read_csv("nq_ratio_0.7_querydocanswer.tsv", sep="\t")

# Fill the Context_Relevance_Label and Answer_Relevance_Label columns with random data
df_065["Context_Relevance_Label"] = [random.randint(0, 1) for _ in range(len(df_065))]
df_065["Answer_Relevance_Label"] = [random.randint(0, 1) for _ in range(len(df_065))]
df_07["Context_Relevance_Label"] = [random.randint(0, 1) for _ in range(len(df_07))]
df_07["Answer_Relevance_Label"] = [random.randint(0, 1) for _ in range(len(df_07))]
df_065.to_csv("nq_ratio_0.65_random_label_values.tsv", sep="\t", index=False)
df_07.to_csv("nq_ratio_0.7_random_label_values.tsv", sep="\t", index=False)

Note: I launched the evaluation process with the random data and it more or less worked, but it ran out of memory after running for about 4 hours. It should work with that data, but I have not completely verified that.

Expected behavior

I expect the code in the ARES repo to follow the description in the paper, allowing users to evaluate real RAG configurations instead of working only with demo datasets that incorporate additional columns. These additional columns are not described as required in the paper, nor does it seem like good practice to force users to add them to their datasets and fill them with random fake data.
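For clarity, this is the kind of file a real RAG system actually produces and that ARES should be able to score directly (the data and filename here are just hypothetical examples):

import pandas as pd

# In-domain query-document-answer triples, exactly as described in section 3.3
real_rag_output = pd.DataFrame({
    "Query": ["What is the capital of France?"],
    "Document": ["Paris is the capital and largest city of France."],
    "Answer": ["Paris"],
})
real_rag_output.to_csv("my_rag_system_triples.tsv", sep="\t", index=False)

# Evaluating this file should not require adding any *_Label columns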