stanford-futuredata / ARES

Automated Evaluation of RAG Systems
https://ares-ai.vercel.app/
Apache License 2.0

Iteration over labels and datasets not working in PPI #44

Closed WJ44 closed 5 months ago

WJ44 commented 5 months ago

For evaluating RAG systems, the PPI config allows specifying multiple evaluation datasets and labels. These labels and datasets are iterated over in the rag_scoring_config method; however, there is a return statement inside the loop, so only the first combination is actually evaluated.

Could you look into this? I could also open a PR to fix it if you let me know what the expected return value should be in this case.
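For illustration, the problematic shape is roughly the following. This is a simplified sketch with placeholder names (evaluate_combination stands in for the per-combination scoring), not the actual ARES source:

# Simplified sketch of the reported bug pattern, not the real rag_scoring_config.
def rag_scoring_config(evaluation_datasets, labels, **kwargs):
    for dataset in evaluation_datasets:
        for label in labels:
            result = evaluate_combination(dataset, label)  # hypothetical helper
            return result  # returns after the first (dataset, label) pair only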

robbym-dev commented 5 months ago

Hi @WJ44

The issue with the return statement in the loop within the rag_scoring_config method has been resolved, allowing all combinations of datasets and labels to be correctly evaluated.

elsatch commented 5 months ago

@WJ44, could you confirm whether this issue has been solved on your side?

I have created an environment to test the results using the ares-ai 0.60.0 library, and I am getting very strange results in the output. Every single dataset is different, yet the evaluation I get is the same for every model, as if the variable were being overwritten.

from ares import ARES

eval_datasets = ['/mnt/data/dataset1.tsv', '/mnt/data/work/dataset2.tsv', '/mnt/data/work/dataset3.tsv', '/mnt/data/work/dataset4.tsv']

ppi_config = {
    "evaluation_datasets": eval_datasets,
    "few_shot_examples_filepath": "data/interim/few_shot_prompt_filename_customized_pytorch_v2.tsv",
    "checkpoints": [
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Context_Relevance_Label_few_shot_prompt_filename_customized_v2_545281.pt",
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Answer_Faithfulness_Label_few_shot_prompt_filename_customized_v2_568298.pt",
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Answer_Relevance_Label_few_shot_prompt_filename_customized_v2_428380.pt",
    ],
    "labels": [
        #"Context_Relevance_Label",
        "Answer_Faithfulness_Label",
        #"Answer_Relevance_Label",
    ],
    "model_choice": "microsoft/mdeberta-v3-base",
    "GPT_scoring": False,
    # This file had to be modified manually to change the column names
    "gold_label_path": "data/interim/gold_queries_pytorch.tsv",
    "swap_human_labels_for_gpt4_labels": False,
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)

In the output I am getting:

[{'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300}]

Additionally, I am not getting the ARES ranking in the output.
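As a quick sanity check that the four result dicts are byte-identical rather than just similar, I deduplicated their serialized form (this is just a check on the output above, not part of ARES):

import json

# results as returned by ares.evaluate_RAG() above
unique = {json.dumps(r, sort_keys=True) for r in results}
print(len(unique))  # prints 1 here, even though four different datasets were evaluated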

Another thing related to the labels and datasets: if I add more than one label, the run fails without producing any output at all. For example, if I uncomment Context_Relevance_Label in the config above and run the evaluation again with labels ['Context_Relevance_Label', 'Answer_Faithfulness_Label'], the response is:

Loaded model from checkpoint: notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Context_Relevance_Label_few_shot_prompt_filename_customized_v2_545281.pt
Traceback (most recent call last):                                                                                                                                                                                                                                                                                                                          
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Answer_Faithfulness_Label'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data/work/external/pip-ARES-0_60_env/notebooks/1-10-cgs-ARES-evaluate-dataset_v2.py", line 132, in <module>
    results = ares.evaluate_RAG()
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/ares.py", line 144, in evaluate_RAG
    return rag_scoring_config(**self.ppi_config)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/rag_scoring.py", line 130, in rag_scoring_config
    test_set, Y_labeled_dataset, Y_labeled_dataloader, Y_labeled_predictions, Yhat_unlabeled_dataset, prediction_column = post_process_predictions(post_process_settings)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py", line 1042, in post_process_predictions
    test_set = test_set[test_set[label] != 0]
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'Answer_Faithfulness_Label'

Note that the 'Answer_Faithfulness_Label' column exists, as it was used for the first evaluation.
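Judging from the traceback, the post-processing step filters rows where the label column is 0; a defensive variant would verify the column exists before filtering. This is only a sketch of that idea around the line shown in the traceback, not the ARES implementation:

# test_set is a pandas DataFrame; label is the label currently being processed.
if label not in test_set.columns:
    raise ValueError(
        f"Expected column '{label}' in the test set; found: {list(test_set.columns)}"
    )
test_set = test_set[test_set[label] != 0]  # line from the traceback above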

I have searched the updated repo documentation and README, and all the multi-label, multi-dataset examples have disappeared. Every single example uses just one label and one dataset.

WJ44 commented 5 months ago

As far as I can tell, the code in the main branch here still has a return statement in the innermost for loop, which would mean only the first combination is evaluated, but perhaps I am missing something.
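For reference, the shape I would expect is to accumulate one result per combination and return only after the loops finish, roughly like this (again just a sketch with placeholder names, not a proposed patch):

def rag_scoring_config(evaluation_datasets, labels, **kwargs):
    results = []
    for dataset in evaluation_datasets:
        for label in labels:
            # evaluate_combination is a placeholder for the per-combination scoring
            results.append(evaluate_combination(dataset, label))
    return results  # one entry per (dataset, label) combination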

robbym-dev commented 5 months ago

Issue resolved in PR #51