Closed WJ44 closed 5 months ago
Hi @WJ44
The issue with the return statement in the loop within the rag_scoring_config method has been resolved, allowing all combinations of datasets and labels to be correctly evaluated.
Could you confirm, @WJ44, whether this issue is now resolved on your side?
I have set up an environment to test this, using the ares-ai 0.60.0 library, and I am getting very strange results in the output. Every single dataset is different, yet the evaluation I get is the same for every model, as if the variable were being overwritten.
```python
from ares import ARES

eval_datasets = [
    "/mnt/data/dataset1.tsv",
    "/mnt/data/work/dataset2.tsv",
    "/mnt/data/work/dataset3.tsv",
    "/mnt/data/work/dataset4.tsv",
]

ppi_config = {
    "evaluation_datasets": eval_datasets,
    "few_shot_examples_filepath": "data/interim/few_shot_prompt_filename_customized_pytorch_v2.tsv",
    "checkpoints": [
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Context_Relevance_Label_few_shot_prompt_filename_customized_v2_545281.pt",
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Answer_Faithfulness_Label_few_shot_prompt_filename_customized_v2_568298.pt",
        "notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Answer_Relevance_Label_few_shot_prompt_filename_customized_v2_428380.pt",
    ],
    "labels": [
        # "Context_Relevance_Label",
        "Answer_Faithfulness_Label",
        # "Answer_Relevance_Label",
    ],
    "model_choice": "microsoft/mdeberta-v3-base",
    "GPT_scoring": False,
    # This file had to be modified manually to change the column names
    "gold_label_path": "data/interim/gold_queries_pytorch.tsv",
    "swap_human_labels_for_gpt4_labels": False,
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
```
In the output I am getting:
```
[{'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300},
 {'ARES_Prediction': 0.6800000000000054, 'ARES_Confidence_Interval': [0.627, 0.733], 'Number_of_Examples_in_Evaluation_Set': 100, 'Ground_Truth_Performance': None, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': None, 'Annotated_Examples_used_for_PPI': 300}]
```
Additionally, I am not getting the ARES ranking in the output.
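As a sanity check, one workaround would be to evaluate each file in its own ARES call, so that nothing can be overwritten between datasets; a minimal sketch reusing the config above:

```python
# Workaround sketch: one evaluation per dataset, reusing the config above.
for dataset in eval_datasets:
    single_config = dict(ppi_config, evaluation_datasets=[dataset])
    single_results = ARES(ppi=single_config).evaluate_RAG()
    print(dataset, single_results)
```

If the per-dataset scores differ when run this way, that would confirm the results are being overwritten when all four datasets are passed at once.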
Another issue related to the labels and datasets: if I specify more than one label, the run fails without producing any output at all. For example, if I uncomment Context_Relevance_Label in the config above and run the evaluation again with labels: ['Context_Relevance_Label', 'Answer_Faithfulness_Label'], the response is:
```
Loaded model from checkpoint: notebooks/checkpoints/microsoft-mdeberta-v3-base/5e-06_1_True_Context_Relevance_Label_few_shot_prompt_filename_customized_v2_545281.pt
Traceback (most recent call last):
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3805, in get_loc
    return self._engine.get_loc(casted_key)
  File "index.pyx", line 167, in pandas._libs.index.IndexEngine.get_loc
  File "index.pyx", line 196, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 7081, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 7089, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'Answer_Faithfulness_Label'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/data/work/external/pip-ARES-0_60_env/notebooks/1-10-cgs-ARES-evaluate-dataset_v2.py", line 132, in <module>
    results = ares.evaluate_RAG()
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/ares.py", line 144, in evaluate_RAG
    return rag_scoring_config(**self.ppi_config)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/rag_scoring.py", line 130, in rag_scoring_config
    test_set, Y_labeled_dataset, Y_labeled_dataloader, Y_labeled_predictions, Yhat_unlabeled_dataset, prediction_column = post_process_predictions(post_process_settings)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py", line 1042, in post_process_predictions
    test_set = test_set[test_set[label] != 0]
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/frame.py", line 4102, in __getitem__
    indexer = self.columns.get_loc(key)
  File "/mnt/data/work/external/pip-ARES-0_60_env/.venv/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 3812, in get_loc
    raise KeyError(key) from err
KeyError: 'Answer_Faithfulness_Label'
```
Note that the 'Answer_Faithfulness_Label' column exists, as it was used for the first evaluation.
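To double-check that this is not a data problem on my side, a quick way to see which label columns each evaluation TSV actually contains (a throwaway snippet, assuming the files are tab-separated):

```python
import pandas as pd

label_columns = [
    "Context_Relevance_Label",
    "Answer_Faithfulness_Label",
    "Answer_Relevance_Label",
]

# Print which of the three label columns each evaluation file contains.
for path in eval_datasets:
    columns = pd.read_csv(path, sep="\t", nrows=0).columns
    print(path, "->", [c for c in label_columns if c in columns])
```

This would at least confirm whether the column is present in the raw files or gets dropped somewhere inside the pipeline.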
I have also searched the updated repo documentation and README, and all the multi-label, multi-dataset examples have disappeared; every single example now uses just one label and one dataset.
As far as I can tell, the code in the main branch here still has a return statement inside the innermost for loop, which looks like it would only evaluate the first combination, but perhaps I am missing something.
Issue resolved in PR #51
For evaluating RAG systems, the PPI config allows specifying multiple datasets and labels. These labels and datasets are iterated over in the rag_scoring_config method; however, there is a return statement inside the loop, so only the first combination is actually evaluated.
Let me know if you can look into this. I could also make a PR to fix it if you let me know what the expected return value should be in this case.
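For reference, what I had in mind is roughly the sketch below: collect one result dict per (dataset, label) combination and return the list once the loops finish, instead of returning from inside the innermost loop. The names are simplified and evaluate_combination is only a placeholder for the existing per-combination scoring code, so this is just an illustration of the intended return value:

```python
def rag_scoring_config(evaluation_datasets, labels, **settings):
    # Sketch only: gather one result dict per (dataset, label) pair
    # instead of returning after the first combination.
    all_results = []
    for evaluation_dataset in evaluation_datasets:
        for label in labels:
            all_results.append(evaluate_combination(evaluation_dataset, label, **settings))
    return all_results
```

That would keep the output format shown above (a list of result dicts), just with one entry per combination.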