Using the provided code at tests/test_ARES_multiple_datasets.py along with the proposed code updates, I've been able to get different ARES rankings for different datasets, as described in #50.
These are the results:
[ [ { "Label_Column": "Context_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.55.tsv", "ARES_Prediction": 0.4998818162969061, "ARES_Confidence_Interval": [0.448, 0.552], "Number_of_Examples_in_Evaluation_Set": 4823, "Ground_Truth_Performance": 0.55, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.781, "Annotated_Examples_used_for_PPI": 300, }, { "Label_Column": "Context_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.65.tsv", "ARES_Prediction": 0.5554300416564588, "ARES_Confidence_Interval": [0.503, 0.608], "Number_of_Examples_in_Evaluation_Set": 4081, "Ground_Truth_Performance": 0.65, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792, "Annotated_Examples_used_for_PPI": 300, }, { "Label_Column": "Context_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.7.tsv", "ARES_Prediction": 0.5838786279683288, "ARES_Confidence_Interval": [0.532, 0.636], "Number_of_Examples_in_Evaluation_Set": 3790, "Ground_Truth_Performance": 0.7, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.798, "Annotated_Examples_used_for_PPI": 300, }, ], [ { "Label_Column": "Answer_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.55.tsv", "ARES_Prediction": 0.5231259935033495, "ARES_Confidence_Interval": [0.467, 0.58], "Number_of_Examples_in_Evaluation_Set": 4823, "Ground_Truth_Performance": 0.55, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.55, "Annotated_Examples_used_for_PPI": 300, }, { "Label_Column": "Answer_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.65.tsv", "ARES_Prediction": 0.5230882953524442, "ARES_Confidence_Interval": [0.467, 0.58], "Number_of_Examples_in_Evaluation_Set": 4081, "Ground_Truth_Performance": 0.65, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65, "Annotated_Examples_used_for_PPI": 300, }, { "Label_Column": "Answer_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.7.tsv", "ARES_Prediction": 0.523069481090596, "ARES_Confidence_Interval": [0.467, 0.58], "Number_of_Examples_in_Evaluation_Set": 3790, "Ground_Truth_Performance": 0.7, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.7, "Annotated_Examples_used_for_PPI": 300, }, ], ]