Using the provided code at tests/test_ARES_multiple_datasets.py along with the proposed code updates, I've been able to get different ARES rankings for different datasets, as described in #50.
These are the results:
[ [ { "Label_Column": "Context_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.55.tsv", "ARES_Prediction": 0.4998818162969061, "ARES_Confidence_Interval": [0.448, 0.552], "Number_of_Examples_in_Evaluation_Set": 4823, "Ground_Truth_Performance": 0.55, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.781, "Annotated_Examples_used_for_PPI": 300, }, { "Label_Column": "Context_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.65.tsv", "ARES_Prediction": 0.5554300416564588, "ARES_Confidence_Interval": [0.503, 0.608], "Number_of_Examples_in_Evaluation_Set": 4081, "Ground_Truth_Performance": 0.65, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.792, "Annotated_Examples_used_for_PPI": 300, }, { "Label_Column": "Context_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.7.tsv", "ARES_Prediction": 0.5838786279683288, "ARES_Confidence_Interval": [0.532, 0.636], "Number_of_Examples_in_Evaluation_Set": 3790, "Ground_Truth_Performance": 0.7, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.798, "Annotated_Examples_used_for_PPI": 300, }, ], [ { "Label_Column": "Answer_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.55.tsv", "ARES_Prediction": 0.5231259935033495, "ARES_Confidence_Interval": [0.467, 0.58], "Number_of_Examples_in_Evaluation_Set": 4823, "Ground_Truth_Performance": 0.55, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.55, "Annotated_Examples_used_for_PPI": 300, }, { "Label_Column": "Answer_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.65.tsv", "ARES_Prediction": 0.5230882953524442, "ARES_Confidence_Interval": [0.467, 0.58], "Number_of_Examples_in_Evaluation_Set": 4081, "Ground_Truth_Performance": 0.65, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.65, "Annotated_Examples_used_for_PPI": 300, }, { "Label_Column": "Answer_Relevance_Label", "Evaluation_Set": "datasets/eval_datasets/nq/nq_ratio_0.7.tsv", "ARES_Prediction": 0.523069481090596, "ARES_Confidence_Interval": [0.467, 0.58], "Number_of_Examples_in_Evaluation_Set": 3790, "Ground_Truth_Performance": 0.7, "ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels": 0.7, "Annotated_Examples_used_for_PPI": 300, }, ], ]