Hi @WJ44,
Thank you for your inquiry! To help us debug this further, could you share the exact links you used to download the context relevance and answer relevance checkpoints?
Thanks for your reply!
That would be the following:
Hi @WJ44,
Thank you for your patience!
We've identified and fixed the problem with the context relevance checkpoint. The Colab notebook has been updated to the latest version, and the context relevance checkpoint link has been updated along with it. Please use the new link below; it should resolve the issues you were facing:
Context Relevance: https://drive.google.com/file/d/1yg1q6WrCwq7q07YceZUsd7FLVuLNJEue/view?usp=sharing
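If it helps, one way to pull the updated checkpoint into the Colab environment is with gdown; the file ID below is taken from the share link above, and the local output path is just an example:

```python
import os

import gdown

# Local directory for checkpoints (arbitrary choice of path).
os.makedirs("checkpoints", exist_ok=True)

# File ID taken from the Google Drive share link above.
file_id = "1yg1q6WrCwq7q07YceZUsd7FLVuLNJEue"
gdown.download(
    f"https://drive.google.com/uc?id={file_id}",
    "checkpoints/context_relevance_checkpoint.pt",
    quiet=False,
)
```

The downloaded file can then be referenced in the checkpoints field of your PPI config.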
You should now be able to reproduce the following results:
Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300
Answer_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.5955191133227766]
ARES Confidence Interval: [[0.577, 0.614]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.977]
Annotated Examples used for PPI: 300
Please let us know if you face any other issues. Thank you for using ARES!
Hello,
I am trying to reproduce the results in the examples, and eventually in the paper. However, when running the PPI step from the Colab notebook I get curious results.
I run the following:
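In outline, this is the PPI step following the config pattern from the ARES README; a minimal sketch, where the checkpoint paths are placeholders for the downloaded checkpoints and gold_label_path is an assumed location for the labeled examples:

```python
from ares import ARES

# PPI config following the pattern in the ARES README.
# The two checkpoint paths are placeholders for the context relevance and
# answer relevance checkpoints; gold_label_path is an assumed location for
# the 300 annotated examples used by PPI.
ppi_config = {
    "evaluation_datasets": ["datasets/example_files/nq_unlabeled_output.tsv"],
    "checkpoints": [
        "checkpoints/context_relevance_checkpoint.pt",
        "checkpoints/answer_relevance_checkpoint.pt",
    ],
    "rag_type": "question_answering",
    "labels": ["Context_Relevance_Label", "Answer_Relevance_Label"],
    "gold_label_path": "datasets/example_files/nq_labeled_output.tsv",
}

ares = ARES(ppi=ppi_config)
results = ares.evaluate_RAG()
print(results)
```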
However, for context relevance I get the first result block in the full output below, which contains exactly the numbers for answer relevance in the notebook. For answer relevance I get the second block. Any idea why this might be happening?
Full results output:
[
  [
    {
      'Label_Column': 'Context_Relevance_Label',
      'Evaluation_Set': 'datasets/example_files/nq_unlabeled_output.tsv',
      'ARES_Prediction': 0.595499509914802,
      'ARES_Confidence_Interval': [0.541, 0.65],
      'Number_of_Examples_in_Evaluation_Set': 4421,
      'Ground_Truth_Performance': 0.6,
      'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.629,
      'Annotated_Examples_used_for_PPI': 300
    }
  ],
  [
    {
      'Label_Column': 'Answer_Relevance_Label',
      'Evaluation_Set': 'datasets/example_files/nq_unlabeled_output.tsv',
      'ARES_Prediction': 0.5955191133227766,
      'ARES_Confidence_Interval': [0.577, 0.614],
      'Number_of_Examples_in_Evaluation_Set': 4421,
      'Ground_Truth_Performance': 0.6,
      'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.977,
      'Annotated_Examples_used_for_PPI': 300
    }
  ]
]