stanford-futuredata / ARES


Reproducing results #65

Closed WJ44 closed 1 month ago

WJ44 commented 1 month ago

Hello,

I am trying to reproduce the results in the examples, and eventually those in the paper. However, when running the PPI step from the Colab notebook, I get curious results.

I run the following:

from ares import ARES

ppi_config = {
    # Unlabeled evaluation set to be scored by the fine-tuned LLM judges
    "evaluation_datasets": ["datasets/example_files/nq_unlabeled_output.tsv"],
    # One judge checkpoint per label, in the same order as "labels"
    "checkpoints": ["Context_Relevance_Label_joint_trained_date_time.pt", "Answer_Relevance_Label_joint_trained_date_time.pt"],
    # Scoring criteria to evaluate
    "labels": ["Context_Relevance_Label", "Answer_Relevance_Label"],
    # Human-annotated set used for the PPI confidence intervals
    "gold_label_paths": ["datasets/example_files/nq_labeled_output.tsv"],
}

ares_module = ARES(ppi=ppi_config)
results = ares_module.evaluate_RAG()
print(results)

However, for context relevance I get this:

Context_Relevance_Label Scoring
ARES Ranking
Evaluation_Set:datasets/example_files/nq_unlabeled_output.tsv
Checkpoint:Context_Relevance_Label_joint_trained_date_time.pt
ARES Prediction: [0.595499509914802]
ARES Confidence Interval: [[0.541, 0.65]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.629]
Annotated Examples used for PPI: 300

These are exactly the numbers the notebook reports for answer relevance.

For answer relevance I get:

Answer_Relevance_Label Scoring
ARES Ranking
Evaluation_Set:datasets/example_files/nq_unlabeled_output.tsv
Checkpoint:Answer_Relevance_Label_joint_trained_date_time.pt
ARES Prediction: [0.5955191133227766]
ARES Confidence Interval: [[0.577, 0.614]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.977]
Annotated Examples used for PPI: 300

Any idea why this might be happening?

Full results output:

[[{'Label_Column': 'Context_Relevance_Label', 'Evaluation_Set': 'datasets/example_files/nq_unlabeled_output.tsv', 'ARES_Prediction': 0.595499509914802, 'ARES_Confidence_Interval': [0.541, 0.65], 'Number_of_Examples_in_Evaluation_Set': 4421, 'Ground_Truth_Performance': 0.6, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.629, 'Annotated_Examples_used_for_PPI': 300}], [{'Label_Column': 'Answer_Relevance_Label', 'Evaluation_Set': 'datasets/example_files/nq_unlabeled_output.tsv', 'ARES_Prediction': 0.5955191133227766, 'ARES_Confidence_Interval': [0.577, 0.614], 'Number_of_Examples_in_Evaluation_Set': 4421, 'Ground_Truth_Performance': 0.6, 'ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels': 0.977, 'Annotated_Examples_used_for_PPI': 300}]]
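
For reference, evaluate_RAG() returns one list of result dictionaries per label, so the output above can be summarized with a small loop. The loop below is only a sketch using the keys visible in the dump, not part of the ARES API:

# Flatten the per-label result lists and print a one-line summary for each label.
for label_results in results:
    for result in label_results:
        print(
            f"{result['Label_Column']}: "
            f"prediction={result['ARES_Prediction']:.3f}, "
            f"CI={result['ARES_Confidence_Interval']}, "
            f"judge_accuracy={result['ARES_LLM_Judge_Accuracy_on_Ground_Truth_Labels']:.3f}"
        )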

robbym-dev commented 1 month ago

Hi @WJ44,

Thank you for your inquiry! To help debug this, could you share the exact links you used to download the context relevance and answer relevance checkpoints?

WJ44 commented 1 month ago

Thanks for your reply!

That would be the following:

robbym-dev commented 1 month ago

Hi @WJ44 ,

Thank you for your patience!

We've identified and fixed the problem with the context relevance checkpoint in the Colab notebook. The notebook has been updated to the latest version, and the context relevance checkpoint link has been replaced. Please use the new link below for the context relevance checkpoint; it should resolve the issue you were seeing:

Context Relevance: https://drive.google.com/file/d/1yg1q6WrCwq7q07YceZUsd7FLVuLNJEue/view?usp=sharing
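
If you are running outside the Colab notebook, one way to fetch the new checkpoint and re-run the PPI step is sketched below; the gdown call and the local filename are illustrative assumptions, not part of the ARES API:

import gdown  # assumption: gdown is installed (pip install gdown)

# Download the updated context relevance checkpoint; the file id comes from the share link above,
# and the output filename here is just a placeholder you can change.
gdown.download(id="1yg1q6WrCwq7q07YceZUsd7FLVuLNJEue",
               output="Context_Relevance_Label_joint_trained_date_time.pt")

# Point the existing PPI config at the freshly downloaded checkpoint and re-run the evaluation.
ppi_config["checkpoints"][0] = "Context_Relevance_Label_joint_trained_date_time.pt"
results = ARES(ppi=ppi_config).evaluate_RAG()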

You should now be able to reproduce the following results:

Context_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.6056978059262574]
ARES Confidence Interval: [[0.547, 0.664]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.789]
Annotated Examples used for PPI: 300

Answer_Relevance_Label Scoring
ARES Ranking
ARES Prediction: [0.5955191133227766]
ARES Confidence Interval: [[0.577, 0.614]]
Number of Examples in Evaluation Set: [4421]
Ground Truth Performance: [0.6]
ARES LLM Judge Accuracy on Ground Truth Labels: [0.977]
Annotated Examples used for PPI: 300

Please let us know if you face any other issues. Thank you for using ARES!