stanford-futuredata / ARES

https://ares-ai.vercel.app/
Apache License 2.0

Unclear about the code of calculate_ppi function #64

Open kitkhai opened 2 months ago

kitkhai commented 2 months ago

Hi

It is unclear why the code creates 20 evenly spaced values with ns = np.linspace(0, n_max, 20).astype(int) to set the number of labelled samples used to calculate the PPI, when in the end only the PPI value calculated from n_max is retained, as seen in avg_ci = ci.mean(axis=0)[-1].

I don't understand the purpose of conducting multiple trials. Since only the PPI value calculated from n_max is retained, shouldn't it be constant across trials?

https://github.com/stanford-futuredata/ARES/blob/2684d477878e515c3dc31cf4b91fb848a84bdb90/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py#L238-L281

Also, I saw that there are other functions in the original PPI repository (PPI bootstrap, cross PPI, etc.). What were your considerations when choosing which PPI function to use?

WJ44 commented 1 week ago

Correct me if I am wrong, but I think this code resulted from a misinterpretation of the code examples from the PPI paper. Those examples run experiments to show how PPI compares to classical inference, using multiple trials with labelled samples of different sizes drawn from different parts of the dataset. In practice, one would simply use the complete labelled dataset.
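
The point raised in this thread can be checked with a small standalone sketch. It is not the ARES implementation: `ppi_mean_ci` here is a hand-written textbook PPI interval for a mean (normal approximation), `make_data` and all the data are synthetic and hypothetical, and the sketch assumes each trial takes the first n indices of a random permutation of the labelled set. It uses only the standard library, so the sizes in `ns` are built with a list comprehension rather than np.linspace, and they start at 2 because a sample variance needs at least two points.

```python
import random
from statistics import NormalDist, fmean, variance


def ppi_mean_ci(y, yhat, yhat_unlabeled, alpha=0.05):
    """Textbook PPI confidence interval for a mean (normal approximation)."""
    rectifier = [a - b for a, b in zip(y, yhat)]  # judge error on labelled data
    theta = fmean(yhat_unlabeled) + fmean(rectifier)
    se = (variance(yhat_unlabeled) / len(yhat_unlabeled)
          + variance(rectifier) / len(rectifier)) ** 0.5
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return theta - z * se, theta + z * se


def make_data(rng, size, p_true=0.6, p_flip=0.2):
    """Synthetic 0/1 labels plus noisy judge predictions (illustrative only)."""
    y = [1.0 if rng.random() < p_true else 0.0 for _ in range(size)]
    yhat = [1.0 - v if rng.random() < p_flip else v for v in y]
    return y, yhat


rng = random.Random(0)
y_lab, yhat_lab = make_data(rng, 200)   # labelled set
_, yhat_unlab = make_data(rng, 2000)    # unlabelled set: predictions only

n_max = len(y_lab)
# 20 evenly spaced subset sizes from 2 up to n_max
ns = [2 + i * (n_max - 2) // 19 for i in range(20)]

last_cis = []
for _ in range(10):  # trials
    idx = list(range(n_max))
    rng.shuffle(idx)  # assumed sampling scheme: random permutation per trial
    cis = []
    for n in ns:
        sub = idx[:n]
        cis.append(ppi_mean_ci([y_lab[i] for i in sub],
                               [yhat_lab[i] for i in sub],
                               yhat_unlab))
    last_cis.append(cis[-1])  # only the n_max entry is kept

# A single call on the complete labelled dataset
full_ci = ppi_mean_ci(y_lab, yhat_lab, yhat_unlab)
print(full_ci)
print(all(abs(lo - full_ci[0]) < 1e-12 and abs(hi - full_ci[1]) < 1e-12
          for lo, hi in last_cis))
```

Under these assumptions, the subset of size n_max is always the whole labelled set in a different order, so the retained interval does not vary across trials and matches the single call on the full data. The sweep over smaller sizes only matters if the intermediate intervals are actually reported, for example to plot interval width against the number of labels.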