Open kitkhai opened 2 months ago
Correct me if I am wrong, but I think this code resulted from a misinterpretation of the code examples in the PPI paper. There, the experiments compare PPI against classical inference by running multiple trials with labelled samples drawn from different parts of the dataset and of different sizes. In practice, one would simply use the complete labelled dataset once.
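For what it's worth, a single PPI call on the full labelled set is straightforward. Here is a minimal sketch of the standard PPI mean estimator in plain numpy (my own re-implementation for illustration, not the ARES code; the helper name `ppi_mean_ci` and the toy data are mine):

```python
import numpy as np

def ppi_mean_ci(y, yhat, yhat_unlabeled, z=1.96):
    """PPI confidence interval for a mean, using all labelled data at once.

    y              : gold labels on the small labelled set (size n)
    yhat           : model predictions on that same labelled set (size n)
    yhat_unlabeled : model predictions on the large unlabelled set (size N)
    z              : normal quantile (1.96 ~ 95% interval)
    """
    n, N = len(y), len(yhat_unlabeled)
    rectifier = y - yhat                              # corrects the model's bias
    theta = yhat_unlabeled.mean() + rectifier.mean()  # PPI point estimate
    # variance combines the unlabelled-prediction spread and the rectifier spread
    se = np.sqrt(yhat_unlabeled.var(ddof=1) / N + rectifier.var(ddof=1) / n)
    return theta - z * se, theta + z * se

# toy data: 300 labelled examples, 5000 unlabelled judge scores
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=300).astype(float)         # gold labels
yhat = np.clip(y + rng.normal(0, 0.2, size=300), 0, 1)   # noisy judge on labelled
yhat_unl = rng.normal(0.7, 0.25, size=5000)              # judge on unlabelled
lo, hi = ppi_mean_ci(y, yhat, yhat_unl)
```

No subsampling or trial loop is needed for this use: one call with the complete labelled set yields one interval.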
Hi
It is unclear to me why the code creates 20 evenly spaced values

`ns = np.linspace(0, n_max, 20).astype(int)`

to vary the number of labelled data points used to compute PPI, when in the end only the PPI value computed from `n_max` is retained, as seen from `avg_ci = ci.mean(axis=0)[-1]`.

I also don't understand the purpose of conducting multiple trials. Since only the PPI value computed from `n_max` is retained, shouldn't it be constant across trials?

https://github.com/stanford-futuredata/ARES/blob/2684d477878e515c3dc31cf4b91fb848a84bdb90/ares/RAG_Automatic_Evaluation/LLMJudge_RAG_Compared_Scoring.py#L238-L281
Also, I saw that there are other functions in the original PPI repository (PPI bootstrap, cross-PPI, etc.). What were your considerations when choosing which PPI function to use?