The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and the RAG pattern.
Implement appropriate methods that can estimate confidence intervals for metrics leveraging recent methods for combining AI-based metrics with human annotations #596
In order to evaluate and compare the performance of different RAG LLM configurations and parameter choices, appropriate metrics are needed, along with a framework for quantifying the statistical uncertainty associated with them. For example, if we run an experiment in which Reranking method A produces an average context_relevance score of 0.7 and Reranking method B produces an average context_relevance score of 0.8, but the 95% confidence intervals associated with these estimates are wide (say ±0.3), we should not conclude that one method is better than the other; we should instead collect more evidence until the comparison becomes statistically significant.
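As a rough illustration of the kind of check the framework should support, here is a minimal sketch (assuming only numpy, and that per-question context_relevance scores for each reranking method are already available; the arrays below are hypothetical) that bootstraps 95% confidence intervals for the two means and for their difference. If the interval for the difference contains 0, the observed gap is not statistically significant.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_mean_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for the mean of a 1-D array of metric scores."""
    scores = np.asarray(scores, dtype=float)
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    boot_means = scores[idx].mean(axis=1)
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

# Hypothetical per-question scores; in practice these come from an evaluation run.
scores_a = rng.normal(0.7, 0.3, size=50).clip(0, 1)
scores_b = rng.normal(0.8, 0.3, size=50).clip(0, 1)

ci_a = bootstrap_mean_ci(scores_a)
ci_b = bootstrap_mean_ci(scores_b)

# Bootstrap CI for the difference in means (B - A); if it contains 0,
# the comparison is not statistically significant at the 95% level.
diffs = np.array([
    rng.choice(scores_b, len(scores_b)).mean() - rng.choice(scores_a, len(scores_a)).mean()
    for _ in range(10_000)
])
ci_diff = np.quantile(diffs, [0.025, 0.975])
print(f"A: {ci_a}, B: {ci_b}, B - A: {ci_diff}")
```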
There are different approaches that can be implemented for computing this uncertainty, each with different properties:
Confidence interval computation for metrics derived from AI models: Several metrics are computed using an LLM, an embedding model, or a combination of the two. For example, context relevance and answer relevance rely on an LLM to compute the metric scores. Since LLMs and AI models in general may be biased (in the statistical sense), this bias carries over into the metric computations. On the other hand, since AI models scale easily and can be applied to a large collection of data (say, question-context pairs for the context relevance metric), the resulting confidence intervals will usually have low variance.
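For this purely AI-scored case, even a simple normal-approximation interval shows the low-variance side of the trade-off: the width shrinks like 1/sqrt(n), but no term corrects for any systematic bias of the LLM judge. A minimal sketch, assuming a numpy array of LLM-produced scores:

```python
import numpy as np

def normal_ci(scores, z=1.96):
    """95% normal-approximation CI for the mean metric score.

    Narrow when n is large (standard error ~ s / sqrt(n)), but it does not
    correct for any systematic bias of the LLM-based scorer.
    """
    scores = np.asarray(scores, dtype=float)
    mean = scores.mean()
    se = scores.std(ddof=1) / np.sqrt(len(scores))
    return mean - z * se, mean + z * se
```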
Confidence interval computation for metrics derived from direct human feedback: Certain metrics can also be computed from direct human feedback. For example, a simple thumbs-up/down may (depending on the instructions given to human evaluators) be used directly to compute answer relevance. With direct human input, the confidence interval estimation can have low bias, but because the amount of data end users can realistically evaluate is small, these confidence intervals will have high variance.
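For binary thumbs-up/down feedback, an interval for a proportion is the natural choice; with only a handful of human judgements it will be wide, reflecting the high variance of this source. A sketch using the Wilson score interval (any standard proportion interval would do):

```python
import math

def wilson_ci(n_up, n_total, z=1.96):
    """Wilson score interval for the thumbs-up rate; stays inside [0, 1]
    and behaves reasonably for the small samples typical of human feedback."""
    if n_total == 0:
        return 0.0, 1.0
    p = n_up / n_total
    denom = 1 + z**2 / n_total
    centre = (p + z**2 / (2 * n_total)) / denom
    half = z * math.sqrt(p * (1 - p) / n_total + z**2 / (4 * n_total**2)) / denom
    return centre - half, centre + half

# e.g. 8 thumbs-up out of 10 rated answers -> roughly (0.49, 0.94): too wide to
# distinguish a mediocre configuration from a good one.
print(wilson_ci(8, 10))
```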
Confidence interval computation that combines AI-model output with direct human feedback: There are methods, such as prediction-powered inference (https://www.science.org/doi/10.1126/science.adi6000), that combine AI-based computation of metrics with direct human feedback, taking the best of both worlds: the low variance of AI-based metric computation and the low bias of direct human feedback.
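A minimal sketch of the prediction-powered mean estimator from the paper linked above, assuming LLM-based scores are available for a large unlabeled set and human scores for a small labeled subset (the variable names are illustrative, not part of the accelerator's API):

```python
import numpy as np

def ppi_mean_ci(ai_scores_unlabeled, ai_scores_labeled, human_scores_labeled, z=1.96):
    """Prediction-powered estimate of the mean metric with a 95% CI.

    The AI scores on the large unlabeled set provide low variance; the
    rectifier (human - AI) on the small labeled set removes the AI bias.
    """
    f_unlab = np.asarray(ai_scores_unlabeled, dtype=float)
    f_lab = np.asarray(ai_scores_labeled, dtype=float)
    y_lab = np.asarray(human_scores_labeled, dtype=float)

    rectifier = y_lab - f_lab                   # human correction of the AI scores
    theta = f_unlab.mean() + rectifier.mean()   # prediction-powered point estimate
    var = f_unlab.var(ddof=1) / len(f_unlab) + rectifier.var(ddof=1) / len(rectifier)
    half = z * np.sqrt(var)
    return theta - half, theta + half
```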
This ticket covers the implementation of uncertainty estimation methods (such as confidence intervals) for all three cases: AI-model based, direct human feedback, and a combination of both.