microsoft / rag-experiment-accelerator

The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and the RAG pattern.
https://github.com/microsoft/rag-experiment-accelerator

Implement appropriate methods that can estimate confidence intervals for metrics leveraging recent methods for combining AI-based metrics with human annotations #596

Closed dmavroeid closed 5 months ago

dmavroeid commented 5 months ago

In order to evaluate and compare the performance of different RAG LLM configurations/parameter choices, the appropriate metrics need to be used, together with a framework for quantifying the statistical uncertainty associated with them. For example, if we run an experiment in which Reranking method A produces an average context_relevance score of 0.7 and Reranking method B produces an average context_relevance score of 0.8, but the 95% confidence intervals associated with these estimates are large (say ±0.3), we should not conclude that one method is better than the other; rather, we should collect more evidence until the comparison becomes statistically significant.
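As an illustration, a percentile bootstrap is one simple way to obtain such intervals from per-question metric scores. The sketch below is a minimal Python example; the score arrays, function name, and sample values are hypothetical and not part of the accelerator's code.

```python
# Minimal sketch: percentile bootstrap CI for the mean of a metric.
# The data and names here are illustrative, not taken from the accelerator.
import numpy as np

def bootstrap_ci(scores: np.ndarray, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float, float]:
    """Return (mean, lower, upper) for a (1 - alpha) percentile bootstrap CI."""
    rng = np.random.default_rng(seed)
    boot_means = np.array([
        rng.choice(scores, size=len(scores), replace=True).mean()
        for _ in range(n_boot)
    ])
    lower, upper = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return scores.mean(), lower, upper

# Hypothetical per-question context_relevance scores for two reranking configurations.
scores_a = np.array([0.7, 0.4, 0.9, 0.6, 0.8, 0.7, 0.5, 1.0])
scores_b = np.array([0.8, 0.9, 0.6, 0.8, 0.7, 0.9, 0.8, 1.0])

for name, scores in [("Reranker A", scores_a), ("Reranker B", scores_b)]:
    mean, lo, hi = bootstrap_ci(scores)
    print(f"{name}: mean={mean:.2f}, 95% CI=({lo:.2f}, {hi:.2f})")
```

If the two intervals overlap substantially (as they would with only a handful of questions), the observed 0.7 vs. 0.8 difference should not be treated as conclusive.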

There are different approaches that could be implemented for computing this uncertainty, each with different statistical properties.

This ticket covers the implementation of uncertainty estimation methods (such as confidence intervals) for all three cases: AI-model-based metrics, direct human feedback, and a combination of both.
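For the combined case, one recent family of methods is prediction-powered inference (Angelopoulos et al., 2023), which uses a small human-annotated subset to correct the bias of AI-based scores while retaining the sample size of the larger AI-scored set. The sketch below is a minimal normal-approximation version for estimating a mean metric; the function and variable names are illustrative assumptions, not existing accelerator APIs.

```python
# Minimal sketch of a prediction-powered style mean estimate with a normal-approximation CI.
# Assumes: many AI-scored examples, plus a small subset with both AI and human scores.
import numpy as np
from scipy import stats

def pp_mean_ci(ai_scores_unlabeled: np.ndarray,
               ai_scores_labeled: np.ndarray,
               human_scores_labeled: np.ndarray,
               alpha: float = 0.05) -> tuple[float, float, float]:
    """Return (estimate, lower, upper) combining AI scores with human annotations."""
    # Rectifier: average human/AI disagreement on the annotated subset.
    rectifier = human_scores_labeled - ai_scores_labeled
    estimate = ai_scores_unlabeled.mean() + rectifier.mean()
    # Variance combines the spread of AI scores and of the rectifier.
    var = (ai_scores_unlabeled.var(ddof=1) / len(ai_scores_unlabeled)
           + rectifier.var(ddof=1) / len(rectifier))
    half_width = stats.norm.ppf(1 - alpha / 2) * np.sqrt(var)
    return estimate, estimate - half_width, estimate + half_width
```

When the AI-based metric tracks human judgments closely, the rectifier variance is small and the interval is much tighter than one based on the human-annotated subset alone.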