The RAG Experiment Accelerator is a versatile tool designed to expedite and facilitate the process of conducting experiments and evaluations using Azure Cognitive Search and the RAG pattern.
We currently leverage some LLM-based evaluation metrics from ragas (https://github.com/explodinggradients/ragas), namely `llm_context_precision`, `llm_context_recall`, and `llm_answer_relevance`, in the function `compute_llm_based_score`. These form the RAG triad of metrics.
For RAG use cases, however, we have an alternative LLM-as-a-judge framework provided by promptflow-evals (supported by Microsoft and part of promptflow): https://pypi.org/project/promptflow-evals/
This evaluation framework has quality metrics such as relevance that can be leveraged for answer relevance or context precision, and it has a targeted prompt for groundedness. promptflow-evals also has other quality metrics such as coherence, style, fluency, and similarity. Moreover, the package can also enable inclusion of safety metrics such as hate/unfairness, violence, and sexual content, among others.
Ideally, this could serve as a replacement for the ragas metrics, but we can integrate promptflow-evals first and decide about removing ragas in a subsequent issue, given that many users might be relying on the ragas metrics.