huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

[FT] Evaluation using a multi-document RAG based on statistical tools and LLM as judge #379

Open louisbrulenaudet opened 3 weeks ago

louisbrulenaudet commented 3 weeks ago

Issue encountered

It would be good to have a system for evaluating both the relevance of the RAG retrieval and the LLM's use of it in producing the response. My first intuition would be a multi-stage system with, on the one hand, an evaluation in the manner of Sentence Transformers' information-retrieval evaluator (@tomaarsen), and on the other hand, an evaluation with certain metrics of the model's ability to interpret the documents from the context provided by the RAG system (multi-document).
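For the retrieval stage, a minimal sketch of what that first evaluator could look like, using sentence-transformers' InformationRetrievalEvaluator. The model name, corpus, queries and relevance judgments below are placeholders, not data from lighteval:

```python
# Sketch of the retrieval-side evaluation with Sentence Transformers'
# InformationRetrievalEvaluator; the data below is illustrative only.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Toy corpus, queries and relevance judgments standing in for the RAG knowledge base.
corpus = {
    "doc1": "Paris is the capital of France.",
    "doc2": "The Eiffel Tower is located in Paris.",
}
queries = {"q1": "What is the capital of France?"}
relevant_docs = {"q1": {"doc1"}}

ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="rag-retrieval",
)

# Returns retrieval metrics such as MAP, MRR@k and NDCG@k
# (a dict of scores in recent sentence-transformers versions).
retrieval_scores = ir_evaluator(model)
print(retrieval_scores)
```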

Solution/Feature

For the moment, it seems to me that the simplest way to implement this system is to combine two tools, sentence-transformers and DeepEval, and then aggregate the results into a single synthetic indicator. However, this approach pulls in a multitude of libraries and requires wiring the same document database into both the sentence-transformers and DeepEval evaluators. A rough sketch of what the combination could look like is given below.
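The sketch below assumes DeepEval's FaithfulnessMetric and ContextualRelevancyMetric with an LLM judge configured (e.g. an OpenAI API key); the test case data and the aggregation weights are arbitrary placeholders, not a proposed standard:

```python
# Rough sketch: combine a retrieval score (from the InformationRetrievalEvaluator
# above) with LLM-as-judge generation metrics from DeepEval into one synthetic
# indicator. DeepEval needs a judge model configured, e.g. via OPENAI_API_KEY.
from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=[
        "Paris is the capital of France.",
        "The Eiffel Tower is located in Paris.",
    ],
)

faithfulness = FaithfulnessMetric()
relevancy = ContextualRelevancyMetric()
faithfulness.measure(test_case)   # sets .score in [0, 1]
relevancy.measure(test_case)

# retrieval_score would come from the IR evaluator (e.g. its NDCG@10);
# the weights below are arbitrary placeholders for the synthetic indicator.
retrieval_score = 0.82
synthetic_indicator = (
    0.4 * retrieval_score
    + 0.3 * faithfulness.score
    + 0.3 * relevancy.score
)
print(f"RAG synthetic score: {synthetic_indicator:.3f}")
```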

tomaarsen commented 3 weeks ago

On the information retrieval (IR) side of things, we have a solid and extensive leaderboard called BEIR (Benchmark for Information Retrieval), which is included in MTEB (Massive Text Embedding Benchmark) under the Retrieval tab: https://huggingface.co/spaces/mteb/leaderboard. This will soon be extended to MMTEB with more languages, etc.
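For reference, a minimal sketch of how a retrieval model can be scored on one of the BEIR datasets through the mteb package; NFCorpus is used only as an example task, and the exact API varies slightly across mteb versions:

```python
# Score a sentence-transformers model on a single BEIR retrieval task via mteb.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# NFCorpus is one of the BEIR retrieval datasets included in MTEB's Retrieval tab.
evaluation = MTEB(tasks=["NFCorpus"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```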

However, as you've noticed, this is only one part of the puzzle for RAG applications. I'm not very familiar with the various RAG evaluation tools out there, so I can't provide much feedback on that part, I'm afraid.

NathanHB commented 3 weeks ago

Hi! Thanks for the issue. I'm not very familiar with RAG evaluation either, and adding it to lighteval would require some effort. If you are familiar with it, we would love it if you could open a PR with a proof of concept!

Adding RAG would require adding a RAG evaluation function in the base_model.py model, as well as providing the documents in the requests. Do you have a particular benchmark in mind to add?
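To make that concrete, here is a purely hypothetical sketch of those two pieces. Neither RAGRequest nor rag_greedy_until exists in lighteval today; this only illustrates how retrieved documents could travel with a request down to the model:

```python
# Hypothetical sketch only: a request type that carries retrieved documents,
# and a base_model.py-style hook that uses them during generation.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class RAGRequest:
    """Hypothetical generation request that also carries the retrieved documents."""
    task_name: str
    example_index: int
    context: str                                          # the prompt/question
    documents: list[str] = field(default_factory=list)    # retrieved passages
    stop_sequences: list[str] = field(default_factory=list)
    generation_size: int = 256


def rag_greedy_until(model, requests: list[RAGRequest]) -> list[str]:
    """Hypothetical hook: prepend the retrieved documents to the prompt before
    delegating to the model's regular greedy generation (model.generate is a
    stand-in for whatever generation method the backend exposes)."""
    outputs = []
    for req in requests:
        prompt = "\n\n".join(req.documents) + "\n\n" + req.context
        outputs.append(model.generate(prompt, max_new_tokens=req.generation_size))
    return outputs
```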