huggingface / lighteval

Lighteval is your all-in-one toolkit for evaluating LLMs across multiple backends

[FT] Evaluation using a multi-document RAG based on statistical tools and LLM as judge #379

Open louisbrulenaudet opened 3 weeks ago

louisbrulenaudet commented 3 weeks ago

Issue encountered

It would be good to have a system for evaluating both the relevance of the RAG retrieval and the LLM's use of it in producing the response. My first intuition would be a multi-stage system with, on the one hand, an evaluation in the manner of Sentence Transformers' information-retrieval evaluator (@tomaarsen), and on the other hand, an evaluation with certain metrics of the model's ability to interpret the documents from the context provided by the RAG system (multi-document).
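For the retrieval stage, a minimal sketch of what that first evaluator could look like, using sentence-transformers' InformationRetrievalEvaluator. The model name, corpus, queries and relevance judgments below are placeholders, not data from lighteval:

```python
# Sketch of the retrieval-side evaluation with Sentence Transformers'
# InformationRetrievalEvaluator; the data below is illustrative only.
from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Toy corpus, queries and relevance judgments standing in for the RAG knowledge base.
corpus = {
    "doc1": "Paris is the capital of France.",
    "doc2": "The Eiffel Tower is located in Paris.",
}
queries = {"q1": "What is the capital of France?"}
relevant_docs = {"q1": {"doc1"}}

ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="rag-retrieval",
)

# Returns retrieval metrics such as MAP, MRR@k and NDCG@k
# (a dict of scores in recent sentence-transformers versions).
retrieval_scores = ir_evaluator(model)
print(retrieval_scores)
```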

Solution/Feature

For the moment, it seems to me that the simplest way to implement this system is to combine two tools, sentence-transformers and DeepEval, and then aggregate the results into a single synthetic indicator. However, this approach pulls in a multitude of libraries and requires wiring the same document database into both the sentence-transformers and DeepEval evaluators. A rough sketch of what the combination could look like is given below.
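The sketch below assumes DeepEval's FaithfulnessMetric and ContextualRelevancyMetric with an LLM judge configured (e.g. an OpenAI API key); the test case data and the aggregation weights are arbitrary placeholders, not a proposed standard:

```python
# Rough sketch: combine a retrieval score (from the InformationRetrievalEvaluator
# above) with LLM-as-judge generation metrics from DeepEval into one synthetic
# indicator. DeepEval needs a judge model configured, e.g. via OPENAI_API_KEY.
from deepeval.metrics import FaithfulnessMetric, ContextualRelevancyMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=[
        "Paris is the capital of France.",
        "The Eiffel Tower is located in Paris.",
    ],
)

faithfulness = FaithfulnessMetric()
relevancy = ContextualRelevancyMetric()
faithfulness.measure(test_case)   # sets .score in [0, 1]
relevancy.measure(test_case)

# retrieval_score would come from the IR evaluator (e.g. its NDCG@10);
# the weights below are arbitrary placeholders for the synthetic indicator.
retrieval_score = 0.82
synthetic_indicator = (
    0.4 * retrieval_score
    + 0.3 * faithfulness.score
    + 0.3 * relevancy.score
)
print(f"RAG synthetic score: {synthetic_indicator:.3f}")
```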

tomaarsen commented 3 weeks ago

On the information retrieval (IR) side of things, we have a solid and extensive leaderboard called BEIR (Benchmark for Information Retrieval), which is included in MTEB (Massive Text Embedding Benchmark) under the Retrieval tab: https://huggingface.co/spaces/mteb/leaderboard. This will soon be extended to MMTEB with more languages, etc.
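For reference, a minimal sketch of how a retrieval model can be scored on one of the BEIR datasets through the mteb package; NFCorpus is used only as an example task, and the exact API varies slightly across mteb versions:

```python
# Score a sentence-transformers model on a single BEIR retrieval task via mteb.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# NFCorpus is one of the BEIR retrieval datasets included in MTEB's Retrieval tab.
evaluation = MTEB(tasks=["NFCorpus"])
results = evaluation.run(model, output_folder="mteb_results")
print(results)
```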

However, as you've noticed, this is only one part of the puzzle for RAG applications. I'm not very familiar with the various RAG evaluation tools out there, so I can't provide much feedback on that part, I'm afraid.

NathanHB commented 3 weeks ago

Hi! Thanks for the issue. I'm not very familiar with RAG evaluation either, and adding it to lighteval would require some effort. If you are familiar with it, we would love it if you could open a PR with a proof of concept!

Adding RAG would require adding a RAG evaluation function in the base_model.py model, as well as providing the documents in the requests. Do you have a particular benchmark in mind to add?
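To make that concrete, here is a purely hypothetical sketch of those two pieces. Neither RAGRequest nor rag_greedy_until exists in lighteval today; this only illustrates how retrieved documents could travel with a request down to the model:

```python
# Hypothetical sketch only: a request type that carries retrieved documents,
# and a base_model.py-style hook that uses them during generation.
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class RAGRequest:
    """Hypothetical generation request that also carries the retrieved documents."""
    task_name: str
    example_index: int
    context: str                                          # the prompt/question
    documents: list[str] = field(default_factory=list)    # retrieved passages
    stop_sequences: list[str] = field(default_factory=list)
    generation_size: int = 256


def rag_greedy_until(model, requests: list[RAGRequest]) -> list[str]:
    """Hypothetical hook: prepend the retrieved documents to the prompt before
    delegating to the model's regular greedy generation (model.generate is a
    stand-in for whatever generation method the backend exposes)."""
    outputs = []
    for req in requests:
        prompt = "\n\n".join(req.documents) + "\n\n" + req.context
        outputs.append(model.generate(prompt, max_new_tokens=req.generation_size))
    return outputs
```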