run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License
33.38k stars 4.67k forks

[Question]: How to use LabelledRagDataset with RetrieverEvaluator or BatchEvalRunner #10445

Open austinmw opened 5 months ago

austinmw commented 5 months ago


Question

Hi, I have a LabelledRagDataset created with RagDatasetGenerator. Now how can I use RetrieverEvaluator or BatchEvalRunner with this? Are they compatible?

The documentation only mentions using it with a RagEvaluatorPack which is not customizable enough. My goal is to measure Hit Rate, MRR, Context Relevance, and Faithfulness, using Bedrock LLMs.
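For reference, the two retrieval metrics in question reduce to a few lines of plain Python. This framework-independent sketch shows what they measure (the function names are illustrative, not part of any LlamaIndex API):

```python
def hit_rate(retrieved_ids, relevant_ids):
    """1.0 if any relevant document appears in the retrieved list, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in retrieved_ids) else 0.0

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant retrieved document (0.0 if none)."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Example: the first relevant document appears at rank 2.
retrieved = ["d3", "d1", "d7"]
relevant = {"d1"}
print(hit_rate(retrieved, relevant))  # 1.0
print(mrr(retrieved, relevant))       # 0.5
```

Averaging these per-query scores over a dataset gives the aggregate Hit Rate and MRR numbers that retrieval evaluators report.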

dosubot[bot] commented 5 months ago

🤖

Based on the information provided in the context, it seems that LabelledRagDataset can be used with RagEvaluatorPack, as shown in the code snippet. However, neither RetrieverEvaluator nor BatchEvalRunner is mentioned in the context, so it's unclear how they can be used with LabelledRagDataset.

As for customizing the LLM, embedding model, and prompts, and adding hit rate and MRR metrics, the context does not provide enough information to give a complete answer. These customizations would likely require modifying the RagEvaluatorPack or creating a custom evaluator.
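As a starting point for such a custom evaluator, one option is to iterate over the dataset's examples and hand their fields to whichever evaluators you choose. This minimal stand-alone sketch uses a plain dataclass in place of the real LabelledRagDataExample (which, per the LlamaIndex source, carries query, reference_answer, and reference_contexts fields); the helper name is illustrative:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RagExample:
    # Stand-in for LabelledRagDataExample's main fields.
    query: str
    reference_answer: str
    reference_contexts: List[str] = field(default_factory=list)

def extract_eval_inputs(examples):
    """Split a list of examples into the parallel lists most batch
    evaluators expect: queries plus gold answers and contexts."""
    queries = [ex.query for ex in examples]
    answers = [ex.reference_answer for ex in examples]
    contexts = [ex.reference_contexts for ex in examples]
    return queries, answers, contexts

examples = [
    RagExample("What is MRR?", "Mean reciprocal rank.", ["MRR is ..."]),
    RagExample("What is hit rate?", "Fraction of queries with a hit.", []),
]
queries, answers, contexts = extract_eval_inputs(examples)
print(queries)  # ['What is MRR?', 'What is hit rate?']
```

The resulting lists could then be passed to a batch-style evaluator, with the gold answers and contexts available for correctness- or faithfulness-style checks.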

Here is the code snippet from the context that shows how to use LabelledRagDataset with RagEvaluatorPack:

from llama_index.llama_pack import download_llama_pack

# Download the pack's source to ./pack and import its entry class
RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./pack")

rag_evaluator = RagEvaluatorPack(
    query_engine=query_engine,  # built with the same source Documents as the rag_dataset
    rag_dataset=rag_dataset,
)
benchmark_df = await rag_evaluator.run()

The query_engine should be built from the same source Documents as the rag_dataset. The resulting benchmark_df contains the mean scores for evaluation measures such as Correctness, Relevancy, Faithfulness, and Context Similarity.
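Note that the bare `await` in the snippet assumes an already-running event loop (e.g. a Jupyter notebook). In a plain script you would drive the coroutine with `asyncio.run` instead; this sketch substitutes a stub coroutine for the pack's async `run()` so it stands alone:

```python
import asyncio

async def run_evaluation():
    # In real code this line would be: benchmark_df = await rag_evaluator.run()
    # A stub coroutine stands in for the pack's async run() method here.
    async def stub_run():
        return {"mean_correctness_score": 4.2}
    return await stub_run()

# asyncio.run creates the event loop and awaits the coroutine to completion.
benchmark_df = asyncio.run(run_evaluation())
print(benchmark_df)  # {'mean_correctness_score': 4.2}
```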

You can find more information in the evaluating with LlamaDatasets guide.

If you could provide more details about your use case or the specific customizations you're trying to make, I might be able to give a more precise answer.

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

austinmw commented 5 months ago

It looks like:

So currently we have three different evaluation mechanisms that each require a different dataset format. LabelledRagDataset seems to be the newer abstraction; could it serve as a single dataset format across all of these, so that multiple formats of essentially the same information are no longer needed?
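One way to prototype that unification today is a small adapter that projects the richer dataset down to each evaluator's expected shape. A hypothetical sketch using plain dicts (the field names mirror, but are not, the real LlamaIndex classes, and the generated doc ids are placeholders):

```python
def to_retrieval_format(examples):
    """Project (query, reference_contexts) pairs into the queries/relevant_docs
    mapping style used by retrieval evaluators (hit rate, MRR)."""
    queries = {}
    relevant_docs = {}
    for i, ex in enumerate(examples):
        qid = f"q{i}"
        queries[qid] = ex["query"]
        # Use a synthetic id per reference context as a stand-in doc id.
        relevant_docs[qid] = [
            f"doc_{i}_{j}" for j, _ in enumerate(ex["reference_contexts"])
        ]
    return queries, relevant_docs

examples = [
    {"query": "What is RAG?", "reference_contexts": ["ctx a", "ctx b"]},
    {"query": "What is MRR?", "reference_contexts": ["ctx c"]},
]
queries, relevant = to_retrieval_format(examples)
print(queries["q0"])   # What is RAG?
print(relevant["q1"])  # ['doc_1_0']
```

A second adapter in the same spirit could emit the query/response pairs that response-level evaluators consume, so one dataset feeds all three mechanisms.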

austinmw commented 5 months ago

Bump 😅

dosubot[bot] commented 1 month ago

Hi, @austinmw,

I'm helping the LlamaIndex team manage our backlog and am marking this issue as stale. From what I understand, you were inquiring about using a LabelledRagDataset with RetrieverEvaluator or BatchEvalRunner and whether they are compatible. There was a discussion about the documentation mentioning the use of LabelledRagDataset with RagEvaluatorPack and the need for more customization. It was suggested that customizations might require modifications to the RagEvaluatorPack or the creation of a custom evaluator. You later pointed out that there are currently three different evaluation mechanisms that require three different dataset formats and suggested that LabelledRagDataset could be used as a single dataset format across all of these.

Is this issue still relevant to the latest version of the LlamaIndex repository? If so, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days.

Thank you for your understanding and contributions to the LlamaIndex project.

Dosu

austinmw commented 1 month ago

not stale.