
[Question]: Evaluating correctness of my RAG solution #13063

Open nshern opened 2 months ago

nshern commented 2 months ago

Question Validation

Question

I am trying to use the built-in capabilities of LlamaIndex to evaluate the correctness of my RAG solution, looking at the example offered here.

It seems that in this example we are evaluating the base gpt-4 model, not our RAG solution.

So my question is: how do I evaluate the correctness of the answers produced by my RAG solution?

dosubot[bot] commented 2 months ago

To evaluate the correctness of your RAG solution with LlamaIndex, especially if it's multi-modal, follow these steps:

  1. Prepare Your Environment: Install necessary packages for multi-modal data handling:

    pip install llama-index-llms-openai llama-index-multi-modal-llms-openai llama-index-multi-modal-llms-replicate
  2. Set Up Your Multi-Modal RAG System: Initialize your system with MultiModalVectorStoreIndex and attach a generator, such as OpenAIMultiModal, with a PromptTemplate for querying (an example template string is sketched after this list).

    from llama_index.core.indices import MultiModalVectorStoreIndex
    from llama_index.multi_modal_llms.openai import OpenAIMultiModal
    from llama_index.core import PromptTemplate
    
    # Example setup: image_nodes and text_nodes are the nodes you have
    # already built from your own data.
    asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)
    openai_mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)
    # your_template_string is a placeholder for your own QA prompt template.
    qa_template = PromptTemplate(your_template_string)
    rag_engine = asl_index.as_query_engine(multi_modal_llm=openai_mm_llm, text_qa_template=qa_template)
  3. Generate and Evaluate Predictions: Use your RAG system to generate predictions, then evaluate them with evaluators such as CorrectnessEvaluator, MultiModalRelevancyEvaluator, and MultiModalFaithfulnessEvaluator to assess correctness, relevancy, and faithfulness (a usage sketch follows this list).

    from llama_index.core.evaluation import CorrectnessEvaluator
    from llama_index.core.evaluation.multi_modal import MultiModalRelevancyEvaluator, MultiModalFaithfulnessEvaluator
    from llama_index.llms.openai import OpenAI  # needed for the judge LLM below
    
    correctness_evaluator = CorrectnessEvaluator(llm=OpenAI(temperature=0, model="gpt-4"))
    relevancy_evaluator = MultiModalRelevancyEvaluator(multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300))
    faithfulness_evaluator = MultiModalFaithfulnessEvaluator(multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300))
  4. Analyze Results: Review the evaluation scores to understand your system's performance and identify areas for improvement (the second sketch below shows how to inspect an evaluation result).
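
For step 2, the template string is whatever QA prompt you want the engine to use; `{context_str}` and `{query_str}` are the variables LlamaIndex fills in. A minimal sketch (the wording below is illustrative, not a template shipped with the library):

    # A hypothetical QA template string for step 2; adapt the wording to your use case.
    your_template_string = (
        "Context information is below.\n"
        "---------------------\n"
        "{context_str}\n"
        "---------------------\n"
        "Given the context information and no prior knowledge, answer the query.\n"
        "Query: {query_str}\n"
        "Answer: "
    )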
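
For steps 3 and 4, a minimal sketch of generating one prediction and judging it, assuming the `rag_engine` and `correctness_evaluator` defined above; the query and reference answer are illustrative placeholders, and in practice you would loop over a labeled evaluation set:

    # Hypothetical labeled example; substitute your own query/reference pairs.
    query = "How do you sign 'thank you' in ASL?"
    reference_answer = "Touch your chin with a flat hand and move it forward."
    
    # Step 3: generate a prediction with the RAG engine, then judge it against the reference.
    response = rag_engine.query(query)
    result = correctness_evaluator.evaluate(
        query=query,
        response=str(response),
        reference=reference_answer,
    )
    
    # Step 4: inspect the result. CorrectnessEvaluator scores on a 1-5 scale;
    # `passing` is True once the score clears the evaluator's threshold (4.0 by default).
    print(result.score, result.passing)
    print(result.feedback)

The relevancy and faithfulness evaluators are applied to the same response object; see the multi-modal evaluation notebook linked in the sources below for their exact call signatures.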

For detailed guidance, refer to the multi-modal RAG evaluation guide and the RAG evaluator base code in the LlamaIndex repository. This approach will help you effectively assess the correctness of your RAG solution's answers.
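
If your RAG solution is text-only, as the question suggests, the RagEvaluatorPack referenced above is a more direct route: it runs a labeled RAG dataset through your query engine and aggregates correctness, relevancy, and faithfulness scores. A minimal sketch, assuming the PaulGrahamEssayDataset as example data (the dataset name and download paths are illustrative):

    from llama_index.core import VectorStoreIndex
    from llama_index.core.llama_dataset import download_llama_dataset
    from llama_index.core.llama_pack import download_llama_pack
    
    # Download a labeled dataset of query/reference-answer pairs plus its source docs.
    rag_dataset, documents = download_llama_dataset("PaulGrahamEssayDataset", "./data")
    
    # Build the query engine under evaluation (substitute your own RAG pipeline here).
    query_engine = VectorStoreIndex.from_documents(documents).as_query_engine()
    
    # The pack queries the engine over the whole dataset and returns a benchmark table.
    RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
    rag_evaluator = RagEvaluatorPack(query_engine=query_engine, rag_dataset=rag_dataset)
    benchmark_df = rag_evaluator.run()
    print(benchmark_df)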

#### Details

**Sources**

- [llama-index-packs/llama-index-packs-rag-evaluator/llama_index/packs/rag_evaluator/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-rag-evaluator/llama_index/packs/rag_evaluator/base.py)
- [docs/docs/examples/evaluation/multi_modal/multi_modal_rag_evaluation.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/evaluation/multi_modal/multi_modal_rag_evaluation.ipynb)
- [llama-index-legacy/llama_index/legacy/evaluation/correctness.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/evaluation/correctness.py)
- [llama-index-core/llama_index/core/evaluation/correctness.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/correctness.py)
- [llama-index-packs/llama-index-packs-rag-evaluator/examples/example.py](https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-rag-evaluator/examples/example.py)
- [llama-index-core/llama_index/core/llama_dataset/rag.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/llama_dataset/rag.py)