
[Question]: Evaluating correctness of my RAG solution #13063

Open nshern opened 2 months ago

nshern commented 2 months ago

Question Validation

Question

I am trying to use the built-in capabilities of LlamaIndex to evaluate the correctness of my RAG solution, looking at the example offered here.

It seems that in this example we are evaluating the base gpt-4 model, not our RAG solution.

So my question is: how do I evaluate the correctness of the answers produced by my RAG solution?

dosubot[bot] commented 2 months ago

To evaluate the correctness of your RAG solution with LlamaIndex, especially if it's multi-modal, follow these steps:

  1. Prepare Your Environment: Install necessary packages for multi-modal data handling:

    pip install llama-index-llms-openai llama-index-multi-modal-llms-openai llama-index-multi-modal-llms-replicate
  2. Set Up Your Multi-Modal RAG System: Initialize your system with MultiModalVectorStoreIndex and attach a generator, such as OpenAIMultiModal, with a PromptTemplate for querying (an example template string is sketched after this list).

    from llama_index.core.indices import MultiModalVectorStoreIndex
    from llama_index.multi_modal_llms.openai import OpenAIMultiModal
    from llama_index.core import PromptTemplate
    
    # Example setup: image_nodes and text_nodes are the nodes you have
    # already built from your own data.
    asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)
    openai_mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)
    # your_template_string is a placeholder for your own QA prompt template.
    qa_template = PromptTemplate(your_template_string)
    rag_engine = asl_index.as_query_engine(multi_modal_llm=openai_mm_llm, text_qa_template=qa_template)
  3. Generate and Evaluate Predictions: Use your RAG system to generate predictions, then evaluate them with evaluators such as CorrectnessEvaluator, MultiModalRelevancyEvaluator, and MultiModalFaithfulnessEvaluator to assess correctness, relevancy, and faithfulness (a usage sketch follows this list).

    from llama_index.core.evaluation import CorrectnessEvaluator
    from llama_index.core.evaluation.multi_modal import MultiModalRelevancyEvaluator, MultiModalFaithfulnessEvaluator
    from llama_index.llms.openai import OpenAI  # needed for the judge LLM below
    
    correctness_evaluator = CorrectnessEvaluator(llm=OpenAI(temperature=0, model="gpt-4"))
    relevancy_evaluator = MultiModalRelevancyEvaluator(multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300))
    faithfulness_evaluator = MultiModalFaithfulnessEvaluator(multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300))
  4. Analyze Results: Review the evaluation scores to understand your system's performance and identify areas for improvement (the second sketch below shows how to inspect an evaluation result).
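
For step 2, the template string is whatever QA prompt you want the engine to use; `{context_str}` and `{query_str}` are the variables LlamaIndex fills in. A minimal sketch (the wording below is illustrative, not a template shipped with the library):

    # A hypothetical QA template string for step 2; adapt the wording to your use case.
    your_template_string = (
        "Context information is below.\n"
        "---------------------\n"
        "{context_str}\n"
        "---------------------\n"
        "Given the context information and no prior knowledge, answer the query.\n"
        "Query: {query_str}\n"
        "Answer: "
    )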
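
For steps 3 and 4, a minimal sketch of generating one prediction and judging it, assuming the `rag_engine` and `correctness_evaluator` defined above; the query and reference answer are illustrative placeholders, and in practice you would loop over a labeled evaluation set:

    # Hypothetical labeled example; substitute your own query/reference pairs.
    query = "How do you sign 'thank you' in ASL?"
    reference_answer = "Touch your chin with a flat hand and move it forward."
    
    # Step 3: generate a prediction with the RAG engine, then judge it against the reference.
    response = rag_engine.query(query)
    result = correctness_evaluator.evaluate(
        query=query,
        response=str(response),
        reference=reference_answer,
    )
    
    # Step 4: inspect the result. CorrectnessEvaluator scores on a 1-5 scale;
    # `passing` is True once the score clears the evaluator's threshold (4.0 by default).
    print(result.score, result.passing)
    print(result.feedback)

The relevancy and faithfulness evaluators are applied to the same response object; see the multi-modal evaluation notebook linked in the sources below for their exact call signatures.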

For detailed guidance, refer to the multi-modal RAG evaluation guide and the RAG evaluator base code in the LlamaIndex repository. This approach will help you effectively assess the correctness of your RAG solution's answers.
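
If your RAG solution is text-only, as the question suggests, the RagEvaluatorPack referenced above is a more direct route: it runs a labeled RAG dataset through your query engine and aggregates correctness, relevancy, and faithfulness scores. A minimal sketch, assuming the PaulGrahamEssayDataset as example data (the dataset name and download paths are illustrative):

    from llama_index.core import VectorStoreIndex
    from llama_index.core.llama_dataset import download_llama_dataset
    from llama_index.core.llama_pack import download_llama_pack
    
    # Download a labeled dataset of query/reference-answer pairs plus its source docs.
    rag_dataset, documents = download_llama_dataset("PaulGrahamEssayDataset", "./data")
    
    # Build the query engine under evaluation (substitute your own RAG pipeline here).
    query_engine = VectorStoreIndex.from_documents(documents).as_query_engine()
    
    # The pack queries the engine over the whole dataset and returns a benchmark table.
    RagEvaluatorPack = download_llama_pack("RagEvaluatorPack", "./rag_evaluator_pack")
    rag_evaluator = RagEvaluatorPack(query_engine=query_engine, rag_dataset=rag_dataset)
    benchmark_df = rag_evaluator.run()
    print(benchmark_df)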

#### Details

**Sources**

- [llama-index-packs/llama-index-packs-rag-evaluator/llama_index/packs/rag_evaluator/base.py](https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-rag-evaluator/llama_index/packs/rag_evaluator/base.py)
- [docs/docs/examples/evaluation/multi_modal/multi_modal_rag_evaluation.ipynb](https://github.com/run-llama/llama_index/blob/main/docs/docs/examples/evaluation/multi_modal/multi_modal_rag_evaluation.ipynb)
- [llama-index-legacy/llama_index/legacy/evaluation/correctness.py](https://github.com/run-llama/llama_index/blob/main/llama-index-legacy/llama_index/legacy/evaluation/correctness.py)
- [llama-index-core/llama_index/core/evaluation/correctness.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/evaluation/correctness.py)
- [llama-index-packs/llama-index-packs-rag-evaluator/examples/example.py](https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-rag-evaluator/examples/example.py)
- [llama-index-core/llama_index/core/llama_dataset/rag.py](https://github.com/run-llama/llama_index/blob/main/llama-index-core/llama_index/core/llama_dataset/rag.py)