nshern opened this issue 2 months ago (Open)
To evaluate the correctness of your RAG solution with LlamaIndex, especially if it's multi-modal, follow these steps:
Prepare Your Environment: Install the packages needed for multi-modal data handling:

```bash
pip install llama-index-llms-openai llama-index-multi-modal-llms-openai llama-index-multi-modal-llms-replicate
```
Set Up Your Multi-Modal RAG System: Initialize your index with `MultiModalVectorStoreIndex` and attach a generator, such as `OpenAIMultiModal`, with a `PromptTemplate` for querying.
```python
from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core import PromptTemplate

# Example setup: image_nodes, text_nodes, and your_template_string
# are placeholders you must define for your own data and prompt.
asl_index = MultiModalVectorStoreIndex(image_nodes + text_nodes)
openai_mm_llm = OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300)
qa_template = PromptTemplate(your_template_string)
rag_engine = asl_index.as_query_engine(multi_modal_llm=openai_mm_llm, text_qa_template=qa_template)
```
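The `image_nodes` and `text_nodes` above have to come from your own corpus. One way to prepare them is sketched below; the directory layout and the `build_nodes` helper are illustrative assumptions, not part of the LlamaIndex API, and the loaded documents are passed straight through as nodes for brevity (in LlamaIndex, `Document` and `ImageDocument` subclass the node types):

```python
from pathlib import Path

# Extensions routed to the image pipeline; everything else is treated as text.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif", ".webp"}

def split_by_modality(paths):
    """Partition file paths into (image_files, text_files) by extension."""
    images, texts = [], []
    for p in paths:
        (images if Path(p).suffix.lower() in IMAGE_EXTS else texts).append(p)
    return images, texts

def build_nodes(data_dir):
    """Load a mixed directory into image nodes and text nodes (sketch).

    The llama-index import is local so the pure helper above stays
    usable without the library installed.
    """
    from llama_index.core import SimpleDirectoryReader

    image_files, text_files = split_by_modality(
        str(p) for p in Path(data_dir).iterdir() if p.is_file()
    )
    image_nodes = SimpleDirectoryReader(input_files=image_files).load_data()
    text_nodes = SimpleDirectoryReader(input_files=text_files).load_data()
    return image_nodes, text_nodes
```

The extension-based split mirrors what `SimpleDirectoryReader` does internally (image files become `ImageDocument`s), but making it explicit keeps the two node lists separate, as the index constructor above expects.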
Generate and Evaluate Predictions: Use your RAG system to generate predictions, then assess them with evaluators such as `CorrectnessEvaluator`, `MultiModalRelevancyEvaluator`, and `MultiModalFaithfulnessEvaluator` for correctness, relevancy, and faithfulness.
```python
from llama_index.llms.openai import OpenAI  # needed for the judge LLM below
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.evaluation.multi_modal import MultiModalRelevancyEvaluator, MultiModalFaithfulnessEvaluator

correctness_evaluator = CorrectnessEvaluator(llm=OpenAI(temperature=0, model="gpt-4"))
relevancy_evaluator = MultiModalRelevancyEvaluator(multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300))
faithfulness_evaluator = MultiModalFaithfulnessEvaluator(multi_modal_llm=OpenAIMultiModal(model="gpt-4-vision-preview", max_new_tokens=300))
```
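The snippets above construct the evaluators but never call them. A minimal sketch of the loop that ties them together — it assumes the query engine and evaluators from the snippets above, plus a hypothetical set of gold question/answer pairs you supply yourself:

```python
def evaluate_rag(rag_engine, correctness_evaluator, relevancy_evaluator,
                 faithfulness_evaluator, questions, gold_answers):
    """Query the RAG engine for each gold question and score its answer.

    Returns one dict of scores per question. Evaluator calls follow the
    LlamaIndex evaluator interface: CorrectnessEvaluator compares the
    answer text against a gold reference, while the multi-modal
    evaluators judge the full response against its retrieved context.
    """
    results = []
    for question, reference in zip(questions, gold_answers):
        response = rag_engine.query(question)  # prediction from *your* RAG system
        results.append({
            "question": question,
            "correctness": correctness_evaluator.evaluate(
                query=question, response=str(response), reference=reference
            ).score,
            "relevancy": relevancy_evaluator.evaluate_response(
                query=question, response=response
            ).passing,
            "faithfulness": faithfulness_evaluator.evaluate_response(
                query=question, response=response
            ).passing,
        })
    return results
```

This is the step the question is really about: the answers being scored come from `rag_engine.query(...)`, i.e. your own RAG pipeline, not from a bare GPT-4 call.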
Analyze Results: Review the evaluation scores to understand your system's performance and identify improvement areas.
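For the analysis step, per-question scores are easier to act on once aggregated. A small sketch, assuming each result is a dict with a 1–5 `correctness` score and boolean `relevancy`/`faithfulness` flags (the shape produced by a typical evaluation loop over the evaluators above):

```python
def summarize(results):
    """Aggregate per-question scores into overall metrics: mean
    correctness (1-5 scale) and pass rates for relevancy/faithfulness."""
    n = len(results)
    if n == 0:
        return {}
    return {
        "mean_correctness": sum(r["correctness"] for r in results) / n,
        "relevancy_pass_rate": sum(r["relevancy"] for r in results) / n,
        "faithfulness_pass_rate": sum(r["faithfulness"] for r in results) / n,
    }
```

A low faithfulness pass rate with high correctness usually points at retrieval problems (the model answers from prior knowledge, not the retrieved context), which is a useful signal for where to improve.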
For detailed guidance, refer to the multi-modal RAG evaluation guide and the RAG evaluator base code in the LlamaIndex repository. This approach will help you effectively assess the correctness of your RAG solution's answers.
Question Validation
Question
I am trying to use the built-in capabilities of LlamaIndex to evaluate the correctness of my RAG solution. Looking at the example offered here, it seems like we are evaluating against the base gpt-4 model, not against our RAG solution.
So my question is: how do I evaluate the correctness of the answers from my RAG solution?