run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Issues with Context Generation and Metric Support in Llama-Index Evaluation for RAG Retrieval Service #15939

Open adityamity opened 1 week ago

adityamity commented 1 week ago

Question

When evaluating a RAG retrieval service using the llama-index evaluation method, I encountered the following issues:

In the first step, we generate question-context pairs using the code:

from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

However, the resulting object is an EmbeddingQAFinetuneDataset containing 'queries', 'corpus', and 'relevant_docs'. I don't see any 'context' in it. How is the evaluation performed using only the 'queries'? Is there any way to generate ground_truth along with context?

When using RetrieverEvaluator, I noticed it supports only 'hit_rate' and 'mrr'. Attempting to use 'precision' and 'recall' results in the error: ValueError: Invalid metric name: precision. Could you confirm which version of llama-index supports all metrics, including precision and recall? For reference, I am using the following versions:

llama-index==0.10.13.post1
llama-index-agent-openai==0.1.5
llama-index-cli==0.1.5
llama-index-core==0.10.13
dosubot[bot] commented 1 week ago

The EmbeddingQAFinetuneDataset object returned by generate_question_context_pairs contains 'queries', 'corpus', and 'relevant_docs', but not 'context'. The evaluation using RetrieverEvaluator is performed using the 'queries' and 'relevant_docs' fields. The 'queries' represent the questions, and 'relevant_docs' are the corresponding relevant document IDs [1].
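
For illustration, here is a minimal sketch (assuming qa_dataset is the EmbeddingQAFinetuneDataset produced above) of how the context text for a query can be looked up from corpus via relevant_docs:

sample_id, sample_query = list(qa_dataset.queries.items())[0]

# relevant_docs maps a query id to the ids of its source context nodes
expected_node_ids = qa_dataset.relevant_docs[sample_id]

# corpus maps a node id to that node's text, i.e. the retrieval "context"
expected_contexts = [qa_dataset.corpus[node_id] for node_id in expected_node_ids]

print(sample_query)
print(expected_contexts[0][:200])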

To generate ground_truth along with context, you can use the LabelledEvaluatorDataExample class, which includes fields for ground_truth_answer and contexts. This class allows for a structured way to generate and evaluate ground_truth along with context within the LlamaIndex framework [2][3].

Regarding the support for evaluation metrics, the RetrieverEvaluator class supports metrics such as 'hit_rate', 'mrr', 'precision', 'recall', 'ap', 'ndcg', and 'cohere_rerank_relevancy' (optional). However, if you encounter a ValueError for 'precision' and 'recall', it might be due to version-specific changes. The latest versions of LlamaIndex should support these metrics, and the test cases validate their computation [4][5][6].

Ensure you are using the latest version of LlamaIndex to access all supported metrics. If you are still facing issues, consider updating to the most recent version.

adityamity commented 1 week ago

@dosu: In the code below, I am passing my query_engine

from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr"]#, "precision", "recall", "ap", "ndcg"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=query_engine
)

My query_engine type is llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine.

so I am getting this error:

ValidationError: 1 validation error for RetrieverEvaluator
retriever
  instance of BaseRetriever expected (type=type_error.arbitrary_type; expected_arbitrary_type=BaseRetriever)
dosubot[bot] commented 1 week ago

The error you're encountering is due to the fact that RetrieverEvaluator expects an instance of BaseRetriever, not RetrieverQueryEngine. You need to pass a BaseRetriever object to the RetrieverEvaluator.

Here's how you can modify your code to create a BaseRetriever from your RetrieverQueryEngine:

from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.core.query_engine.retriever_query_engine import RetrieverQueryEngine

# Assuming you have a RetrieverQueryEngine instance
query_engine = RetrieverQueryEngine(...)

# Extract the retriever from the query engine
retriever = query_engine._retriever

# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

# Create the RetrieverEvaluator with the extracted retriever
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

This should resolve the ValidationError by ensuring that the RetrieverEvaluator receives a BaseRetriever instance as expected [1].
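
Alternatively, as a sketch (assuming you still have access to the vector_index the query engine was built from), you can construct the retriever directly from the index instead of reaching into the engine's private attribute:

# Build a retriever straight from the index
retriever = vector_index.as_retriever(similarity_top_k=3)

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    ["hit_rate", "mrr"], retriever=retriever
)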

adityamity commented 1 week ago

@dosu I am using the below code for evaluating ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]:

from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In the above code, qa_dataset contains only 'queries', 'corpus', and 'relevant_docs', which does not fulfill my requirement: I want to test my RAG based on ground_truth, context, and the answer generated by the RAG.

Can you please provide end-to-end code to perform such a test? I mainly want to focus on evaluating context precision and context recall. It would be great if you could provide the code in the same format as the code I shared.

dosubot[bot] commented 1 week ago

Here's an end-to-end code example to evaluate a RAG retrieval service based on ground_truth, context, and answer, focusing on context precision and context recall using the LlamaIndex framework:

import nest_asyncio
import logging
import sys
import os
import pandas as pd

from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Response,
)

from llama_index.llms.openai import OpenAI

# Attach to the same event-loop
nest_asyncio.apply()

# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Set logger level to INFO

# Clear out any existing handlers
logger.handlers = []
# Set up the StreamHandler to output to sys.stdout
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)  # Set handler level to INFO

# Add the handler to the logger
logger.addHandler(handler)

# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-..."

# Download and load data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()

# Generate questions
gpt4 = OpenAI(model="gpt-4", temperature=0.1)
dataset_generator = DatasetGenerator.from_documents(documents, llm=gpt4, show_progress=True)
eval_dataset = dataset_generator.generate_dataset_from_nodes(num=20)
eval_queries = list(eval_dataset.queries.values())

# Fix evaluation query
eval_query = "How did the author describe their early attempts at writing short stories?"

# Fix GPT-3.5-TURBO LLM for generating response
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")

# Create vector index
vector_index = VectorStoreIndex.from_documents(documents, llm=gpt35)
query_engine = vector_index.as_query_engine()

# Retrieve nodes
retriever = vector_index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(eval_query)

# Faithfulness Evaluator
faithfulness_evaluator = FaithfulnessEvaluator(llm=gpt4)
response_vector = query_engine.query(eval_query)
eval_result_faithfulness = faithfulness_evaluator.evaluate_response(response=response_vector)

# Relevancy Evaluator
relevancy_evaluator = RelevancyEvaluator(llm=gpt4)
eval_result_relevancy = relevancy_evaluator.evaluate_response(query=eval_query, response=response_vector)

# Correctness Evaluator
correctness_evaluator = CorrectnessEvaluator(llm=gpt4)
query = "Can you explain the theory of relativity proposed by Albert Einstein in detail?"
reference = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).

General relativity, published in 1915, extended these ideas to include the effects of gravity. According to general relativity, gravity is not a force between masses, as described by Newton's theory of gravity, but rather the result of the warping of space and time by mass and energy. Massive objects, such as planets and stars, cause a curvature in spacetime, and smaller objects follow curved paths in response to this curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet, causing it to create a depression that other objects (representing smaller masses) naturally move towards.

In essence, general relativity provided a new understanding of gravity, explaining phenomena like the bending of light by gravity (gravitational lensing) and the precession of the orbit of Mercury. It has been confirmed through numerous experiments and observations and has become a fundamental theory in modern physics.
"""
response = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).
However, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects follow curved paths in response to this magnetic curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet with magnets underneath, causing it to create a depression that other objects (representing smaller masses) naturally move towards due to magnetic attraction.
"""
correctness_result = correctness_evaluator.evaluate(query=query, response=response, reference=reference)

# Retrieval Evaluation
from llama_index.core.node_parser import SentenceSplitter

parser = SentenceSplitter(chunk_size=1024, chunk_overlap=100)
nodes = parser.get_nodes_from_documents(documents)
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)
retrieved_nodes = retriever.retrieve(eval_query)

qa_dataset = generate_question_context_pairs(nodes, llm=gpt4, num_questions_per_chunk=2)
retriever_evaluator = RetrieverEvaluator.from_metric_names(["mrr", "hit_rate"], retriever=retriever)

# Evaluate on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result_retriever = retriever_evaluator.evaluate(sample_query, sample_expected)

# Evaluate on the entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()

    metric_df = pd.DataFrame({"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]})
    return metric_df

display_results("top-2 eval", eval_results)

Additionally, you can leverage the RAGChecker framework for a more comprehensive evaluation of your RAG system:

# Install necessary packages
%pip install -qU ragchecker llama-index

# Import necessary libraries
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from ragchecker.integrations.llama_index import response_to_rag_results
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics

# Load documents
documents = SimpleDirectoryReader("path/to/your/documents").load_data()

# Create index
index = VectorStoreIndex.from_documents(documents)

# Create query engine
rag_application = index.as_query_engine()

# User query and ground truth answer
user_query = "What is RAGChecker?"
gt_answer = "RAGChecker is an advanced automatic evaluation framework designed to assess and diagnose Retrieval-Augmented Generation (RAG) systems. It provides a comprehensive suite of metrics and tools for in-depth analysis of RAG performance."

# Get response from LlamaIndex
response_object = rag_application.query(user_query)

# Convert to RAGChecker format
rag_result = response_to_rag_results(
    query=user_query,
    gt_answer=gt_answer,
    response_object=response_object,
)

# Create RAGResults object
rag_results = RAGResults.from_dict({"results": [rag_result]})
print(rag_results)

# Initialize RAGChecker
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32,
)

# Evaluate using RAGChecker
evaluator.evaluate(rag_results, all_metrics)
# Print detailed results
print(rag_results)

This integration provides a powerful tool for developing and refining more effective RAG applications by leveraging RAGChecker's comprehensive metrics [1][2][3].

adityamity commented 1 week ago

@dosu in the provided solution, context recall and precision are not being evaluated. I want to evaluate retrieval with precision and recall.

dosubot[bot] commented 1 week ago

To evaluate retrieval with precision and recall in the LlamaIndex framework, you can use the Precision and Recall classes provided in the llama_index.core.evaluation.retrieval.metrics module. Below is an end-to-end example that includes the evaluation of retrieval with precision and recall:

import nest_asyncio
import logging
import sys
import os
import pandas as pd

from llama_index.core.evaluation import (
    DatasetGenerator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)

from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Response,
)

from llama_index.llms.openai import OpenAI

# Attach to the same event-loop
nest_asyncio.apply()

# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Set logger level to INFO

# Clear out any existing handlers
logger.handlers = []
# Set up the StreamHandler to output to sys.stdout
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)  # Set handler level to INFO

# Add the handler to the logger
logger.addHandler(handler)

# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-..."

# Download and load data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'

reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()

# Generate questions
gpt4 = OpenAI(model="gpt-4", temperature=0.1)
dataset_generator = DatasetGenerator.from_documents(documents, llm=gpt4, show_progress=True)
eval_dataset = dataset_generator.generate_dataset_from_nodes(num=20)
eval_queries = list(eval_dataset.queries.values())

# Fix evaluation query
eval_query = "How did the author describe their early attempts at writing short stories?"

# Fix GPT-3.5-TURBO LLM for generating response
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")

# Create vector index
vector_index = VectorStoreIndex.from_documents(documents, llm=gpt35)
query_engine = vector_index.as_query_engine()

# Create the retriever to evaluate
retriever = vector_index.as_retriever(similarity_top_k=3)

# Parse the documents into nodes and generate question-context pairs over the corpus
from llama_index.core.node_parser import SentenceSplitter

node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
qa_dataset = generate_question_context_pairs(nodes, llm=gpt4, num_questions_per_chunk=2)

# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

# Create the RetrieverEvaluator with the extracted retriever
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

# Evaluate on the entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    precision = full_df["precision"].mean()
    recall = full_df["recall"].mean()

    metric_df = pd.DataFrame({
        "retrievers": [name],
        "hit_rate": [hit_rate],
        "mrr": [mrr],
        "precision": [precision],
        "recall": [recall]
    })
    return metric_df

# Display the results
print(display_results("top-3 eval", eval_results))

In this example, the RetrieverEvaluator is built with the full metric list ('hit_rate', 'mrr', 'precision', 'recall', 'ap', 'ndcg'), and display_results averages each metric over the evaluated queries.

This setup ensures that you can evaluate retrieval with precision and recall, along with other metrics like hit rate and MRR [1][2].

adityamity commented 6 days ago

@dosu - in RAGChecker, how can we create gt_answer and user_query from the nodes? I want to run RAGChecker over a set of user_query and gt_answer pairs and get the response_object for the whole set.

Also, do we need credentials to access the AWS Bedrock models below, or can we use a model available in Azure services instead of these two?

extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
checker_name="bedrock/meta.llama3-70b-instruct-v1:0"
dosubot[bot] commented 6 days ago

To create gt_answer and user_query from the nodes to run RAGChecker for a set of user queries and ground truth answers, you can use the KnowledgeGraphRAGRetriever and RetrieverQueryEngine from LlamaIndex. Additionally, you can use Azure OpenAI models instead of the AWS Bedrock models.

Here is an example of how to set up and use Azure OpenAI models with LlamaIndex:

from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import Settings
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import KnowledgeGraphRAGRetriever
from llama_index.core import StorageContext
from llama_index.graph_stores.nebula import NebulaGraphStore

# Set up Azure OpenAI
api_key = "<api-key>"
azure_endpoint = "https://<your-resource-name>.openai.azure.com/"
api_version = "2023-07-01-preview"

llm = AzureOpenAI(
    model="gpt-35-turbo-16k",
    deployment_name="my-custom-llm",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="my-custom-embedding",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)

Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512

# Set up NebulaGraphStore
space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], ["relationship"]
tags = ["entity"]

graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)

# Set up KnowledgeGraphRAGRetriever
graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    verbose=True,
)

# Set up RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(
    graph_rag_retriever,
)

# Example user query
user_query = "Tell me about Peter Quill?"

# Perform the query
response = query_engine.query(user_query)

# Display the response
from IPython.display import display, Markdown
display(Markdown(f"<b>{response}</b>"))

# Ground truth answer (example)
gt_answer = "Peter Quill, also known as Star-Lord, is a fictional character and superhero in the Marvel Cinematic Universe."

# Use the response and ground truth answer for RAGChecker

This setup allows you to use Azure OpenAI models for your LLM and embedding needs, and you can perform queries using the RetrieverQueryEngine with KnowledgeGraphRAGRetriever. You can then use the responses and ground truth answers to run RAGChecker.
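
As a sketch of scaling this to a set of queries (qa_pairs below is a hypothetical list of (query, ground-truth answer) tuples you prepare yourself; response_to_rag_results and RAGResults are reused from the RAGChecker example above):

from ragchecker import RAGResults
from ragchecker.integrations.llama_index import response_to_rag_results

# Hypothetical set of (query, ground-truth answer) pairs that you prepare yourself
qa_pairs = [
    (user_query, gt_answer),
    # add more (query, ground-truth answer) pairs here
]

rag_result_list = []
for query_str, answer_str in qa_pairs:
    # Query the engine and convert each response into RAGChecker's result format
    response_object = query_engine.query(query_str)
    rag_result_list.append(
        response_to_rag_results(
            query=query_str,
            gt_answer=answer_str,
            response_object=response_object,
        )
    )

# Collect everything into a single RAGResults object for evaluation
rag_results = RAGResults.from_dict({"results": rag_result_list})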

Additionally, you can optimize your prompts for better RAG performance by following the "Optimization by Prompting" approach. This involves using a meta-prompt to generate better instruction prefixes for your QA prompt template, and evaluating the correctness of the generated answers against a ground-truth dataset. This process can be automated and iteratively improved using LlamaIndex's tools and OpenAI models [1][2].

adityamity commented 6 days ago

@dosu - what does generate_question_context_pairs contain, and how does RetrieverEvaluator use it for evaluating precision and recall?

dosubot[bot] commented 6 days ago

The generate_question_context_pairs function generates a set of (question, context) pairs from a given unstructured text corpus using a language model (LLM). It returns an EmbeddingQAFinetuneDataset object containing queries, relevant_docs, and corpus.

Here's a brief overview of the function and its usage:

generate_question_context_pairs Function

This function generates question-context pairs from a list of nodes (text chunks). It uses an LLM to auto-generate questions from each context chunk.

from llama_index.core.evaluation import generate_question_context_pairs, EmbeddingQAFinetuneDataset

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

The EmbeddingQAFinetuneDataset object contains: queries (a mapping from query ID to the generated question), corpus (a mapping from node ID to the node's text, i.e. the context), and relevant_docs (a mapping from query ID to the IDs of the relevant context nodes).

RetrieverEvaluator Class

The RetrieverEvaluator class uses the generated (question, context) pairs to evaluate retrieval performance. It calculates metrics such as precision and recall by comparing the retrieved results against the ground-truth context.

Precision Calculation

The Precision class calculates precision as follows:

  1. Convert retrieved_ids and expected_ids to sets.
  2. Compute the intersection of these sets.
  3. Divide the size of the intersection by the size of the retrieved_set.
retrieved_set = set(retrieved_ids)
expected_set = set(expected_ids)
precision = len(retrieved_set & expected_set) / len(retrieved_set)

Recall Calculation

The Recall class calculates recall as follows:

  1. Convert retrieved_ids and expected_ids to sets.
  2. Compute the intersection of these sets.
  3. Divide the size of the intersection by the size of the expected_set.
retrieved_set = set(retrieved_ids)
expected_set = set(expected_ids)
recall = len(retrieved_set & expected_set) / len(expected_set)

Example Usage

Here is an example of how to use RetrieverEvaluator to evaluate precision and recall:

from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

# Evaluation on the entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

The aevaluate_dataset method performs batch mode evaluation, which is more efficient than evaluating each query separately [1][2][3].

adityamity commented 6 days ago

@dosu: Looking at the Precision class, I want to understand where it takes expected_ids and retrieved_ids from. From where, and how, should I get these expected_ids and retrieved_ids?


class Precision(BaseRetrievalMetric):
    """Precision metric.

    The `K`-value in `Precision@K` usually corresponds to `top_k` of the retriever.

    Attributes:
        metric_name (str): The name of the metric.
    """

    metric_name: ClassVar[str] = "precision"

    def compute(
        self,
        query: Optional[str] = None,
        expected_ids: Optional[List[str]] = None,
        retrieved_ids: Optional[List[str]] = None,
        expected_texts: Optional[List[str]] = None,
        retrieved_texts: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> RetrievalMetricResult:
        """Compute precision based on the provided inputs and selected method.

        Parameters:
            query (Optional[str]): The query string (not used in the current implementation).
            expected_ids (Optional[List[str]]): Expected document IDs.
            retrieved_ids (Optional[List[str]]): Retrieved document IDs.
            expected_texts (Optional[List[str]]): Expected texts (not used in the current implementation).
            retrieved_texts (Optional[List[str]]): Retrieved texts (not used in the current implementation).

        Raises:
            ValueError: If the necessary IDs are not provided.

        Returns:
            RetrievalMetricResult: The result with the computed precision score.
        """
        # Checking for the required arguments
        if (
            retrieved_ids is None
            or expected_ids is None
            or not retrieved_ids
            or not expected_ids
        ):
            raise ValueError("Retrieved ids and expected ids must be provided")

        retrieved_set = set(retrieved_ids)
        expected_set = set(expected_ids)
        precision = len(retrieved_set & expected_set) / len(retrieved_set)

        return RetrievalMetricResult(score=precision)
dosubot[bot] commented 6 days ago

The Precision class in LlamaIndex obtains expected_ids and retrieved_ids as parameters to its compute method. These IDs are typically provided as input to the test cases, as seen in the test_precision function.

You should get these expected_ids and retrieved_ids from the data you are evaluating. The expected_ids are the ground truth or the correct set of IDs that you expect to retrieve, while the retrieved_ids are the set of IDs that your retrieval system actually retrieved.

Here is an example of how expected_ids and retrieved_ids are used in the test cases:

@pytest.mark.parametrize(
    ("expected_ids", "retrieved_ids", "expected_result"),
    [
        (["id1", "id2", "id3"], ["id3", "id1", "id2", "id4"], 3 / 4),
        (["id1", "id2", "id3", "id4"], ["id5", "id1"], 1 / 2),
        (["id1", "id2"], ["id3", "id4"], 0 / 2),
        (["id1", "id2"], ["id2", "id1", "id7"], 2 / 3),
        (["id1", "id2", "id3"], ["id3", "id1", "id2", "id4"], 3 / 4),
        (["id1", "id2", "id3", "id4"], ["id1", "id2", "id5"], 2 / 3),
        (["id1", "id2"], ["id1", "id7", "id15", "id2"], 2 / 4),
    ],
)
def test_precision(expected_ids, retrieved_ids, expected_result):
    prec = Precision()
    result = prec.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == pytest.approx(expected_result)

In this example, expected_ids and retrieved_ids are provided as parameters to the test_precision function, which then uses them to compute the precision score.

The Precision class itself ensures that both expected_ids and retrieved_ids are provided and not empty before performing the computation. If these IDs are not provided, it raises a ValueError. The precision is then calculated as the ratio of the intersection of retrieved_ids and expected_ids to the total number of retrieved_ids [1][2].
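
As a minimal usage sketch (the ids below are made up for illustration; the classes live in the llama_index.core.evaluation.retrieval.metrics module mentioned earlier):

from llama_index.core.evaluation.retrieval.metrics import Precision, Recall

# Hypothetical ids for illustration
expected_ids = ["node_1", "node_2", "node_3"]   # ground-truth node ids for a query
retrieved_ids = ["node_2", "node_5"]            # ids your retriever actually returned

precision = Precision().compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
recall = Recall().compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)

print(precision.score)  # 1/2 = 0.5: one of the two retrieved ids is relevant
print(recall.score)     # 1/3 ≈ 0.33: one of the three expected ids was retrieved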

adityamity commented 6 days ago

@dosu How can I create data with expected_ids (ground truth) and retrieved_ids? Do I need to use nodes to create this? Can you provide code to prepare data with expected_ids (ground truth) and retrieved_ids using a bunch of sample data?

dosubot[bot] commented 6 days ago

To create a dataset with expected_ids (ground truth) and retrieved_ids, you can use the following code. This code prepares data with expected_ids and retrieved_ids using sample data:

import re
import uuid
from typing import List, Tuple

from llama_index.legacy.evaluation.retrieval.base import RetrievalEvalMode, BaseRetrievalEvaluator
from llama_index.core.evaluation import LabelledQADataset, generate_question_context_pairs

QUERY_STR_TEMPLATE = "How can I sign a {symbol}?"

def asl_create_labelled_retrieval_dataset(
    reg_ex, nodes, mode
) -> LabelledQADataset:
    """Returns a QALabelledDataset that provides the expected node IDs
    for every query.

    NOTE: this is specific to the ASL use-case.
    """
    queries = {}
    relevant_docs = {}
    for node in nodes:
        # find the letter associated with the image/text node
        if mode == "image":
            string_to_search = node.metadata["file_path"]
        elif mode == "text":
            string_to_search = node.text
        else:
            raise ValueError(
                "Unsupported mode. Please enter 'image' or 'text'."
            )
        match = re.search(reg_ex, string_to_search)
        if match:
            # build the query
            query = QUERY_STR_TEMPLATE.format(symbol=match.group(1))
            id_ = str(uuid.uuid4())
            # store the query and expected ids pair
            queries[id_] = query
            relevant_docs[id_] = [node.id_]

    return LabelledQADataset(
        queries=queries, relevant_docs=relevant_docs, corpus={}, mode=mode
    )

# Sample data
image_nodes = [
    # Add your image nodes here
]

text_nodes = [
    # Add your text nodes here
]

# labelled dataset for image retrieval
qa_dataset_image = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+).jpg)", image_nodes, "image"
)

# labelled dataset for text retrieval
qa_dataset_text = asl_create_labelled_retrieval_dataset(
    r"(?:To sign ([A-Z]+) in ASL:)", text_nodes, "text"
)

# labelled dataset for text-desc
qa_dataset_text_desc = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+).jpg)", image_nodes, "image"
)

# Using RetrieverEvaluator
class CustomRetrieverEvaluator(BaseRetrievalEvaluator):
    async def _aget_retrieved_ids_and_texts(
        self, query: str, mode: RetrievalEvalMode = RetrievalEvalMode.TEXT
    ) -> Tuple[List[str], List[str]]:
        # Implement your retrieval logic here
        retrieved_ids = ["sample_id1", "sample_id2"]
        retrieved_texts = ["sample_text1", "sample_text2"]
        return retrieved_ids, retrieved_texts

retriever_evaluator = CustomRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"]
)

# Evaluate a single query
retrieval_result = retriever_evaluator.evaluate(
    query="How can I sign a B?", expected_ids=["node_id1", "node_id2"]
)
print(retrieval_result)

# Generate a synthetic dataset (llm here is your LLM instance, as used earlier in the thread)
qa_dataset = generate_question_context_pairs(
    nodes=text_nodes, llm=llm, num_questions_per_chunk=2
)

# Evaluate the dataset in batch mode
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

This code defines a function asl_create_labelled_retrieval_dataset that creates a LabelledQADataset object with expected_ids (ground truth) and retrieved_ids using sample data. You can replace the sample data with your actual data. Additionally, it shows how to use RetrieverEvaluator to evaluate the dataset. The CustomRetrieverEvaluator class is a subclass of BaseRetrievalEvaluator and implements the _aget_retrieved_ids_and_texts method to provide custom retrieval logic [1][2][3].

adityamity commented 5 days ago

@dosu How would I know that the evaluation is working fine on multi-context queries? Are there any checks present? Please help me with a step-by-step process for evaluating precision, recall, and other metrics, given that I have a retriever, nodes, and vector_index available.

dosubot[bot] commented 5 days ago

To ensure that the evaluation is working correctly on multi-context queries and to evaluate precision, recall, and other metrics, you can follow these steps using the RetrieverEvaluator class from LlamaIndex. Here is a step-by-step process:

Step 1: Load and Parse Data

First, load and parse your data into nodes.

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter

# Load and parse data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# Set consistent node IDs
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

Step 2: Initialize LLM and Vector Index

Initialize your language model (LLM) and create a vector index.

from llama_index.llms.openai import OpenAI

# Initialize LLM
llm = OpenAI(model="gpt-4")

# Create vector index
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)

Step 3: Generate Question-Context Pairs

Generate question-context pairs using the generate_question_context_pairs function.

from llama_index.core.evaluation import generate_question_context_pairs

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

Step 4: Define Metrics and Initialize Evaluator

Define the metrics you want to evaluate and initialize the RetrieverEvaluator.

from llama_index.core.evaluation import RetrieverEvaluator

# Define metrics
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)

Step 5: Evaluate a Sample Query

Evaluate a sample query to ensure the setup is correct.

sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]

eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

Step 6: Evaluate the Entire Dataset

Evaluate the entire dataset to get comprehensive metrics.

import pandas as pd

# Evaluate the dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

# Display results
def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)
    columns = {
        "retrievers": [name],
        **{k: [full_df[k].mean()] for k in metrics},
    }

    metric_df = pd.DataFrame(columns)
    return metric_df

print(display_results("top-2 eval", eval_results))

Step 7: Ensure Evaluation is Working Correctly

To ensure the evaluation is working correctly, check the following:

  1. Metrics Calculation: Verify that the metrics (e.g., precision, recall) are correctly implemented and return expected values.
  2. Post-Processing: If using node post-processors, ensure they are correctly modifying the retrieved nodes.
  3. Retrieved Nodes: Check that the retrieved nodes (both IDs and texts) match the expected results for the given queries (see the sketch below).
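
A minimal sketch of such a spot check (assuming the retriever and qa_dataset from the steps above):

# Spot-check one query: compare retrieved node ids against the expected ids
sample_id, sample_query = list(qa_dataset.queries.items())[0]
expected_ids = set(qa_dataset.relevant_docs[sample_id])

retrieved_nodes = retriever.retrieve(sample_query)
retrieved_ids = {node.node_id for node in retrieved_nodes}

print("expected :", expected_ids)
print("retrieved:", retrieved_ids)
print("overlap  :", expected_ids & retrieved_ids)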

Example Code

Here is the complete example code:

import nest_asyncio
import logging
import sys
import os
import pandas as pd

from llama_index.core.evaluation import (
    generate_question_context_pairs,
    RetrieverEvaluator
)
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI

# Attach to the same event-loop
nest_asyncio.apply()

# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO)  # Set logger level to INFO

# Clear out any existing handlers
logger.handlers = []
# Set up the StreamHandler to output to sys.stdout
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)  # Set handler level to INFO

# Add the handler to the logger
logger.addHandler(handler)

# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-..."

# Load and parse data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)

# Set consistent node IDs
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"

# Initialize LLM and vector index
llm = OpenAI(model="gpt-4")
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)

# Generate question-context pairs
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

# Define metrics
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)

# Evaluate a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)

# Evaluate the entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

# Display results
def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    full_df = pd.DataFrame(metric_dicts)
    columns = {
        "retrievers": [name],
        **{k: [full_df[k].mean()] for k in metrics},
    }

    metric_df = pd.DataFrame(columns)
    return metric_df

print(display_results("top-2 eval", eval_results))

This setup ensures that you can evaluate precision, recall, and other metrics for multi-context queries using a retriever, nodes, and vector_index [1][2][3].

adityamity commented 5 days ago

@dosu, how can I save the eval_results as a DataFrame?

dosubot[bot] commented 5 days ago

You can save the eval_results as a DataFrame by first converting the evaluation results into a list of dictionaries and then creating a DataFrame from that list. Here’s how you can do it:

import pandas as pd

# Function to convert eval_results to a DataFrame
def eval_results_to_dataframe(eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)

    # Create a DataFrame from the list of dictionaries
    df = pd.DataFrame(metric_dicts)
    return df

# Convert eval_results to DataFrame
eval_results_df = eval_results_to_dataframe(eval_results)

# Save the DataFrame to a CSV file
eval_results_df.to_csv("eval_results.csv", index=False)

# Optionally, display the DataFrame
print(eval_results_df)

This code defines a function eval_results_to_dataframe that converts the eval_results into a DataFrame. It then saves the DataFrame to a CSV file named eval_results.csv. You can modify the file name and path as needed.

adityamity commented 5 days ago

@dosu I want to convert eval_results to a DataFrame containing columns like query, expected_ids, expected_texts, retrieved_ids, retrieved_texts, and metric_dict.

dosubot[bot] commented 5 days ago

To convert eval_results to a DataFrame with columns like query, expected_ids, expected_texts, retrieved_ids, retrieved_texts, and metric_dict, you can use the following code:

import pandas as pd

# Assuming eval_results is a list of RetrievalEvalResult objects
def convert_eval_results_to_dataframe(eval_results):
    data = []
    for result in eval_results:
        data.append({
            'query': result.query,
            'expected_ids': result.expected_ids,
            'expected_texts': result.expected_texts,
            'retrieved_ids': result.retrieved_ids,
            'retrieved_texts': result.retrieved_texts,
            'metric_dict': result.metric_vals_dict
        })

    df = pd.DataFrame(data)
    return df

# Example usage
# eval_results = [...]  # List of RetrievalEvalResult objects
df = convert_eval_results_to_dataframe(eval_results)
print(df)

# Save the DataFrame to a CSV file
df.to_csv("eval_results.csv", index=False)

This code will create a DataFrame with the specified columns from the eval_results list. The metric_dict column will contain the dictionary of metric values for each evaluation result [1][2].
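
If you prefer one column per metric instead of a nested dict, here is a small pandas-only follow-up sketch (assuming df from above):

# Expand the nested metric_dict column into one column per metric
metrics_df = pd.DataFrame(df["metric_dict"].tolist())
flat_df = pd.concat([df.drop(columns=["metric_dict"]), metrics_df], axis=1)
print(flat_df.head())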

adityamity commented 5 days ago

@dosu: In the above steps, can I generate questions using the RAGAS testset generator and run the llama_index evaluator on the questions generated by RAGAS?

generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Change the resulting question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}

# Use generator.generate_with_llamaindex_docs if you use llama-index as the document loader
testset = generator.generate_with_langchain_docs(documents, 10, distributions) 
testset.to_pandas()
dosubot[bot] commented 5 days ago

Yes, you can generate questions using the RAGAS testset generator and run the LlamaIndex evaluator on the questions generated by RAGAS. Here’s how you can integrate the RAGAS testset generator with LlamaIndex for evaluation:

Step 1: Generate Questions Using RAGAS Testset Generator

First, generate the testset using the RAGAS testset generator.

from ragas.testset.generator import TestsetGenerator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Initialize LLMs and embeddings (TestsetGenerator.from_langchain expects LangChain objects)
generator_llm = ChatOpenAI(model="gpt-4")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")

# Create the testset generator
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)

# Define the question type distribution (ragas uses evolution objects, not strings)
from ragas.testset.evolutions import simple, multi_context, reasoning

distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1,
}

# Generate the testset
documents = [...]  # Your documents here
testset = generator.generate_with_langchain_docs(documents, 10, distributions)
testset_df = testset.to_pandas()
print(testset_df)

Step 2: Prepare the LlamaIndex Evaluator

Next, prepare the LlamaIndex evaluator to evaluate the generated questions.

from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.core import VectorStoreIndex

# Initialize your vector index and retriever
vector_index = VectorStoreIndex.from_documents(documents)
retriever = vector_index.as_retriever(similarity_top_k=2)

# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)

Step 3: Evaluate the Generated Questions

Evaluate the generated questions using the LlamaIndex evaluator.

import pandas as pd

# Convert the testset to a format compatible with LlamaIndex
queries = testset_df['question'].tolist()
expected_ids = testset_df['expected_ids'].tolist()
expected_texts = testset_df['expected_texts'].tolist()

# Create a list to store evaluation results
eval_results = []

# Evaluate each query
for query, exp_ids, exp_texts in zip(queries, expected_ids, expected_texts):
    result = retriever_evaluator.evaluate(query, exp_ids)
    eval_results.append({
        'query': query,
        'expected_ids': exp_ids,
        'expected_texts': exp_texts,
        'retrieved_ids': result.retrieved_ids,
        'retrieved_texts': result.retrieved_texts,
        'metric_dict': result.metric_vals_dict
    })

# Convert evaluation results to a DataFrame
eval_results_df = pd.DataFrame(eval_results)
print(eval_results_df)

# Save the DataFrame to a CSV file
eval_results_df.to_csv("eval_results.csv", index=False)

This setup allows you to generate questions using the RAGAS testset generator and evaluate them using the LlamaIndex evaluator, capturing metrics such as precision, recall, and others [1][2][3][4].

adityamity commented 5 days ago

@dosu which version of llama-index includes metrics like precision , recall. I am currently using Version: 0.10.13.post1 of llama-index but when i ran the code:


from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

i got the error as:

ValueError: Invalid metric name: precision

dosubot[bot] commented 5 days ago

The version of llama-index that includes metrics like precision and recall is 0.10.46. Since you are using Version 0.10.13.post1 and encountering a ValueError: Invalid metric name: precision, it indicates that this version does not include the precision and recall metrics. You need to upgrade to version 0.10.46 or later to access these metrics [1][2].
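
As a hedged example (package names taken from this thread; adjust or pin versions to suit your environment), an upgrade from a notebook could look like:

%pip install -U llama-index llama-index-core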

adityamity commented 1 day ago

@dosu in the above code where you created testset_df using the RAGAS TestsetGenerator, I see an issue in the lines of code you provided below:

# Convert the testset to a format compatible with LlamaIndex
queries = testset_df['question'].tolist()
expected_ids = testset_df['expected_ids'].tolist()
expected_texts = testset_df['expected_texts'].tolist()

The testset_df created by the RAGAS TestsetGenerator does not have 'expected_ids' and 'expected_texts'. It contains only 'question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', and 'episode_done'.

dosubot[bot] commented 1 day ago

To convert the test set generated by RAGAS TestSetGenerator to a format compatible with LlamaIndex for evaluation, you need to map the fields from the RAGAS test set to the fields expected by LlamaIndex's LabelledEvaluatorDataExample. Specifically, you need to extract the expected_ids and expected_texts from the RAGAS test set.

Here is a Python function that performs this conversion:

from typing import List, Dict

def convert_ragas_to_llamaindex(ragas_testset: List[Dict]) -> List[Dict]:
    llamaindex_testset = []

    for item in ragas_testset:
        query = item['question']
        expected_ids = [context['id'] for context in item['contexts']]
        expected_texts = [context['text'] for context in item['contexts']]

        llamaindex_testset.append({
            'query': query,
            'expected_ids': expected_ids,
            'expected_texts': expected_texts,
            'retrieved_ids': [],  # This will be filled during evaluation
            'retrieved_texts': [],  # This will be filled during evaluation
            'mode': 'text',  # Assuming text mode
            'metric_dict': {}  # This will be filled during evaluation
        })

    return llamaindex_testset

# Example usage
ragas_testset = [
    {
        'question': 'What is the capital of France?',
        'contexts': [{'id': '1', 'text': 'Paris is the capital of France.'}],
        'ground_truth': 'Paris',
        'evolution_type': 'static',
        'metadata': {},
        'episode_done': True
    }
]

llamaindex_testset = convert_ragas_to_llamaindex(ragas_testset)
print(llamaindex_testset)

This function assumes that each context in the RAGAS test set has an id and text field. The expected_ids and expected_texts are extracted from these contexts and used to create the LlamaIndex-compatible format. The retrieved_ids, retrieved_texts, and metric_dict fields are initialized as empty and will be populated during the evaluation process. The mode is set to 'text' as per the LabelledEvaluatorDataExample class in LlamaIndex [1][2][3].

You can then use this converted test set with the RetrieverEvaluator for evaluation:

from llama_index.core.evaluation import RetrieverEvaluator
import pandas as pd

# Initialize your vector index and retriever
vector_index = VectorStoreIndex.from_documents(documents)
retriever = vector_index.as_retriever(similarity_top_k=2)

# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)

# Evaluate the dataset
eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset)

# Convert eval_results to DataFrame
def convert_eval_results_to_dataframe(eval_results):
    data = []
    for result in eval_results:
        data.append({
            'query': result.query,
            'expected_ids': result.expected_ids,
            'expected_texts': result.expected_texts,
            'retrieved_ids': result.retrieved_ids,
            'retrieved_texts': result.retrieved_texts,
            'metric_dict': result.metric_vals_dict
        })

    df = pd.DataFrame(data)
    return df

# Example usage
df = convert_eval_results_to_dataframe(eval_results)
print(df)

# Save the DataFrame to a CSV file
df.to_csv("eval_results.csv", index=False)

This will allow you to evaluate the questions generated by RAGAS using the LlamaIndex evaluator and save the results in a DataFrame.

adityamity commented 1 day ago

@dosu the contexts column created by RAGAS does not have an 'id'; it only contains a list of answers.

dosubot[bot] commented 1 day ago

To convert the RAGAS test set, which contains a contexts column with a list of answers but no id, to a format compatible with LlamaIndex for evaluation, you can generate unique IDs for each context. Here’s how you can do it:

Step 1: Convert the RAGAS Test Set

First, convert the RAGAS test set to include unique IDs for each context.

import pandas as pd
import uuid
from typing import Dict, List

def convert_ragas_to_llamaindex(ragas_testset: pd.DataFrame) -> List[Dict]:
    llamaindex_testset = []

    for _, row in ragas_testset.iterrows():
        query = row['question']
        contexts = row['contexts']
        expected_ids = [str(uuid.uuid4()) for _ in contexts]
        expected_texts = contexts

        llamaindex_testset.append({
            'query': query,
            'expected_ids': expected_ids,
            'expected_texts': expected_texts,
            'retrieved_ids': [],  # This will be filled during evaluation
            'retrieved_texts': [],  # This will be filled during evaluation
            'mode': 'text',  # Assuming text mode
            'metric_dict': {}  # This will be filled during evaluation
        })

    return llamaindex_testset

# Example usage
ragas_testset = pd.DataFrame({
    'question': ['What is the capital of France?'],
    'contexts': [['Paris is the capital of France.']],
    'ground_truth': ['Paris'],
    'evolution_type': ['static'],
    'metadata': [{}],
    'episode_done': [True]
})

llamaindex_testset = convert_ragas_to_llamaindex(ragas_testset)
print(llamaindex_testset)

Step 2: Prepare the LlamaIndex Evaluator

Next, prepare the LlamaIndex evaluator to evaluate the generated questions.

from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.core import VectorStoreIndex

# Initialize your vector index and retriever
documents = [...]  # Your documents here
vector_index = VectorStoreIndex.from_documents(documents)
retriever = vector_index.as_retriever(similarity_top_k=2)

# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)

Step 3: Evaluate the Generated Questions

Evaluate the generated questions using the LlamaIndex evaluator.

import pandas as pd

# Evaluate the dataset
eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset)

# Convert eval_results to DataFrame
def convert_eval_results_to_dataframe(eval_results):
    data = []
    for result in eval_results:
        data.append({
            'query': result.query,
            'expected_ids': result.expected_ids,
            'expected_texts': result.expected_texts,
            'retrieved_ids': result.retrieved_ids,
            'retrieved_texts': result.retrieved_texts,
            'metric_dict': result.metric_vals_dict
        })

    df = pd.DataFrame(data)
    return df

# Example usage
df = convert_eval_results_to_dataframe(eval_results)
print(df)

# Save the DataFrame to a CSV file
df.to_csv("eval_results.csv", index=False)

This setup allows you to convert the RAGAS test set to a format compatible with LlamaIndex, evaluate it using the LlamaIndex evaluator, and save the results in a DataFrame [1][2][3].

adityamity commented 1 day ago

@dosu I am getting the error below when running eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset):

File /anaconda/lib/python3.10/site-packages/llama_index/core/evaluation/retrieval/base.py:188, in BaseRetrievalEvaluator.aevaluate_dataset(self, dataset, workers, show_progress, **kwargs)

AttributeError: 'list' object has no attribute 'mode'

dosubot[bot] commented 1 day ago

To resolve the AttributeError: 'list' object has no attribute 'mode' when running eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset), ensure that the object passed to aevaluate_dataset is an EmbeddingQAFinetuneDataset rather than a plain list; the dataset class carries the mode attribute the evaluator expects.

Here's a snippet to illustrate the correct usage:

from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

# Rebuild llamaindex_testset as an EmbeddingQAFinetuneDataset instead of a plain list,
# using id-keyed dicts for queries, corpus, and relevant_docs
queries = {}
corpus = {}
relevant_docs = {}
for i, item in enumerate(llamaindex_testset):
    query_id = f"query_{i}"
    queries[query_id] = item['query']
    relevant_docs[query_id] = item['expected_ids']
    for node_id, text in zip(item['expected_ids'], item['expected_texts']):
        corpus[node_id] = text

llamaindex_testset = EmbeddingQAFinetuneDataset(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    mode="text",  # or "image", depending on your dataset
)

# Now run the evaluation
eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset)

Make sure that llamaindex_testset is properly initialized as an EmbeddingQAFinetuneDataset with its mode set to either "text" or "image" [1][2].
