run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: Evaluating RAG retrieval #16034

Open adityamity opened 4 days ago

adityamity commented 4 days ago


Question

I want to evaluate the precision and recall of my RAG application built on LlamaIndex. I am using the code below for evaluation:

from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
) 

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm_model, num_questions_per_chunk=4
)

from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In eval_results I see the columns 'query', 'expected_ids', 'expected_texts', 'retrieved_ids', 'retrieved_texts', and 'metric_dict'. I want to know where the evaluator gets the 'expected_ids', 'retrieved_ids', and 'retrieved_texts' from.

Does 'expected_ids' here signify the node_id?

dosubot[bot] commented 4 days ago

Yes, the 'expected_ids' in eval_results are node IDs. The evaluator gets 'expected_ids' (and 'expected_texts') from the QA dataset you pass in, while 'retrieved_ids' and 'retrieved_texts' come from the nodes the retriever returns for each query at evaluation time.
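For illustration, a minimal sketch of where the two sets of IDs come from, assuming the qa_dataset and retriever objects from your snippet:

# expected_ids for a query come from the dataset's relevant_docs mapping
# (the node IDs of the chunks the question was generated from).
query_id, query = next(iter(qa_dataset.queries.items()))
print("expected_ids:", qa_dataset.relevant_docs[query_id])

# retrieved_ids are whatever node IDs the retriever returns for that query.
retrieved_nodes = retriever.retrieve(query)
print("retrieved_ids:", [n.node.node_id for n in retrieved_nodes])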

Here is the relevant code snippet:

def evaluate(
    self,
    query: str,
    expected_ids: List[str],
    expected_texts: Optional[List[str]] = None,
    mode: RetrievalEvalMode = RetrievalEvalMode.TEXT,
    **kwargs: Any,
) -> RetrievalEvalResult:
    """Run evaluation results with query string and expected ids.

    Args:
        query (str): Query string
        expected_ids (List[str]): Expected ids

    Returns:
        RetrievalEvalResult: Evaluation result

    """
    return asyncio.run(
        self.aevaluate(
            query=query,
            expected_ids=expected_ids,
            expected_texts=expected_texts,
            mode=mode,
            **kwargs,
        )
    )

async def aevaluate(
    self,
    query: str,
    expected_ids: List[str],
    expected_texts: Optional[List[str]] = None,
    mode: RetrievalEvalMode = RetrievalEvalMode.TEXT,
    **kwargs: Any,
) -> RetrievalEvalResult:
    """Run evaluation with query string, retrieved contexts,
    and generated response string.

    Subclasses can override this method to provide custom evaluation logic and
    take in additional arguments.
    """
    retrieved_ids, retrieved_texts = await self._aget_retrieved_ids_and_texts(
        query, mode
    )
    metric_dict = {}
    for metric in self.metrics:
        eval_result = metric.compute(
            query, expected_ids, retrieved_ids, expected_texts, retrieved_texts
        )
        metric_dict[metric.metric_name] = eval_result

    return RetrievalEvalResult(
        query=query,
        expected_ids=expected_ids,
        expected_texts=expected_texts,
        retrieved_ids=retrieved_ids,
        retrieved_texts=retrieved_texts,
        mode=mode,
        metric_dict=metric_dict,
    )

The RetrievalEvalResult class further clarifies the attributes:

class RetrievalEvalResult(BaseModel):
    """Retrieval eval result.

    NOTE: this abstraction might change in the future.

    Attributes:
        query (str): Query string
        expected_ids (List[str]): Expected ids
        retrieved_ids (List[str]): Retrieved ids
        metric_dict (Dict[str, BaseRetrievalMetric]): \
            Metric dictionary for the evaluation

    """

    model_config = ConfigDict(arbitrary_types_allowed=True)
    query: str = Field(..., description="Query string")
    expected_ids: List[str] = Field(..., description="Expected ids")
    expected_texts: Optional[List[str]] = Field(
        default=None,
        description="Expected texts associated with nodes provided in `expected_ids`",
    )
    retrieved_ids: List[str] = Field(..., description="Retrieved ids")
    retrieved_texts: List[str] = Field(..., description="Retrieved texts")
    mode: "RetrievalEvalMode" = Field(
        default=RetrievalEvalMode.TEXT, description="text or image"
    )
    metric_dict: Dict[str, RetrievalMetricResult] = Field(
        ..., description="Metric dictionary for the evaluation"
    )

    @property
    def metric_vals_dict(self) -> Dict[str, float]:
        """Dictionary of metric values."""
        return {k: v.score for k, v in self.metric_dict.items()}

    def __str__(self) -> str:
        """String representation."""
        return f"Query: {self.query}\n" f"Metrics: {self.metric_vals_dict!s}\n"

[1][2]
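Since metric_vals_dict flattens the scores, aggregating the per-query results is straightforward; a minimal sketch, assuming pandas is available (it is not required by LlamaIndex):

# eval_results is the list of RetrievalEvalResult returned by aevaluate_dataset.
import pandas as pd

metric_df = pd.DataFrame([r.metric_vals_dict for r in eval_results])
print(metric_df.mean())  # mean hit_rate, mrr, precision, recall, ap, ndcg over the dataset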


logan-markewich commented 4 days ago

Yes, expected_ids are node IDs.

The generate pairs function is just generating a question for each given chunk, and querying with that question should retrieve that chunk.

adityamity commented 3 days ago

@dosu how does LlamaIndex evaluate precision and recall using the expected_ids and retrieved_ids? I see the below code for the calculation:

class Precision(BaseRetrievalMetric):
    """Precision metric.

    The `K`-value in `Precision@K` usually corresponds to `top_k` of the retriever.

    Attributes:
        metric_name (str): The name of the metric.
    """

    metric_name: ClassVar[str] = "precision"

    def compute(
        self,
        query: Optional[str] = None,
        expected_ids: Optional[List[str]] = None,
        retrieved_ids: Optional[List[str]] = None,
        expected_texts: Optional[List[str]] = None,
        retrieved_texts: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> RetrievalMetricResult:
        """Compute precision based on the provided inputs and selected method.

        Parameters:
            query (Optional[str]): The query string (not used in the current implementation).
            expected_ids (Optional[List[str]]): Expected document IDs.
            retrieved_ids (Optional[List[str]]): Retrieved document IDs.
            expected_texts (Optional[List[str]]): Expected texts (not used in the current implementation).
            retrieved_texts (Optional[List[str]]): Retrieved texts (not used in the current implementation).

        Raises:
            ValueError: If the necessary IDs are not provided.

        Returns:
            RetrievalMetricResult: The result with the computed precision score.
        """
        # Checking for the required arguments
        if (
            retrieved_ids is None
            or expected_ids is None
            or not retrieved_ids
            or not expected_ids
        ):
            raise ValueError("Retrieved ids and expected ids must be provided")

        retrieved_set = set(retrieved_ids)
        expected_set = set(expected_ids)
        precision = len(retrieved_set & expected_set) / len(retrieved_set)

        return RetrievalMetricResult(score=precision)

Are the retrieved_ids extracted using the RAG query_engine? If yes, how can we improve the score? I see that for many queries the number of retrieved_ids is much larger than the number of expected_ids.

dosubot[bot] commented 3 days ago

LlamaIndex evaluates precision and recall using the expected_ids and retrieved_ids as follows:

Precision

Precision is calculated as the ratio of the number of relevant documents retrieved to the total number of documents retrieved. The formula used is:

precision = len(set(expected_ids) & set(retrieved_ids)) / len(retrieved_ids)

Recall

Recall is calculated as the ratio of the number of relevant documents retrieved to the total number of relevant documents. The formula used is:

recall = len(set(expected_ids) & set(retrieved_ids)) / len(expected_ids)

Example Code

Here is test code that exercises the precision and recall metrics:

import pytest

# Import path assumed for recent llama-index-core versions; it may differ in older releases.
from llama_index.core.evaluation.retrieval.metrics import Precision, Recall

def test_precision(expected_ids, retrieved_ids, expected_result):
    prec = Precision()
    result = prec.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == pytest.approx(expected_result)

def test_recall(expected_ids, retrieved_ids, expected_result):
    recall = Recall()
    result = recall.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == pytest.approx(expected_result)
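The metric objects can also be called directly; an illustrative example with made-up node IDs (assuming the Precision and Recall imports above):

# Illustrative only: one relevant node, four retrieved nodes.
prec = Precision().compute(expected_ids=["n1"], retrieved_ids=["n1", "n2", "n3", "n4"])
rec = Recall().compute(expected_ids=["n1"], retrieved_ids=["n1", "n2", "n3", "n4"])
print(prec.score, rec.score)  # 0.25 1.0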

Improving the Score with RAG query_engine

To improve the precision and recall scores when using the RAG query_engine, you can:

  1. Enhance the Quality of the Query: Ensure that the query is well-formed and specific to retrieve more relevant documents.
  2. Optimize the Retrieval Algorithm: Fine-tune the retrieval algorithm parameters to better match the expected results.
  3. Increase the Size of the Knowledge Base: A larger and more comprehensive knowledge base can improve the chances of retrieving relevant documents.
  4. Use Feedback Loops: Implement feedback mechanisms to iteratively improve the retrieval process based on previous results.
  5. Utilize Additional Metrics: Incorporate other metrics such as Hit Rate, Mean Reciprocal Rank (MRR), and Average Precision (AP) to get a more comprehensive evaluation and identify areas for improvement.

By focusing on these areas, you can potentially improve the precision and recall scores of the retrieved documents [1][2].
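For point 2, a minimal sketch of such tuning, reusing index, metrics, and qa_dataset from the earlier snippets (these names are assumptions about your setup):

# Hedged sketch: sweep similarity_top_k and compare aggregate retrieval metrics.
import pandas as pd
from llama_index.core.evaluation import RetrieverEvaluator

rows = []
for k in (2, 5, 10):
    retriever_k = index.as_retriever(similarity_top_k=k)
    evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever_k)
    results = await evaluator.aevaluate_dataset(qa_dataset)
    per_query = pd.DataFrame([r.metric_vals_dict for r in results])
    rows.append({"similarity_top_k": k, **per_query.mean().to_dict()})

print(pd.DataFrame(rows))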


adityamity commented 3 days ago

@dosu How can we limit the number of retrieved_ids? If we have more retrieved_ids, will it affect the precision and recall scores?

dosubot[bot] commented 3 days ago

To limit the number of retrieved_ids, you can adjust the top-k setting of your retriever (for LlamaIndex vector retrievers this is the similarity_top_k parameter). It controls the maximum number of nodes retrieved for each query.

Here's an example of how to set it when building the retriever from a vector index:

retriever = index.as_retriever(similarity_top_k=10)  # retrieve at most 10 nodes per query
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

Adjust the similarity_top_k value to the number of nodes you want to retrieve. This will limit the number of retrieved_ids accordingly.


adityamity commented 3 days ago

@dosu if we have more retrieved_ids, will it affect the precision and recall scores?

dosubot[bot] commented 3 days ago

Yes, having more retrieved_ids can affect both precision and recall scores.

Precision

Precision is calculated as the number of relevant retrieved documents divided by the total number of retrieved documents. If you retrieve more documents, the precision can decrease if the additional documents are not relevant.

Recall

Recall is calculated as the number of relevant retrieved documents divided by the total number of relevant documents. If you retrieve more documents, the recall can increase if the additional documents include more relevant documents.

Example

For example, if expected_ids = ["n1"] and retrieved_ids = ["n1", "n2", "n3", "n4"], precision is 1/4 = 0.25 while recall is 1/1 = 1.0; retrieving only ["n1", "n2"] instead would raise precision to 0.5 without changing recall.

Thus, increasing the number of retrieved_ids can decrease precision if the additional documents are not relevant, and it can increase recall if the additional documents include more relevant documents [1].


adityamity commented 3 days ago

@dosu why is the length of my retrieved_ids always 20? I see 20 node_ids in every retrieved_ids list, however I have not specified this number anywhere.

dosubot[bot] commented 3 days ago

The length of your retrieved_ids is always 20 because the retrieve method in the RecursiveRetriever class calls the _retrieve_rec method, which in turn calls the retrieve method of the underlying retriever(s) it wraps. How many nodes come back is determined by how those underlying retrievers are configured (for a vector retriever, its similarity_top_k), not by the evaluator, so the 20 is coming from your retriever's own settings.

Here is the relevant code snippet from the RecursiveRetriever class:

def _retrieve_rec(
    self,
    query_bundle: QueryBundle,
    query_id: Optional[str] = None,
    cur_similarity: Optional[float] = None,
) -> Tuple[List[NodeWithScore], List[NodeWithScore]]:
    """Query recursively."""
    if self._verbose:
        print_text(
            f"Retrieving with query id {query_id}: {query_bundle.query_str}\n",
            color="blue",
        )
    query_id = query_id or self._root_id
    cur_similarity = cur_similarity or 1.0

    obj = self._get_object(query_id)
    if isinstance(obj, BaseNode):
        nodes_to_add = [NodeWithScore(node=obj, score=cur_similarity)]
        additional_nodes: List[NodeWithScore] = []
    elif isinstance(obj, BaseRetriever):
        with self.callback_manager.event(
            CBEventType.RETRIEVE,
            payload={EventPayload.QUERY_STR: query_bundle.query_str},
        ) as event:
            nodes = obj.retrieve(query_bundle)
            event.on_end(payload={EventPayload.NODES: nodes})

        nodes_to_add, additional_nodes = self._query_retrieved_nodes(
            query_bundle, nodes
        )

    elif isinstance(obj, BaseQueryEngine):
        sub_resp = obj.query(query_bundle)
        if self._verbose:
            print_text(
                f"Got response: {sub_resp!s}\n",
                color="green",
            )
        # format with both the query and the response
        node_text = self._query_response_tmpl.format(
            query_str=query_bundle.query_str, response=str(sub_resp)
        )
        node = TextNode(text=node_text)
        nodes_to_add = [NodeWithScore(node=node, score=cur_similarity)]
        additional_nodes = sub_resp.source_nodes
    else:
        raise ValueError("Must be a retriever or query engine.")

    return nodes_to_add, additional_nodes

In other words, the 20 nodes are simply what the wrapped retriever(s) return, most likely from a similarity_top_k (or equivalent) setting on the retriever you passed in, possibly plus nodes merged in during recursive retrieval. To change this behavior, adjust that retriever's configuration rather than modifying BaseRetriever itself [1].
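A hedged way to check and change this, assuming your RecursiveRetriever wraps a vector index retriever (vector_retriever and index below are assumptions about your setup):

# Inspect the configured top-k on the wrapped vector retriever, if it exposes one.
print(getattr(vector_retriever, "similarity_top_k", None))  # e.g. 20

# Rebuild it with a smaller cap and re-run the evaluation.
vector_retriever = index.as_retriever(similarity_top_k=5)
print(len(vector_retriever.retrieve("test query")))  # at most 5 nodes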


adityamity commented 3 days ago

@dosu Below is the code I use for evaluating ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]:

from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
) 

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm_model, num_questions_per_chunk=4
)

from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr"]#, "precision", "recall", "ap", "ndcg"]

retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)

eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)

In the above qa_dataset, how would I ensure that the queries are multi-context queries?

dosubot[bot] commented 3 days ago

To ensure that the queries in the qa_dataset are multi-context queries, you can modify the get_query_context_lists function to include multiple relevant contexts for each query. Here is an updated version of the function:

def get_query_context_lists(
    query_context_pairs: EmbeddingQAFinetuneDataset,
) -> Tuple[List[str], List[List[str]]]:
    queries = []
    relevant_contexts = []

    # 'query_context_pairs' is an object with 'queries', 'corpus', and 'relevant_docs' attributes
    for query_id, query in query_context_pairs.queries.items():
        # Get all relevant document IDs for the current query
        relevant_doc_ids = query_context_pairs.relevant_docs[query_id]
        # Get the relevant contexts using the relevant document IDs
        relevant_context = [query_context_pairs.corpus[doc_id] for doc_id in relevant_doc_ids]
        # Append the query and the relevant contexts to their respective lists
        queries.append(query)
        relevant_contexts.append(relevant_context)

    return queries, relevant_contexts

This modification collects every relevant context for each query, so a query is multi-context whenever its relevant_docs entry lists more than one node. The EmbeddingQAFinetuneDataset class supports this structure by having a relevant_docs attribute that maps each query ID to a list of document IDs, allowing for multiple contexts per query.
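For example, here is a hedged sketch for filtering the generated qa_dataset down to multi-context queries only (field names taken from EmbeddingQAFinetuneDataset as used above):

# Keep only queries whose relevant_docs entry lists more than one node.
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset

multi_ids = {qid for qid, docs in qa_dataset.relevant_docs.items() if len(docs) > 1}
multi_context_dataset = EmbeddingQAFinetuneDataset(
    queries={qid: q for qid, q in qa_dataset.queries.items() if qid in multi_ids},
    corpus=qa_dataset.corpus,
    relevant_docs={qid: qa_dataset.relevant_docs[qid] for qid in multi_ids},
)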

Additionally, you can use the generate_question_context_pairs function from LlamaIndex to generate a set of (question, context) pairs over a given unstructured text corpus. This function uses an LLM to auto-generate questions from each context chunk; note that each generated question is tied to the chunk it was generated from, so multi-context queries have to be constructed or filtered separately, as sketched above.

from llama_index.core.evaluation import generate_question_context_pairs, EmbeddingQAFinetuneDataset

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)

This approach leverages synthetic data generation to create a robust evaluation dataset [1][2][3][4].
