Open adityamity opened 4 days ago
Yes, the 'expected_ids' in the eval_results signify the node_id. The evaluator gets the 'expected_ids', 'retrieved_ids', and 'retrieved_texts' from the following sources:
expected_ids are provided as an argument to the evaluate or aevaluate methods.
retrieved_ids and retrieved_texts are obtained by calling the _aget_retrieved_ids_and_texts method within the aevaluate method.
Here is the relevant code snippet:
def evaluate(
    self,
    query: str,
    expected_ids: List[str],
    expected_texts: Optional[List[str]] = None,
    mode: RetrievalEvalMode = RetrievalEvalMode.TEXT,
    **kwargs: Any,
) -> RetrievalEvalResult:
    """Run evaluation results with query string and expected ids.

    Args:
        query (str): Query string
        expected_ids (List[str]): Expected ids

    Returns:
        RetrievalEvalResult: Evaluation result
    """
    return asyncio.run(
        self.aevaluate(
            query=query,
            expected_ids=expected_ids,
            expected_texts=expected_texts,
            mode=mode,
            **kwargs,
        )
    )

async def aevaluate(
    self,
    query: str,
    expected_ids: List[str],
    expected_texts: Optional[List[str]] = None,
    mode: RetrievalEvalMode = RetrievalEvalMode.TEXT,
    **kwargs: Any,
) -> RetrievalEvalResult:
    """Run evaluation with query string, retrieved contexts,
    and generated response string.

    Subclasses can override this method to provide custom evaluation logic and
    take in additional arguments.
    """
    # retrieved_ids / retrieved_texts come from the retriever, not the caller
    retrieved_ids, retrieved_texts = await self._aget_retrieved_ids_and_texts(
        query, mode
    )
    metric_dict = {}
    for metric in self.metrics:
        eval_result = metric.compute(
            query, expected_ids, retrieved_ids, expected_texts, retrieved_texts
        )
        metric_dict[metric.metric_name] = eval_result

    return RetrievalEvalResult(
        query=query,
        expected_ids=expected_ids,
        expected_texts=expected_texts,
        retrieved_ids=retrieved_ids,
        retrieved_texts=retrieved_texts,
        mode=mode,
        metric_dict=metric_dict,
    )
The RetrievalEvalResult class further clarifies the attributes:
class RetrievalEvalResult(BaseModel):
    """Retrieval eval result.

    NOTE: this abstraction might change in the future.

    Attributes:
        query (str): Query string
        expected_ids (List[str]): Expected ids
        retrieved_ids (List[str]): Retrieved ids
        metric_dict (Dict[str, BaseRetrievalMetric]): \
            Metric dictionary for the evaluation
    """

    model_config = ConfigDict(arbitrary_types_allowed=True)
    query: str = Field(..., description="Query string")
    expected_ids: List[str] = Field(..., description="Expected ids")
    expected_texts: Optional[List[str]] = Field(
        default=None,
        description="Expected texts associated with nodes provided in `expected_ids`",
    )
    retrieved_ids: List[str] = Field(..., description="Retrieved ids")
    retrieved_texts: List[str] = Field(..., description="Retrieved texts")
    mode: "RetrievalEvalMode" = Field(
        default=RetrievalEvalMode.TEXT, description="text or image"
    )
    metric_dict: Dict[str, RetrievalMetricResult] = Field(
        ..., description="Metric dictionary for the evaluation"
    )

    @property
    def metric_vals_dict(self) -> Dict[str, float]:
        """Dictionary of metric values."""
        return {k: v.score for k, v in self.metric_dict.items()}

    def __str__(self) -> str:
        """String representation."""
        return f"Query: {self.query}\n" f"Metrics: {self.metric_vals_dict!s}\n"
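The flow described above (expected_ids supplied by the caller, retrieved_ids and retrieved_texts produced inside aevaluate) can be sketched with a toy stand-in. Everything here is hypothetical: ToyEvaluator, ToyResult, and the word-overlap "retriever" are illustrations only and are not LlamaIndex classes.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyResult:
    """Hypothetical stand-in for RetrievalEvalResult (not the library class)."""

    query: str
    expected_ids: List[str]
    retrieved_ids: List[str]
    retrieved_texts: List[str]
    metric_dict: Dict[str, float]


class ToyEvaluator:
    """Hypothetical evaluator that mirrors the control flow quoted above."""

    def __init__(self, corpus: Dict[str, str], top_k: int = 2) -> None:
        self.corpus = corpus  # node_id -> node text
        self.top_k = top_k

    async def _aget_retrieved_ids_and_texts(
        self, query: str
    ) -> Tuple[List[str], List[str]]:
        # Toy "retriever": rank nodes by the number of words shared with the query.
        words = set(query.lower().split())
        ranked = sorted(
            self.corpus,
            key=lambda nid: -len(words & set(self.corpus[nid].lower().split())),
        )[: self.top_k]
        return ranked, [self.corpus[nid] for nid in ranked]

    async def aevaluate(self, query: str, expected_ids: List[str]) -> ToyResult:
        # expected_ids come from the caller; retrieved_* come from the retriever.
        retrieved_ids, retrieved_texts = await self._aget_retrieved_ids_and_texts(query)
        hits = set(expected_ids) & set(retrieved_ids)
        metric_dict = {
            "precision": len(hits) / len(set(retrieved_ids)),
            "recall": len(hits) / len(set(expected_ids)),
        }
        return ToyResult(query, expected_ids, retrieved_ids, retrieved_texts, metric_dict)

    def evaluate(self, query: str, expected_ids: List[str]) -> ToyResult:
        # Synchronous wrapper, same pattern as the evaluate method quoted above.
        return asyncio.run(self.aevaluate(query, expected_ids))


corpus = {
    "node-1": "precision measures retrieved relevance",
    "node-2": "recall measures coverage",
    "node-3": "bananas are yellow",
}
result = ToyEvaluator(corpus, top_k=2).evaluate("what does recall measure", ["node-2"])
print(result.retrieved_ids, result.metric_dict)
```

The result object carries both the caller-supplied expected_ids and the retriever-supplied retrieved_ids, which is why both show up as columns in eval_results.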
To continue talking to Dosu, mention @dosu.
Yes, expected ids are node ids.
The generate pairs function just generates a question for a given chunk, and querying with that question should retrieve that chunk.
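As a rough illustration of that point, the sketch below builds a miniature dataset from plain dicts. The build_pairs helper, the id scheme, and the placeholder question strings are all hypothetical; the real function asks an LLM to write the questions.

```python
from typing import Dict, List, Tuple

# Hypothetical miniature corpus: node_id -> chunk text.
corpus: Dict[str, str] = {
    "node-a": "LlamaIndex retrievers return scored nodes.",
    "node-b": "Precision divides hits by retrieved nodes.",
}


def build_pairs(
    corpus: Dict[str, str], questions_per_chunk: int = 2
) -> Tuple[Dict[str, str], Dict[str, List[str]]]:
    """Create placeholder questions, each tied to the single chunk it came from."""
    queries: Dict[str, str] = {}
    relevant_docs: Dict[str, List[str]] = {}
    for node_id, text in corpus.items():
        for i in range(questions_per_chunk):
            query_id = f"{node_id}-q{i}"
            # Placeholder for an LLM-generated question about this chunk.
            queries[query_id] = f"Question {i} about: {text}"
            # The expected ids for this query are just the source node id.
            relevant_docs[query_id] = [node_id]
    return queries, relevant_docs


queries, relevant_docs = build_pairs(corpus)
expected_ids = relevant_docs["node-a-q0"]
print(expected_ids)
```

In this layout every query has exactly one expected id, so a retriever that returns several nodes per query cannot reach a precision of 1.0 on it.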
@dosu how does LlamaIndex evaluate precision and recall using the expected_ids and retrieved_ids? I see the code below for the calculation:
class Precision(BaseRetrievalMetric):
    """Precision metric.

    The `K`-value in `Precision@K` usually corresponds to `top_k` of the retriever.

    Attributes:
        metric_name (str): The name of the metric.
    """

    metric_name: ClassVar[str] = "precision"

    def compute(
        self,
        query: Optional[str] = None,
        expected_ids: Optional[List[str]] = None,
        retrieved_ids: Optional[List[str]] = None,
        expected_texts: Optional[List[str]] = None,
        retrieved_texts: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> RetrievalMetricResult:
        """Compute precision based on the provided inputs and selected method.

        Parameters:
            query (Optional[str]): The query string (not used in the current implementation).
            expected_ids (Optional[List[str]]): Expected document IDs.
            retrieved_ids (Optional[List[str]]): Retrieved document IDs.
            expected_texts (Optional[List[str]]): Expected texts (not used in the current implementation).
            retrieved_texts (Optional[List[str]]): Retrieved texts (not used in the current implementation).

        Raises:
            ValueError: If the necessary IDs are not provided.

        Returns:
            RetrievalMetricResult: The result with the computed precision score.
        """
        # Checking for the required arguments
        if (
            retrieved_ids is None
            or expected_ids is None
            or not retrieved_ids
            or not expected_ids
        ):
            raise ValueError("Retrieved ids and expected ids must be provided")

        retrieved_set = set(retrieved_ids)
        expected_set = set(expected_ids)
        precision = len(retrieved_set & expected_set) / len(retrieved_set)

        return RetrievalMetricResult(score=precision)
Are the retrieved_ids extracted using the RAG query_engine? If yes, how can we improve the score? I see that for many of the expected_ids, the number of retrieved_ids is larger.
LlamaIndex evaluates precision and recall using the expected_ids and retrieved_ids as follows:
Precision is calculated as the ratio of the number of relevant documents retrieved to the total number of documents retrieved. The formula used is:
precision = len(set(expected_ids) & set(retrieved_ids)) / len(set(retrieved_ids))
Recall is calculated as the ratio of the number of relevant documents retrieved to the total number of relevant documents. The formula used is:
recall = len(set(expected_ids) & set(retrieved_ids)) / len(expected_ids)
Here are the tests that exercise both computations:
def test_precision(expected_ids, retrieved_ids, expected_result):
    prec = Precision()
    result = prec.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == pytest.approx(expected_result)

def test_recall(expected_ids, retrieved_ids, expected_result):
    recall = Recall()
    result = recall.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == pytest.approx(expected_result)
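The two formulas can also be tried without the library. The function below is a minimal sketch, not the LlamaIndex implementation: precision_recall is a hypothetical helper, and de-duplicating the recall denominator with a set is an assumption made to mirror the Precision code quoted earlier.

```python
from typing import List, Tuple


def precision_recall(
    expected_ids: List[str], retrieved_ids: List[str]
) -> Tuple[float, float]:
    """Set-based precision and recall over node ids."""
    if not expected_ids or not retrieved_ids:
        raise ValueError("Retrieved ids and expected ids must be provided")
    expected_set = set(expected_ids)
    retrieved_set = set(retrieved_ids)  # duplicates count once
    hits = len(expected_set & retrieved_set)
    return hits / len(retrieved_set), hits / len(expected_set)


# "id9" is retrieved twice but only counted once in the denominator.
precision, recall = precision_recall(
    ["id1", "id2", "id3"], ["id1", "id9", "id9", "id3"]
)
print(f"precision={precision:.3f}, recall={recall:.3f}")
```

Because both sides are converted to sets, the order of retrieved_ids does not influence either score; rank-aware metrics such as mrr are needed for that.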
To improve the precision and recall scores when using the RAG query_engine, you can:
By focusing on these areas, you can potentially improve the precision and recall scores of the retrieved documents [1][2].
@dosu How can we limit the number of retrieved_ids? If we have more retrieved_ids, will that affect the precision and recall scores?
To limit the number of retrieved_ids, you can adjust the top_k parameter in your retriever settings. This parameter controls the maximum number of documents to retrieve for each query.
Here's an example of how to set the top_k parameter:
retriever = YourRetrieverClass(top_k=10)  # Set top_k to the desired number of retrieved documents
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)
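Adjust the top_k value to the number of documents you want to retrieve. This will limit the number of retrieved_ids accordingly. Note that YourRetrieverClass is a placeholder; for a retriever built from a vector index the setting is named similarity_top_k, for example index.as_retriever(similarity_top_k=10).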
@dosu if we have more retrieved_ids, will that affect the precision and recall scores?
Yes, having more retrieved_ids can affect both precision and recall scores.
Precision is calculated as the number of relevant retrieved documents divided by the total number of retrieved documents. If you retrieve more documents, the precision can decrease if the additional documents are not relevant.
Recall is calculated as the number of relevant retrieved documents divided by the total number of relevant documents. If you retrieve more documents, the recall can increase if the additional documents include more relevant documents.
For example, if you have the following:
expected_ids = ["id1", "id2"]
retrieved_ids = ["id1", "id7", "id15", "id2"]
Precision: 2/4 = 0.5 (2 relevant documents out of 4 retrieved)
Recall: 2/2 = 1.0 (2 relevant documents retrieved out of 2 total relevant documents)
Thus, increasing the number of retrieved_ids can decrease precision if the additional documents are not relevant, and it can increase recall if the additional documents include more relevant documents [1].
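To see the trade-off on the example ids above, the sketch below recomputes both scores for each cut-off k. It assumes retrieved_ids is ordered by rank so that slicing imitates a smaller top_k; precision_recall_at_k is a hypothetical helper, not a LlamaIndex function.

```python
from typing import Dict, List, Tuple


def precision_recall_at_k(
    expected_ids: List[str], ranked_ids: List[str], k: int
) -> Tuple[float, float]:
    """Precision and recall when only the first k ranked ids are kept."""
    top = set(ranked_ids[:k])
    hits = len(top & set(expected_ids))
    return hits / len(top), hits / len(set(expected_ids))


expected_ids = ["id1", "id2"]
ranked_ids = ["id1", "id7", "id15", "id2"]  # assumed to be in rank order

scores: Dict[int, Tuple[float, float]] = {
    k: precision_recall_at_k(expected_ids, ranked_ids, k) for k in range(1, 5)
}
for k, (p, r) in scores.items():
    print(f"k={k}: precision={p:.2f}, recall={r:.2f}")
```

With k=1 precision is 1.0 and recall 0.5; at k=4 the numbers match the worked example (0.5 and 1.0). Choosing k is therefore a trade-off rather than something to minimize or maximize blindly.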
@dosu why is the length of my retrieved_ids always 20? I see 20 node_ids in all my retrieved_ids, but I have not specified this number anywhere.
The length of your retrieved_ids is always 20 because the retrieve method in the RecursiveRetriever class calls the _retrieve_rec method, which in turn calls the retrieve method of the underlying BaseRetriever object. That underlying retriever appears to return 20 nodes when no other limit is specified.
Here is the relevant code snippet from the RecursiveRetriever class:
def _retrieve_rec(
    self,
    query_bundle: QueryBundle,
    query_id: Optional[str] = None,
    cur_similarity: Optional[float] = None,
) -> Tuple[List[NodeWithScore], List[NodeWithScore]]:
    """Query recursively."""
    if self._verbose:
        print_text(
            f"Retrieving with query id {query_id}: {query_bundle.query_str}\n",
            color="blue",
        )
    query_id = query_id or self._root_id
    cur_similarity = cur_similarity or 1.0

    obj = self._get_object(query_id)
    if isinstance(obj, BaseNode):
        nodes_to_add = [NodeWithScore(node=obj, score=cur_similarity)]
        additional_nodes: List[NodeWithScore] = []
    elif isinstance(obj, BaseRetriever):
        with self.callback_manager.event(
            CBEventType.RETRIEVE,
            payload={EventPayload.QUERY_STR: query_bundle.query_str},
        ) as event:
            nodes = obj.retrieve(query_bundle)
            event.on_end(payload={EventPayload.NODES: nodes})

        nodes_to_add, additional_nodes = self._query_retrieved_nodes(
            query_bundle, nodes
        )
    elif isinstance(obj, BaseQueryEngine):
        sub_resp = obj.query(query_bundle)
        if self._verbose:
            print_text(
                f"Got response: {sub_resp!s}\n",
                color="green",
            )
        # format with both the query and the response
        node_text = self._query_response_tmpl.format(
            query_str=query_bundle.query_str, response=str(sub_resp)
        )
        node = TextNode(text=node_text)
        nodes_to_add = [NodeWithScore(node=node, score=cur_similarity)]
        additional_nodes = sub_resp.source_nodes
    else:
        raise ValueError("Must be a retriever or query engine.")

    return nodes_to_add, additional_nodes
The retrieve method of the BaseRetriever class is likely configured to return 20 nodes by default. To change this behavior, you would need to modify the retrieve method in the BaseRetriever class or pass a parameter to specify a different number of nodes to retrieve [1].
@dosu I use the code below to evaluate ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]:
from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm_model, num_questions_per_chunk=4
)

from llama_index.core.evaluation import RetrieverEvaluator

metrics = ["hit_rate", "mrr"]  # , "precision", "recall", "ap", "ndcg"]
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
In the above qa_dataset, how would I ensure that the queries are multi-context queries?
To ensure that the queries belong to multi-context queries in the qa_dataset, you can modify the get_query_context_lists function to include multiple relevant contexts for each query. Here is an updated version of the function:
def get_query_context_lists(
    query_context_pairs: EmbeddingQAFinetuneDataset,
) -> Tuple[List[str], List[List[str]]]:
    queries = []
    relevant_contexts = []

    # 'query_context_pairs' is an object with 'queries', 'corpus', and 'relevant_docs' attributes
    for query_id, query in query_context_pairs.queries.items():
        # Get all relevant document IDs for the current query
        relevant_doc_ids = query_context_pairs.relevant_docs[query_id]
        # Get the relevant contexts using the relevant document IDs
        relevant_context = [query_context_pairs.corpus[doc_id] for doc_id in relevant_doc_ids]
        # Append the query and the relevant contexts to their respective lists
        queries.append(query)
        relevant_contexts.append(relevant_context)

    return queries, relevant_contexts
This version collects every relevant context listed for a query, so any query whose relevant_docs entry holds several document IDs is handled as a multi-context query. The EmbeddingQAFinetuneDataset class, as defined, supports this structure by having a relevant_docs attribute that maps each query ID to a list of document IDs, allowing for multiple contexts per query.
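The same lookup can be tried on plain dicts to see which queries actually carry more than one context. This is only a sketch under assumptions: the three dicts imitate the queries, corpus, and relevant_docs attributes, and the ids, texts, and the query_context_lists name are made up for illustration.

```python
from typing import Dict, List, Tuple


def query_context_lists(
    queries: Dict[str, str],
    corpus: Dict[str, str],
    relevant_docs: Dict[str, List[str]],
) -> Tuple[List[str], List[List[str]]]:
    """Pair each query with the texts of all documents marked relevant to it."""
    out_queries: List[str] = []
    out_contexts: List[List[str]] = []
    for query_id, query in queries.items():
        out_queries.append(query)
        out_contexts.append([corpus[doc_id] for doc_id in relevant_docs[query_id]])
    return out_queries, out_contexts


# Hypothetical hand-made data: q1 has two relevant nodes, q2 has one.
corpus = {
    "n1": "Chunk about precision.",
    "n2": "Chunk about recall.",
    "n3": "Chunk about MRR.",
}
queries = {"q1": "How do precision and recall differ?", "q2": "What is MRR?"}
relevant_docs = {"q1": ["n1", "n2"], "q2": ["n3"]}

qs, ctxs = query_context_lists(queries, corpus, relevant_docs)
# Keep only the queries that reference more than one context.
multi_context = [q for q, c in zip(qs, ctxs) if len(c) > 1]
print(multi_context)
```

A filter like the last line is one way to check how many multi-context queries a dataset really contains before evaluating on it.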
Additionally, you can use the generate_question_context_pairs function from LlamaIndex to generate a set of (question, context) pairs over a given unstructured text corpus. This function uses an LLM to auto-generate questions from each context chunk. As noted earlier in this thread, each question is generated from a single chunk, so every generated query starts out tied to that one chunk.
from llama_index.core.evaluation import generate_question_context_pairs, EmbeddingQAFinetuneDataset

qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
This approach leverages synthetic data generation to create an evaluation dataset of question and context pairs [1][2][3][4].
Question Validation
Question
I want to evaluate the precision and recall of my RAG application built on LlamaIndex. I am using the code below for evaluation:
In the eval_results I see the columns 'query', 'expected_ids', 'expected_texts', 'retrieved_ids', 'retrieved_texts', and 'metric_dict'. I want to know where the evaluator gets the 'expected_ids', 'retrieved_ids', and 'retrieved_texts' from. Does 'expected_ids' here signify the node_id?