Open adityamity opened 1 week ago
The EmbeddingQAFinetuneDataset object returned by generate_question_context_pairs contains 'queries', 'corpus', and 'relevant_docs', but not 'context'. RetrieverEvaluator performs its evaluation using the 'queries' and 'relevant_docs' fields: 'queries' holds the questions, and 'relevant_docs' holds the corresponding relevant document IDs [1].
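As a rough illustration of that layout, the sketch below uses plain dictionaries in place of the real dataset object. The IDs and texts are made up, and iter_eval_pairs is a hypothetical helper, not part of LlamaIndex; it only shows how each query ID links a question to its expected document IDs.

```python
# Hand-written stand-in for the three fields; all IDs and texts are made up.
dataset = {
    "queries": {"q1": "Where do apples grow?", "q2": "What do bees produce?"},
    "corpus": {"node_0": "Apples grow in orchards.", "node_1": "Bees produce honey."},
    "relevant_docs": {"q1": ["node_0"], "q2": ["node_1"]},
}


def iter_eval_pairs(dataset):
    """Yield (query text, expected node IDs) for every query ID."""
    for query_id, query in dataset["queries"].items():
        yield query, dataset["relevant_docs"][query_id]


pairs = list(iter_eval_pairs(dataset))
for query, expected_ids in pairs:
    print(query, "->", expected_ids)
```

Note that no field holds a generated answer or a ground-truth answer, which is why this dataset alone covers retrieval metrics only.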
To generate ground_truth along with context, you can use the LabelledEvaluatorDataExample class, which includes fields for ground_truth_answer and contexts. This class provides a structured way to generate and evaluate ground_truth along with context within the LlamaIndex framework [2][3].
Regarding evaluation metrics, the RetrieverEvaluator class supports 'hit_rate', 'mrr', 'precision', 'recall', 'ap', 'ndcg', and, optionally, 'cohere_rerank_relevancy'. If you encounter a ValueError for 'precision' and 'recall', it might be due to version-specific changes: the latest versions of LlamaIndex should support these metrics, and the test cases validate their computation [4][5][6].
If you are still facing issues, update to the most recent version of LlamaIndex so that all supported metrics are available.
To continue talking to Dosu, mention @dosu.
@dosu: In the code below, I am passing my query_engine:
from llama_index.core.evaluation import RetrieverEvaluator
metrics = ["hit_rate", "mrr"]  # , "precision", "recall", "ap", "ndcg"]
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=query_engine
)
The type of my query_engine is llama_index.core.query_engine.retriever_query_engine.RetrieverQueryEngine, so I am getting this error:
ValidationError: 1 validation error for RetrieverEvaluator
retriever
instance of BaseRetriever expected (type=type_error.arbitrary_type; expected_arbitrary_type=BaseRetriever)
The error you're encountering occurs because RetrieverEvaluator expects an instance of BaseRetriever, not a RetrieverQueryEngine. You need to pass a BaseRetriever object to the RetrieverEvaluator.
Here's how you can modify your code to obtain a BaseRetriever from your RetrieverQueryEngine:
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.core.query_engine.retriever_query_engine import RetrieverQueryEngine
# Assuming you have a RetrieverQueryEngine instance
query_engine = RetrieverQueryEngine(...)
# Extract the retriever from the query engine via its public `retriever` property
retriever = query_engine.retriever
# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
# Create the RetrieverEvaluator with the extracted retriever
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)
This should resolve the ValidationError by ensuring that the RetrieverEvaluator receives a BaseRetriever instance as expected [1].
@dosu I am using the code below to evaluate ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]:
from llama_index.core.evaluation import (
    generate_question_context_pairs,
    EmbeddingQAFinetuneDataset,
)
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
from llama_index.core.evaluation import RetrieverEvaluator
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
In the above code, qa_dataset contains only 'queries', 'corpus' and 'relevant_docs', which does not meet my requirement: I want to test my RAG based on ground_truth, context and the answer generated by the RAG.
Can you please provide end-to-end code for such a test? I mainly want to evaluate context precision and context recall. It would be great if you could provide the code in the same style as mine.
Here's an end-to-end code example to evaluate a RAG retrieval service based on ground_truth, context, and answer, focusing on context precision and context recall using the LlamaIndex framework:
import nest_asyncio
import logging
import sys
import os
import pandas as pd
from llama_index.core.evaluation import (
    DatasetGenerator,
    FaithfulnessEvaluator,
    RelevancyEvaluator,
    CorrectnessEvaluator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Response,
)
from llama_index.llms.openai import OpenAI
# Attach to the same event-loop
nest_asyncio.apply()
# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # Set logger level to INFO
# Clear out any existing handlers
logger.handlers = []
# Set up the StreamHandler to output to sys.stdout
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO) # Set handler level to INFO
# Add the handler to the logger
logger.addHandler(handler)
# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-..."
# Download and load data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()
# Generate questions
gpt4 = OpenAI(model="gpt-4", temperature=0.1)
dataset_generator = DatasetGenerator.from_documents(documents, llm=gpt4, show_progress=True)
eval_dataset = dataset_generator.generate_dataset_from_nodes(num=20)
eval_queries = list(eval_dataset.queries.values())
# Fix evaluation query
eval_query = "How did the author describe their early attempts at writing short stories?"
# Fix GPT-3.5-TURBO LLM for generating response
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
# Create vector index
vector_index = VectorStoreIndex.from_documents(documents, llm=gpt35)
query_engine = vector_index.as_query_engine()
# Retrieve nodes
retriever = vector_index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve(eval_query)
# Faithfulness Evaluator
faithfulness_evaluator = FaithfulnessEvaluator(llm=gpt4)
response_vector = query_engine.query(eval_query)
eval_result_faithfulness = faithfulness_evaluator.evaluate_response(response=response_vector)
# Relevancy Evaluator
relevancy_evaluator = RelevancyEvaluator(llm=gpt4)
eval_result_relevancy = relevancy_evaluator.evaluate_response(query=eval_query, response=response_vector)
# Correctness Evaluator
correctness_evaluator = CorrectnessEvaluator(llm=gpt4)
query = "Can you explain the theory of relativity proposed by Albert Einstein in detail?"
reference = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).
General relativity, published in 1915, extended these ideas to include the effects of gravity. According to general relativity, gravity is not a force between masses, as described by Newton's theory of gravity, but rather the result of the warping of space and time by mass and energy. Massive objects, such as planets and stars, cause a curvature in spacetime, and smaller objects follow curved paths in response to this curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet, causing it to create a depression that other objects (representing smaller masses) naturally move towards.
In essence, general relativity provided a new understanding of gravity, explaining phenomena like the bending of light by gravity (gravitational lensing) and the precession of the orbit of Mercury. It has been confirmed through numerous experiments and observations and has become a fundamental theory in modern physics.
"""
response = """
Certainly! Albert Einstein's theory of relativity consists of two main components: special relativity and general relativity. Special relativity, published in 1905, introduced the concept that the laws of physics are the same for all non-accelerating observers and that the speed of light in a vacuum is a constant, regardless of the motion of the source or observer. It also gave rise to the famous equation E=mc², which relates energy (E) and mass (m).
However, general relativity, published in 1915, extended these ideas to include the effects of magnetism. According to general relativity, gravity is not a force between masses but rather the result of the warping of space and time by magnetic fields generated by massive objects. Massive objects, such as planets and stars, create magnetic fields that cause a curvature in spacetime, and smaller objects follow curved paths in response to this magnetic curvature. This concept is often illustrated using the analogy of a heavy ball placed on a rubber sheet with magnets underneath, causing it to create a depression that other objects (representing smaller masses) naturally move towards due to magnetic attraction.
"""
correctness_result = correctness_evaluator.evaluate(query=query, response=response, reference=reference)
# Retrieval Evaluation
from llama_index.core.text_splitter import SentenceSplitter
parser = SentenceSplitter(chunk_size=1024, chunk_overlap=100)
nodes = parser(documents)
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)
retrieved_nodes = retriever.retrieve(eval_query)
qa_dataset = generate_question_context_pairs(nodes, llm=gpt4, num_questions_per_chunk=2)
retriever_evaluator = RetrieverEvaluator.from_metric_names(["mrr", "hit_rate"], retriever=retriever)
# Evaluate on a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result_retriever = retriever_evaluator.evaluate(sample_query, sample_expected)
# Evaluate on the entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)
    full_df = pd.DataFrame(metric_dicts)
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    metric_df = pd.DataFrame({"retrievers": [name], "hit_rate": [hit_rate], "mrr": [mrr]})
    return metric_df
display_results("top-2 eval", eval_results)
Additionally, you can leverage the RAGChecker framework for a more comprehensive evaluation of your RAG system:
# Install necessary packages
%pip install -qU ragchecker llama-index
# Import necessary libraries
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from ragchecker.integrations.llama_index import response_to_rag_results
from ragchecker import RAGResults, RAGChecker
from ragchecker.metrics import all_metrics
# Load documents
documents = SimpleDirectoryReader("path/to/your/documents").load_data()
# Create index
index = VectorStoreIndex.from_documents(documents)
# Create query engine
rag_application = index.as_query_engine()
# User query and ground truth answer
user_query = "What is RAGChecker?"
gt_answer = "RAGChecker is an advanced automatic evaluation framework designed to assess and diagnose Retrieval-Augmented Generation (RAG) systems. It provides a comprehensive suite of metrics and tools for in-depth analysis of RAG performance."
# Get response from LlamaIndex
response_object = rag_application.query(user_query)
# Convert to RAGChecker format
rag_result = response_to_rag_results(
    query=user_query,
    gt_answer=gt_answer,
    response_object=response_object,
)
# Create RAGResults object
rag_results = RAGResults.from_dict({"results": [rag_result]})
print(rag_results)
# Initialize RAGChecker
evaluator = RAGChecker(
    extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
    checker_name="bedrock/meta.llama3-70b-instruct-v1:0",
    batch_size_extractor=32,
    batch_size_checker=32,
)
# Evaluate using RAGChecker
evaluator.evaluate(rag_results, all_metrics)
# Print detailed results
print(rag_results)
This integration provides a powerful tool for developing and refining more effective RAG applications by leveraging RAGChecker's comprehensive metrics [1][2][3].
@dosu In the provided solution, context recall and precision are not evaluated. I want to evaluate retrieval with precision and recall.
To evaluate retrieval with precision and recall in the LlamaIndex framework, you can use the Precision and Recall classes provided in the llama_index.core.evaluation.retrieval.metrics module. Below is an end-to-end example that includes the evaluation of retrieval with precision and recall:
import nest_asyncio
import logging
import sys
import os
import pandas as pd
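from llama_index.core.evaluation import (
    DatasetGenerator,
    RetrieverEvaluator,
    generate_question_context_pairs,
)
# Precision and Recall are imported from the retrieval metrics module
from llama_index.core.evaluation.retrieval.metrics import Precision, Recall
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex,
    Response,
)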
from llama_index.llms.openai import OpenAI
# Attach to the same event-loop
nest_asyncio.apply()
# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # Set logger level to INFO
# Clear out any existing handlers
logger.handlers = []
# Set up the StreamHandler to output to sys.stdout
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO) # Set handler level to INFO
# Add the handler to the logger
logger.addHandler(handler)
# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-..."
# Download and load data
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
reader = SimpleDirectoryReader("./data/paul_graham/")
documents = reader.load_data()
# Generate questions
gpt4 = OpenAI(model="gpt-4", temperature=0.1)
dataset_generator = DatasetGenerator.from_documents(documents, llm=gpt4, show_progress=True)
eval_dataset = dataset_generator.generate_dataset_from_nodes(num=20)
eval_queries = list(eval_dataset.queries.values())
# Fix evaluation query
eval_query = "How did the author describe their early attempts at writing short stories?"
# Fix GPT-3.5-TURBO LLM for generating response
gpt35 = OpenAI(temperature=0, model="gpt-3.5-turbo")
# Create vector index
vector_index = VectorStoreIndex.from_documents(documents, llm=gpt35)
query_engine = vector_index.as_query_engine()
# Create the retriever
retriever = vector_index.as_retriever(similarity_top_k=3)
# Generate question-context pairs from all nodes stored in the index,
# not only from the few nodes retrieved for a single query
nodes = list(vector_index.docstore.docs.values())
qa_dataset = generate_question_context_pairs(nodes, llm=gpt4, num_questions_per_chunk=2)
# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
# Create the RetrieverEvaluator with the extracted retriever
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)
# Evaluate on the entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)
    full_df = pd.DataFrame(metric_dicts)
    hit_rate = full_df["hit_rate"].mean()
    mrr = full_df["mrr"].mean()
    precision = full_df["precision"].mean()
    recall = full_df["recall"].mean()
    metric_df = pd.DataFrame({
        "retrievers": [name],
        "hit_rate": [hit_rate],
        "mrr": [mrr],
        "precision": [precision],
        "recall": [recall]
    })
    return metric_df
# Display the results
print(display_results("top-3 eval", eval_results))
In this example:
- The generate_question_context_pairs function generates the question-context pairs.
- RetrieverEvaluator is initialized with the metrics, including precision and recall.
- The aevaluate_dataset method evaluates the entire dataset.
- The display_results function calculates and displays the mean values for hit rate, MRR, precision, and recall.
This setup ensures that you can evaluate retrieval with precision and recall, along with other metrics like hit rate and MRR [1][2].
@dosu - In RAGChecker, how can we create gt_answer and user_query from the nodes? I want to run RAGChecker for a set of user_query and gt_answer pairs and get the response_object for every pair.
Also, do we need credentials to access the AWS Bedrock models below? Can we use a model available in Azure services instead of these two:
extractor_name="bedrock/meta.llama3-70b-instruct-v1:0",
checker_name="bedrock/meta.llama3-70b-instruct-v1:0"
To create gt_answer and user_query from the nodes to run RAGChecker for a set of user queries and ground truth answers, you can use the KnowledgeGraphRAGRetriever and RetrieverQueryEngine from LlamaIndex. Additionally, you can use Azure OpenAI models instead of the AWS Bedrock models.
Here is an example of how to set up and use Azure OpenAI models with LlamaIndex:
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.embeddings.azure_openai import AzureOpenAIEmbedding
from llama_index.core import Settings
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import KnowledgeGraphRAGRetriever
from llama_index.core import StorageContext
from llama_index.graph_stores.nebula import NebulaGraphStore
# Set up Azure OpenAI
api_key = "<api-key>"
azure_endpoint = "https://<your-resource-name>.openai.azure.com/"
api_version = "2023-07-01-preview"
llm = AzureOpenAI(
    model="gpt-35-turbo-16k",
    deployment_name="my-custom-llm",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)
embed_model = AzureOpenAIEmbedding(
    model="text-embedding-ada-002",
    deployment_name="my-custom-embedding",
    api_key=api_key,
    azure_endpoint=azure_endpoint,
    api_version=api_version,
)
Settings.llm = llm
Settings.embed_model = embed_model
Settings.chunk_size = 512
# Set up NebulaGraphStore
space_name = "llamaindex"
edge_types, rel_prop_names = ["relationship"], ["relationship"]
tags = ["entity"]
graph_store = NebulaGraphStore(
    space_name=space_name,
    edge_types=edge_types,
    rel_prop_names=rel_prop_names,
    tags=tags,
)
storage_context = StorageContext.from_defaults(graph_store=graph_store)
# Set up KnowledgeGraphRAGRetriever
graph_rag_retriever = KnowledgeGraphRAGRetriever(
    storage_context=storage_context,
    verbose=True,
)
# Set up RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(
    graph_rag_retriever,
)
# Example user query
user_query = "Tell me about Peter Quill?"
# Perform the query
response = query_engine.query(user_query)
# Display the response
from IPython.display import display, Markdown
display(Markdown(f"<b>{response}</b>"))
# Ground truth answer (example)
gt_answer = "Peter Quill, also known as Star-Lord, is a fictional character and superhero in the Marvel Cinematic Universe."
# Use the response and ground truth answer for RAGChecker
This setup allows you to use Azure OpenAI models for your LLM and embedding needs, and you can perform queries using the RetrieverQueryEngine with KnowledgeGraphRAGRetriever. You can then use the responses and ground truth answers to run RAGChecker.
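The example above runs a single query. A minimal sketch of looping over several (user_query, gt_answer) pairs follows. It assumes query_fn stands in for query_engine.query and convert_fn for response_to_rag_results; build_rag_results_payload and the two stub functions are hypothetical names used only so the sketch runs without any external service.

```python
from typing import Any, Callable, Dict, List, Tuple


def build_rag_results_payload(
    pairs: List[Tuple[str, str]],
    query_fn: Callable[[str], Any],
    convert_fn: Callable[..., Dict[str, Any]],
) -> Dict[str, List[Dict[str, Any]]]:
    """Run every query and collect one converted result per pair."""
    results = []
    for user_query, gt_answer in pairs:
        response_object = query_fn(user_query)
        results.append(
            convert_fn(
                query=user_query,
                gt_answer=gt_answer,
                response_object=response_object,
            )
        )
    return {"results": results}


# Stub callables (assumptions) so the sketch is runnable on its own.
def fake_query(user_query: str) -> str:
    return f"answer to: {user_query}"


def fake_convert(query: str, gt_answer: str, response_object: Any) -> Dict[str, Any]:
    return {"query": query, "gt_answer": gt_answer, "response": str(response_object)}


pairs = [
    ("What is RAGChecker?", "An evaluation framework for RAG systems."),
    ("Tell me about Peter Quill?", "Peter Quill is also known as Star-Lord."),
]
payload = build_rag_results_payload(pairs, fake_query, fake_convert)
print(len(payload["results"]))
```

With the real callables substituted, the returned dictionary has the same shape as the argument passed to RAGResults.from_dict in the earlier RAGChecker example.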
Additionally, you can optimize your prompts for better RAG performance by following the "Optimization by Prompting" approach. This involves using a meta-prompt to generate better instruction prefixes for your QA prompt template, and evaluating the correctness of the generated answers against a ground-truth dataset. This process can be automated and iteratively improved using LlamaIndex's tools and OpenAI models [1][2].
@dosu - What does the output of generate_question_context_pairs contain, and how does RetrieverEvaluator use it to evaluate precision and recall?
The generate_question_context_pairs function generates a set of (question, context) pairs from a given unstructured text corpus using a language model (LLM). It returns an EmbeddingQAFinetuneDataset object containing queries, relevant_docs, and corpus.
Here's a brief overview of the function and its usage:
generate_question_context_pairs Function
This function generates question-context pairs from a list of nodes (text chunks). It uses an LLM to auto-generate questions from each context chunk.
from llama_index.core.evaluation import generate_question_context_pairs, EmbeddingQAFinetuneDataset
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
The EmbeddingQAFinetuneDataset object contains:
- queries: A dictionary mapping query IDs to query strings.
- corpus: A dictionary mapping document IDs to document strings.
- relevant_docs: A dictionary mapping query IDs to lists of relevant document IDs.
RetrieverEvaluator Class
The RetrieverEvaluator class uses the generated (question, context) pairs to evaluate retrieval performance. It calculates metrics such as precision and recall by comparing the retrieved results against the ground-truth context.
The Precision class calculates precision as follows:
- Convert retrieved_ids and expected_ids to sets.
- Divide the size of their intersection by the size of retrieved_set.
retrieved_set = set(retrieved_ids)
expected_set = set(expected_ids)
precision = len(retrieved_set & expected_set) / len(retrieved_set)
The Recall class calculates recall as follows:
- Convert retrieved_ids and expected_ids to sets.
- Divide the size of their intersection by the size of expected_set.
retrieved_set = set(retrieved_ids)
expected_set = set(expected_ids)
recall = len(retrieved_set & expected_set) / len(expected_set)
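To make the two formulas concrete, here is a small worked example in plain Python. The function name precision_recall and the node IDs are made up for illustration; only the set arithmetic comes from the snippets above.

```python
def precision_recall(expected_ids, retrieved_ids):
    """Apply the set-based precision and recall formulas shown above."""
    retrieved_set = set(retrieved_ids)
    expected_set = set(expected_ids)
    overlap = len(retrieved_set & expected_set)
    return overlap / len(retrieved_set), overlap / len(expected_set)


# Four relevant nodes, two retrieved, one of which is relevant.
precision, recall = precision_recall(
    expected_ids=["node_1", "node_2", "node_3", "node_4"],
    retrieved_ids=["node_5", "node_1"],
)
print(precision, recall)  # 0.5 0.25
```

Half of the retrieved nodes are relevant (precision 0.5), while only one of the four relevant nodes was found (recall 0.25).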
Here is an example of how to use RetrieverEvaluator to evaluate precision and recall:
from llama_index.core.evaluation import RetrieverEvaluator
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)
# Evaluation on the entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
The aevaluate_dataset method performs batch mode evaluation, which is more efficient than evaluating each query separately [1][2][3].
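As an illustration of why batching can help, the sketch below runs several evaluations concurrently with asyncio while capping the number in flight. This is a generic pattern, not a copy of the library's implementation; evaluate_one, evaluate_batch, and the workers parameter are hypothetical names, and the sleep call merely simulates retrieval latency.

```python
import asyncio


async def evaluate_one(query, expected_ids):
    """Hypothetical stand-in for evaluating a single query."""
    await asyncio.sleep(0.01)  # simulated retrieval latency
    return {"query": query, "num_expected": len(expected_ids)}


async def evaluate_batch(queries, relevant_docs, workers=2):
    """Evaluate all queries concurrently, at most `workers` at a time."""
    semaphore = asyncio.Semaphore(workers)

    async def worker(query_id):
        async with semaphore:
            return await evaluate_one(queries[query_id], relevant_docs[query_id])

    # gather returns results in the same order as the submitted tasks
    return await asyncio.gather(*(worker(query_id) for query_id in queries))


queries = {"q1": "first question", "q2": "second question", "q3": "third question"}
relevant_docs = {"q1": ["node_0"], "q2": ["node_1", "node_2"], "q3": ["node_3"]}
results = asyncio.run(evaluate_batch(queries, relevant_docs))
print([result["query"] for result in results])
```

In a notebook, where an event loop is already running, the last call would be written as `results = await evaluate_batch(queries, relevant_docs)` instead of using asyncio.run.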
@dosu: I checked the Precision class and want to understand where it takes expected_ids and retrieved_ids from. Where and how should I get these expected_ids and retrieved_ids?
class Precision(BaseRetrievalMetric):
    """Precision metric.
    The `K`-value in `Precision@K` usually corresponds to `top_k` of the retriever.
    Attributes:
        metric_name (str): The name of the metric.
    """
    metric_name: ClassVar[str] = "precision"
    def compute(
        self,
        query: Optional[str] = None,
        expected_ids: Optional[List[str]] = None,
        retrieved_ids: Optional[List[str]] = None,
        expected_texts: Optional[List[str]] = None,
        retrieved_texts: Optional[List[str]] = None,
        **kwargs: Any,
    ) -> RetrievalMetricResult:
        """Compute precision based on the provided inputs and selected method.
        Parameters:
            query (Optional[str]): The query string (not used in the current implementation).
            expected_ids (Optional[List[str]]): Expected document IDs.
            retrieved_ids (Optional[List[str]]): Retrieved document IDs.
            expected_texts (Optional[List[str]]): Expected texts (not used in the current implementation).
            retrieved_texts (Optional[List[str]]): Retrieved texts (not used in the current implementation).
        Raises:
            ValueError: If the necessary IDs are not provided.
        Returns:
            RetrievalMetricResult: The result with the computed precision score.
        """
        # Checking for the required arguments
        if (
            retrieved_ids is None
            or expected_ids is None
            or not retrieved_ids
            or not expected_ids
        ):
            raise ValueError("Retrieved ids and expected ids must be provided")
        retrieved_set = set(retrieved_ids)
        expected_set = set(expected_ids)
        precision = len(retrieved_set & expected_set) / len(retrieved_set)
        return RetrievalMetricResult(score=precision)
The Precision class in LlamaIndex obtains expected_ids and retrieved_ids as parameters to its compute method. These IDs are typically provided as input to the test cases, as seen in the test_precision function.
You should get these expected_ids and retrieved_ids from the data you are evaluating. The expected_ids are the ground truth, that is, the correct set of IDs that you expect to retrieve, while the retrieved_ids are the set of IDs that your retrieval system actually retrieved.
Here is an example of how expected_ids and retrieved_ids are used in the test cases:
@pytest.mark.parametrize(
    ("expected_ids", "retrieved_ids", "expected_result"),
    [
        (["id1", "id2", "id3"], ["id3", "id1", "id2", "id4"], 3 / 4),
        (["id1", "id2", "id3", "id4"], ["id5", "id1"], 1 / 2),
        (["id1", "id2"], ["id3", "id4"], 0 / 2),
        (["id1", "id2"], ["id2", "id1", "id7"], 2 / 3),
        (["id1", "id2", "id3"], ["id3", "id1", "id2", "id4"], 3 / 4),
        (["id1", "id2", "id3", "id4"], ["id1", "id2", "id5"], 2 / 3),
        (["id1", "id2"], ["id1", "id7", "id15", "id2"], 2 / 4),
    ],
)
def test_precision(expected_ids, retrieved_ids, expected_result):
    prec = Precision()
    result = prec.compute(expected_ids=expected_ids, retrieved_ids=retrieved_ids)
    assert result.score == pytest.approx(expected_result)
In this example, expected_ids and retrieved_ids are provided as parameters to the test_precision function, which then uses them to compute the precision score.
The Precision class itself ensures that both expected_ids and retrieved_ids are provided and not empty before performing the computation. If these IDs are not provided, it raises a ValueError. The precision is then calculated as the ratio of the size of the intersection of retrieved_ids and expected_ids to the total number of distinct retrieved_ids [1][2].
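The guard clause and the ratio can be tried out without installing anything. The function below is a standalone re-implementation written for illustration (the name precision and the returned float are assumptions; the real class returns a RetrievalMetricResult). It reproduces two of the quoted test cases and shows the error raised for an empty list.

```python
def precision(expected_ids, retrieved_ids):
    """Standalone sketch of the quoted compute() logic."""
    # `not x` is true for both None and an empty list
    if not retrieved_ids or not expected_ids:
        raise ValueError("Retrieved ids and expected ids must be provided")
    retrieved_set = set(retrieved_ids)
    expected_set = set(expected_ids)
    return len(retrieved_set & expected_set) / len(retrieved_set)


cases = [
    (["id1", "id2", "id3"], ["id3", "id1", "id2", "id4"], 3 / 4),
    (["id1", "id2"], ["id3", "id4"], 0.0),
]
for expected_ids, retrieved_ids, want in cases:
    assert abs(precision(expected_ids, retrieved_ids) - want) < 1e-9

error_message = None
try:
    precision(["id1"], [])
except ValueError as err:
    error_message = str(err)
print(error_message)
```

Because both lists are converted to sets, a duplicated ID in retrieved_ids counts only once in the denominator.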
@dosu How can I create data with expected_ids (ground truth) and retrieved_ids? Do I need to use nodes to create it? Can you provide code that prepares such data, with expected_ids and retrieved_ids, from a set of sample data?
To create a dataset with expected_ids (ground truth) and retrieved_ids, you can use the following code. This code prepares data with expected_ids and retrieved_ids using sample data:
import uuid
import re
from typing import List, Tuple
from llama_index.core.evaluation.retrieval.base import RetrievalEvalMode, RetrievalEvalResult, BaseRetrievalEvaluator
from llama_index.core.evaluation.retrieval.metrics import resolve_metrics
from llama_index.core.evaluation import LabelledQADataset, generate_question_context_pairs
QUERY_STR_TEMPLATE = "How can I sign a {symbol}?"
def asl_create_labelled_retrieval_dataset(
    reg_ex, nodes, mode
) -> LabelledQADataset:
    """Returns a QALabelledDataset that provides the expected node IDs
    for every query.
    NOTE: this is specific to the ASL use-case.
    """
    queries = {}
    relevant_docs = {}
    for node in nodes:
        # find the letter associated with the image/text node
        if mode == "image":
            string_to_search = node.metadata["file_path"]
        elif mode == "text":
            string_to_search = node.text
        else:
            raise ValueError(
                "Unsupported mode. Please enter 'image' or 'text'."
            )
        match = re.search(reg_ex, string_to_search)
        if match:
            # build the query
            query = QUERY_STR_TEMPLATE.format(symbol=match.group(1))
            id_ = str(uuid.uuid4())
            # store the query and expected ids pair
            queries[id_] = query
            relevant_docs[id_] = [node.id_]
    return LabelledQADataset(
        queries=queries, relevant_docs=relevant_docs, corpus={}, mode=mode
    )
# Sample data
image_nodes = [
    # Add your image nodes here
]
text_nodes = [
    # Add your text nodes here
]
# labelled dataset for image retrieval
qa_dataset_image = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+).jpg)", image_nodes, "image"
)
# labelled dataset for text retrieval
qa_dataset_text = asl_create_labelled_retrieval_dataset(
    r"(?:To sign ([A-Z]+) in ASL:)", text_nodes, "text"
)
# labelled dataset for text-desc
qa_dataset_text_desc = asl_create_labelled_retrieval_dataset(
    r"(?:([A-Z]+).jpg)", image_nodes, "image"
)
# Using RetrieverEvaluator
class CustomRetrieverEvaluator(BaseRetrievalEvaluator):
    async def _aget_retrieved_ids_and_texts(
        self, query: str, mode: RetrievalEvalMode = RetrievalEvalMode.TEXT
    ) -> Tuple[List[str], List[str]]:
        # Implement your retrieval logic here
        retrieved_ids = ["sample_id1", "sample_id2"]
        retrieved_texts = ["sample_text1", "sample_text2"]
        return retrieved_ids, retrieved_texts
retriever_evaluator = CustomRetrieverEvaluator.from_metric_names(
    ["mrr", "hit_rate"]
)
# Evaluate a single query
retrieval_result = retriever_evaluator.evaluate(
    query="How can I sign a B?", expected_ids=["node_id1", "node_id2"]
)
print(retrieval_result)
# Generate synthetic dataset (assumes an `llm` instance is already defined)
qa_dataset = generate_question_context_pairs(
    nodes=text_nodes, llm=llm, num_questions_per_chunk=2
)
# Evaluate the dataset in batch mode
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
This code defines a function asl_create_labelled_retrieval_dataset that creates a LabelledQADataset object pairing every query with its expected_ids (ground truth). The retrieved_ids are not stored in the dataset; the evaluator obtains them from the retrieval logic at evaluation time. You can replace the sample data with your actual data. The code also shows how to evaluate the dataset with a retrieval evaluator: the CustomRetrieverEvaluator class is a subclass of BaseRetrievalEvaluator and implements the _aget_retrieved_ids_and_texts method to provide custom retrieval logic [1][2][3].
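For a feel of how expected_ids and retrieved_ids line up per query, here is a self-contained toy in plain Python. Everything in it is made up for illustration: the corpus, the queries, and toy_retrieve, a word-overlap ranking that stands in for a real retriever.

```python
# Made-up corpus: node ID -> chunk text
corpus = {
    "node_0": "apples grow in orchards",
    "node_1": "orchards need bees for pollination",
    "node_2": "bees produce honey",
}
# Ground truth: query ID -> query text, and query ID -> expected node IDs
queries = {"q1": "orchards bees", "q2": "apples honey"}
relevant_docs = {"q1": ["node_0", "node_1"], "q2": ["node_2"]}


def toy_retrieve(query, top_k=2):
    """Rank nodes by shared words; a hypothetical stand-in for a retriever."""
    query_words = set(query.lower().split())
    scored = [
        (len(query_words & set(text.split())), node_id)
        for node_id, text in corpus.items()
    ]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [node_id for score, node_id in ranked[:top_k] if score > 0]


rows = []
for query_id, query in queries.items():
    expected_ids = relevant_docs[query_id]
    retrieved_ids = toy_retrieve(query)
    overlap = len(set(expected_ids) & set(retrieved_ids))
    rows.append(
        {
            "query_id": query_id,
            "expected_ids": expected_ids,
            "retrieved_ids": retrieved_ids,
            "precision": overlap / len(retrieved_ids) if retrieved_ids else 0.0,
            "recall": overlap / len(expected_ids),
        }
    )

for row in rows:
    print(row)
```

Here q1 retrieves both expected nodes, while q2 retrieves one relevant and one irrelevant node, giving precision 0.5 and recall 1.0.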
@dosu How would I know that the evaluation works correctly on multi-context queries? Are there any checks available? Please help me with a step-by-step process for evaluating precision, recall and other metrics, given that I have a retriever, nodes, and vector_index available.
To ensure that the evaluation is working correctly on multi-context queries and to evaluate precision, recall, and other metrics, you can follow these steps using the RetrieverEvaluator class from LlamaIndex. Here is a step-by-step process:
First, load and parse your data into nodes.
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
# Load and parse data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
# Set consistent node IDs
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"
Initialize your language model (LLM) and create a vector index.
from llama_index.llms.openai import OpenAI
# Initialize LLM
llm = OpenAI(model="gpt-4")
# Create vector index
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)
Generate question-context pairs using the generate_question_context_pairs function.
from llama_index.core.evaluation import generate_question_context_pairs
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
Define the metrics you want to evaluate and initialize the RetrieverEvaluator.
from llama_index.core.evaluation import RetrieverEvaluator
# Define metrics
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)
Evaluate a sample query to ensure the setup is correct.
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)
Evaluate the entire dataset to get comprehensive metrics.
import pandas as pd
# Evaluate the dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
# Display results
def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)
    full_df = pd.DataFrame(metric_dicts)
    columns = {
        "retrievers": [name],
        **{k: [full_df[k].mean()] for k in metrics},
    }
    metric_df = pd.DataFrame(columns)
    return metric_df
print(display_results("top-2 eval", eval_results))
To ensure the evaluation is working correctly, check the following:
Here is the complete example code:
import nest_asyncio
import logging
import sys
import os
import pandas as pd
from llama_index.core.evaluation import (
    generate_question_context_pairs,
    RetrieverEvaluator
)
from llama_index.core import (
    SimpleDirectoryReader,
    VectorStoreIndex
)
from llama_index.core.node_parser import SentenceSplitter
from llama_index.llms.openai import OpenAI
# Attach to the same event-loop
nest_asyncio.apply()
# Set up the root logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # Set logger level to INFO
# Clear out any existing handlers
logger.handlers = []
# Set up the StreamHandler to output to sys.stdout
handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO) # Set handler level to INFO
# Add the handler to the logger
logger.addHandler(handler)
# Set OpenAI API key
os.environ["OPENAI_API_KEY"] = "sk-..."
# Load and parse data
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()
node_parser = SentenceSplitter(chunk_size=512)
nodes = node_parser.get_nodes_from_documents(documents)
# Set consistent node IDs
for idx, node in enumerate(nodes):
    node.id_ = f"node_{idx}"
# Initialize LLM and vector index
llm = OpenAI(model="gpt-4")
vector_index = VectorStoreIndex(nodes)
retriever = vector_index.as_retriever(similarity_top_k=2)
# Generate question-context pairs
qa_dataset = generate_question_context_pairs(
    nodes, llm=llm, num_questions_per_chunk=2
)
# Define metrics
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)
# Evaluate a sample query
sample_id, sample_query = list(qa_dataset.queries.items())[0]
sample_expected = qa_dataset.relevant_docs[sample_id]
eval_result = retriever_evaluator.evaluate(sample_query, sample_expected)
print(eval_result)
# Evaluate the entire dataset
eval_results = await retriever_evaluator.aevaluate_dataset(qa_dataset)
# Display results
def display_results(name, eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)
    full_df = pd.DataFrame(metric_dicts)
    columns = {
        "retrievers": [name],
        **{k: [full_df[k].mean()] for k in metrics},
    }
    metric_df = pd.DataFrame(columns)
    return metric_df
print(display_results("top-2 eval", eval_results))
This setup ensures that you can evaluate precision, recall, and other metrics for multi-context queries using a retriever, nodes, and vector_index [1][2][3].
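Note that the top-level await used for aevaluate_dataset works in a notebook, but in a plain .py script it is a syntax error. One way around this is to wrap the call in a coroutine and start it with asyncio.run. The sketch below uses a stand-in coroutine, fake_aevaluate_dataset, which is hypothetical and only takes the place of retriever_evaluator.aevaluate_dataset so the example runs without an index or API key:

```python
import asyncio


async def fake_aevaluate_dataset(dataset):
    """Hypothetical stand-in for retriever_evaluator.aevaluate_dataset(qa_dataset)."""
    await asyncio.sleep(0)  # yield control, as a real async call would
    return [{"hit_rate": 1.0, "mrr": 0.5} for _ in dataset]


async def main():
    # In real code this would await retriever_evaluator.aevaluate_dataset(qa_dataset)
    return await fake_aevaluate_dataset(["q_0", "q_1"])


eval_results = asyncio.run(main())
print(eval_results)
```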
@dosu, how can i save the eval_results as dataframe
You can save the eval_results as a DataFrame by first converting the evaluation results into a list of dictionaries and then creating a DataFrame from that list. Here’s how you can do it:
import pandas as pd
# Function to convert eval_results to a DataFrame
def eval_results_to_dataframe(eval_results):
    metric_dicts = []
    for eval_result in eval_results:
        metric_dict = eval_result.metric_vals_dict
        metric_dicts.append(metric_dict)
    # Create a DataFrame from the list of dictionaries
    df = pd.DataFrame(metric_dicts)
    return df
# Convert eval_results to DataFrame
eval_results_df = eval_results_to_dataframe(eval_results)
# Save the DataFrame to a CSV file
eval_results_df.to_csv("eval_results.csv", index=False)
# Optionally, display the DataFrame
print(eval_results_df)
This code defines a function eval_results_to_dataframe that converts the eval_results into a DataFrame. It then saves the DataFrame to a CSV file named eval_results.csv. You can modify the file name and path as needed.
@dosu i want to convert eval_results to dataframe which should contain columns like query, expected_ids, expected_texts, retrieved_ids, retrieved_texts, metric_dict
To convert eval_results to a DataFrame with columns like query, expected_ids, expected_texts, retrieved_ids, retrieved_texts, and metric_dict, you can use the following code:
import pandas as pd
# Assuming eval_results is a list of RetrievalEvalResult objects
def convert_eval_results_to_dataframe(eval_results):
    data = []
    for result in eval_results:
        data.append({
            'query': result.query,
            'expected_ids': result.expected_ids,
            'expected_texts': result.expected_texts,
            'retrieved_ids': result.retrieved_ids,
            'retrieved_texts': result.retrieved_texts,
            'metric_dict': result.metric_vals_dict
        })
    df = pd.DataFrame(data)
    return df
# Example usage
# eval_results = [...] # List of RetrievalEvalResult objects
df = convert_eval_results_to_dataframe(eval_results)
print(df)
# Save the DataFrame to a CSV file
df.to_csv("eval_results.csv", index=False)
This code will create a DataFrame with the specified columns from the eval_results list. The metric_dict column will contain the dictionary of metric values for each evaluation result [1][2].
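One practical wrinkle: list and dict values are written to CSV as their string representation, which is awkward to parse back later. A possible workaround, sketched here with only the standard library and invented rows, is to promote each metric to its own column and JSON-encode the list columns before writing:

```python
import csv
import io
import json

# Invented rows shaped like the per-query records described above
rows = [
    {
        "query": "What is the capital of France?",
        "expected_ids": ["node_1"],
        "retrieved_ids": ["node_1", "node_4"],
        "metric_dict": {"hit_rate": 1.0, "mrr": 1.0},
    }
]


def flatten(row):
    """Promote each metric to its own column and JSON-encode list values."""
    flat = {}
    for key, value in row.items():
        if key == "metric_dict":
            flat.update(value)
        elif isinstance(value, list):
            flat[key] = json.dumps(value)
        else:
            flat[key] = value
    return flat


flat_rows = [flatten(r) for r in rows]
buffer = io.StringIO()  # swap for open("eval_results.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=list(flat_rows[0]))
writer.writeheader()
writer.writerows(flat_rows)
print(buffer.getvalue())
```

When reading the file back, json.loads on the list columns restores the original lists.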
@dosu: In the above steps, can I generate questions using the RAGAS testset generator and run the llama_index evaluator on the questions generated by RAGAS?
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)
# Change resulting question type distribution
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
# use generator.generate_with_llamaindex_docs if you use llama-index as document loader
testset = generator.generate_with_langchain_docs(documents, 10, distributions)
testset.to_pandas()
Yes, you can generate questions using the RAGAS testset generator and run the LlamaIndex evaluator on the questions generated by RAGAS. Here’s how you can integrate the RAGAS testset generator with LlamaIndex for evaluation:
First, generate the testset using the RAGAS testset generator.
# Imports below follow the ragas 0.1.x layout
from ragas.testset.generator import TestsetGenerator
from ragas.testset.evolutions import simple, multi_context, reasoning
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
# Initialize LLMs and embeddings (from_langchain expects LangChain models)
generator_llm = ChatOpenAI(model="gpt-4")
critic_llm = ChatOpenAI(model="gpt-4")
embeddings = OpenAIEmbeddings(model="text-embedding-ada-002")
# Create the testset generator
generator = TestsetGenerator.from_langchain(
    generator_llm,
    critic_llm,
    embeddings
)
# Define the question type distribution (keys are ragas evolution objects, not strings)
distributions = {
    simple: 0.5,
    multi_context: 0.4,
    reasoning: 0.1
}
# Generate the testset
documents = [...] # Your documents here
# (use generate_with_llamaindex_docs for documents loaded with llama-index)
testset = generator.generate_with_langchain_docs(documents, 10, distributions)
testset_df = testset.to_pandas()
print(testset_df)
Next, prepare the LlamaIndex evaluator to evaluate the generated questions.
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.core import VectorStoreIndex
# Initialize your vector index and retriever
vector_index = VectorStoreIndex.from_documents(documents)
retriever = vector_index.as_retriever(similarity_top_k=2)
# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)
Evaluate the generated questions using the LlamaIndex evaluator.
import pandas as pd
# Convert the testset to a format compatible with LlamaIndex
queries = testset_df['question'].tolist()
expected_ids = testset_df['expected_ids'].tolist()
expected_texts = testset_df['expected_texts'].tolist()
# Create a list to store evaluation results
eval_results = []
# Evaluate each query
for query, exp_ids, exp_texts in zip(queries, expected_ids, expected_texts):
    result = retriever_evaluator.evaluate(query, exp_ids)
    eval_results.append({
        'query': query,
        'expected_ids': exp_ids,
        'expected_texts': exp_texts,
        'retrieved_ids': result.retrieved_ids,
        'retrieved_texts': result.retrieved_texts,
        'metric_dict': result.metric_vals_dict
    })
# Convert evaluation results to a DataFrame
eval_results_df = pd.DataFrame(eval_results)
print(eval_results_df)
# Save the DataFrame to a CSV file
eval_results_df.to_csv("eval_results.csv", index=False)
This setup allows you to generate questions using the RAGAS testset generator and evaluate them using the LlamaIndex evaluator, capturing metrics such as precision, recall, and others [1][2][3][4].
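For multi_context questions, where several documents are relevant to one query, precision and recall are the metrics that differ most from hit rate. As a rough illustration, here is a set-based sketch of the standard definitions; it is not taken from the LlamaIndex source, and the IDs are invented:

```python
def precision_recall(expected_ids, retrieved_ids):
    """Set-based precision and recall for a single query."""
    expected, retrieved = set(expected_ids), set(retrieved_ids)
    overlap = len(expected & retrieved)
    precision = overlap / len(retrieved) if retrieved else 0.0
    recall = overlap / len(expected) if expected else 0.0
    return precision, recall


# Three relevant nodes, top-2 retrieval, one of the two results is relevant
p, r = precision_recall(["node_1", "node_2", "node_3"], ["node_1", "node_9"])
print(p, r)
```

With similarity_top_k=2 and three relevant nodes, recall can never exceed 2/3, so a larger top-k may be worth trying for multi-context queries.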
@dosu which version of llama-index includes metrics like precision , recall. I am currently using Version: 0.10.13.post1 of llama-index but when i ran the code:
from llama_index.core.evaluation import RetrieverEvaluator
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
retriever_evaluator = RetrieverEvaluator.from_metric_names(
    metrics, retriever=retriever
)
i got the error as:
ValueError: Invalid metric name: precision
The version of llama-index that includes metrics like precision and recall is 0.10.46. Since you are using version 0.10.13.post1 and encountering a ValueError: Invalid metric name: precision, this version does not include the precision and recall metrics. You need to upgrade to version 0.10.46 or later to access these metrics [1][2].
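If it helps, the installed version can be checked from Python before building the evaluator. The sketch below is deliberately simplistic: release_tuple and supports_precision_recall are hypothetical helpers, the comparison looks only at the leading numeric components (so a suffix like .post1 is ignored), and the 0.10.46 threshold is taken from the answer above rather than verified independently:

```python
from importlib import metadata


def release_tuple(version):
    """Leading numeric components of a version string, e.g. '0.10.13.post1' -> (0, 10, 13)."""
    parts = []
    for piece in version.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def supports_precision_recall(version, minimum="0.10.46"):
    """True if the version is at least the assumed minimum release."""
    return release_tuple(version) >= release_tuple(minimum)


try:
    installed = metadata.version("llama-index")
except metadata.PackageNotFoundError:
    installed = None  # package not installed in this environment

print(installed)
print(supports_precision_recall("0.10.13.post1"))
print(supports_precision_recall("0.10.46"))
```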
@dosu in the above code where you have created testset_df using Ragas TestSetGenerator, I see an issue in the below lines of code provided by you:
# Convert the testset to a format compatible with LlamaIndex
queries = testset_df['question'].tolist()
expected_ids = testset_df['expected_ids'].tolist()
expected_texts = testset_df['expected_texts'].tolist()
testset_df created by the Ragas TestSetGenerator does not have 'expected_ids' and 'expected_texts'. It contains only 'question', 'contexts', 'ground_truth', 'evolution_type', 'metadata', and 'episode_done'.
To convert the test set generated by RAGAS TestSetGenerator to a format compatible with LlamaIndex for evaluation, you need to map the fields from the RAGAS test set to the fields expected by LlamaIndex's LabelledEvaluatorDataExample. Specifically, you need to extract the expected_ids and expected_texts from the RAGAS test set.
Here is a Python function that performs this conversion:
from typing import List, Dict
def convert_ragas_to_llamaindex(ragas_testset: List[Dict]) -> List[Dict]:
    llamaindex_testset = []
    for item in ragas_testset:
        query = item['question']
        expected_ids = [context['id'] for context in item['contexts']]
        expected_texts = [context['text'] for context in item['contexts']]
        llamaindex_testset.append({
            'query': query,
            'expected_ids': expected_ids,
            'expected_texts': expected_texts,
            'retrieved_ids': [], # This will be filled during evaluation
            'retrieved_texts': [], # This will be filled during evaluation
            'mode': 'text', # Assuming text mode
            'metric_dict': {} # This will be filled during evaluation
        })
    return llamaindex_testset
# Example usage
ragas_testset = [
    {
        'question': 'What is the capital of France?',
        'contexts': [{'id': '1', 'text': 'Paris is the capital of France.'}],
        'ground_truth': 'Paris',
        'evolution_type': 'static',
        'metadata': {},
        'episode_done': True
    }
]
llamaindex_testset = convert_ragas_to_llamaindex(ragas_testset)
print(llamaindex_testset)
This function assumes that each context in the RAGAS test set has an id and text field. The expected_ids and expected_texts are extracted from these contexts and used to create the LlamaIndex-compatible format. The retrieved_ids, retrieved_texts, and metric_dict fields are initialized as empty and will be populated during the evaluation process. The mode is set to 'text' as per the LabelledEvaluatorDataExample class in LlamaIndex [1][2][3].
You can then use this converted test set with the RetrieverEvaluator for evaluation:
from llama_index.core.evaluation import RetrieverEvaluator
import pandas as pd
# Initialize your vector index and retriever
vector_index = VectorStoreIndex.from_documents(documents)
retriever = vector_index.as_retriever(similarity_top_k=2)
# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)
# Evaluate the dataset
eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset)
# Convert eval_results to DataFrame
def convert_eval_results_to_dataframe(eval_results):
    data = []
    for result in eval_results:
        data.append({
            'query': result.query,
            'expected_ids': result.expected_ids,
            'expected_texts': result.expected_texts,
            'retrieved_ids': result.retrieved_ids,
            'retrieved_texts': result.retrieved_texts,
            'metric_dict': result.metric_vals_dict
        })
    df = pd.DataFrame(data)
    return df
# Example usage
df = convert_eval_results_to_dataframe(eval_results)
print(df)
# Save the DataFrame to a CSV file
df.to_csv("eval_results.csv", index=False)
This will allow you to evaluate the questions generated by RAGAS using the LlamaIndex evaluator and save the results in a DataFrame.
@dosu the contexts column created by RAGAS does not have 'id'; it only contains a list of answers.
To convert the RAGAS test set, which contains a contexts column with a list of answers but no id, to a format compatible with LlamaIndex for evaluation, you can generate unique IDs for each context. Here’s how you can do it:
First, convert the RAGAS test set to include unique IDs for each context.
import pandas as pd
import uuid
from typing import List, Dict
def convert_ragas_to_llamaindex(ragas_testset: pd.DataFrame) -> List[Dict]:
    llamaindex_testset = []
    for _, row in ragas_testset.iterrows():
        query = row['question']
        contexts = row['contexts']
        expected_ids = [str(uuid.uuid4()) for _ in contexts]
        expected_texts = contexts
        llamaindex_testset.append({
            'query': query,
            'expected_ids': expected_ids,
            'expected_texts': expected_texts,
            'retrieved_ids': [], # This will be filled during evaluation
            'retrieved_texts': [], # This will be filled during evaluation
            'mode': 'text', # Assuming text mode
            'metric_dict': {} # This will be filled during evaluation
        })
    return llamaindex_testset
# Example usage
ragas_testset = pd.DataFrame({
    'question': ['What is the capital of France?'],
    'contexts': [['Paris is the capital of France.']],
    'ground_truth': ['Paris'],
    'evolution_type': ['static'],
    'metadata': [{}],
    'episode_done': [True]
})
llamaindex_testset = convert_ragas_to_llamaindex(ragas_testset)
print(llamaindex_testset)
Next, prepare the LlamaIndex evaluator to evaluate the generated questions.
from llama_index.core.evaluation import RetrieverEvaluator
from llama_index.core import VectorStoreIndex
# Initialize your vector index and retriever
documents = [...] # Your documents here
vector_index = VectorStoreIndex.from_documents(documents)
retriever = vector_index.as_retriever(similarity_top_k=2)
# Define the metrics you want to use
metrics = ["hit_rate", "mrr", "precision", "recall", "ap", "ndcg"]
# Initialize the evaluator
retriever_evaluator = RetrieverEvaluator.from_metric_names(metrics, retriever=retriever)
Evaluate the generated questions using the LlamaIndex evaluator.
import pandas as pd
# Evaluate the dataset
eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset)
# Convert eval_results to DataFrame
def convert_eval_results_to_dataframe(eval_results):
    data = []
    for result in eval_results:
        data.append({
            'query': result.query,
            'expected_ids': result.expected_ids,
            'expected_texts': result.expected_texts,
            'retrieved_ids': result.retrieved_ids,
            'retrieved_texts': result.retrieved_texts,
            'metric_dict': result.metric_vals_dict
        })
    df = pd.DataFrame(data)
    return df
# Example usage
df = convert_eval_results_to_dataframe(eval_results)
print(df)
# Save the DataFrame to a CSV file
df.to_csv("eval_results.csv", index=False)
This setup allows you to convert the RAGAS test set to a format compatible with LlamaIndex, evaluate it using the LlamaIndex evaluator, and save the results in a DataFrame [1][2][3].
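One caveat worth flagging: freshly generated UUIDs will not equal the IDs of the nodes stored in the index, so ID-based metrics such as hit rate would likely come out as zero. A possible alternative is to look up each RAGAS context in the indexed node texts and use the matching node IDs as expected_ids. The sketch below assumes the contexts are verbatim (or near-verbatim) excerpts of node text; ids_for_contexts and node_texts are hypothetical names, and in practice node_texts might be built with something like {node.node_id: node.get_content() for node in nodes}:

```python
def ids_for_contexts(contexts, node_texts):
    """Return IDs of nodes whose text contains, or is contained in, a context string."""
    matched = []
    for context in contexts:
        needle = " ".join(context.split())  # normalize whitespace
        for node_id, text in node_texts.items():
            haystack = " ".join(text.split())
            if not needle or not haystack:
                continue
            if (needle in haystack or haystack in needle) and node_id not in matched:
                matched.append(node_id)
    return matched


# Invented node texts keyed by node ID
node_texts = {
    "node_0": "Paris is the capital of France. It lies on the Seine.",
    "node_1": "Berlin is the capital of Germany.",
}
expected_ids = ids_for_contexts(["Paris is the capital of France."], node_texts)
print(expected_ids)
```

Contexts that match no node come back with an empty list, which is a useful signal that the test set and the index were built from different chunks.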
@dosu getting below error: eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset)
File /anaconda/lib/python3.10/site-packages/llama_index/core/evaluation/retrieval/base.py:188, in BaseRetrievalEvaluator.aevaluate_dataset(self, dataset, workers, show_progress, **kwargs)
AttributeError: 'list' object has no attribute 'mode'
To resolve the AttributeError: 'list' object has no attribute 'mode' when running eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset), ensure that the dataset object passed to aevaluate_dataset is an instance of EmbeddingQAFinetuneDataset and has a mode attribute. The error indicates that dataset is a list, but it should be an instance of EmbeddingQAFinetuneDataset with a mode attribute.
Here's a snippet to illustrate the correct usage:
from llama_index.core.evaluation import EmbeddingQAFinetuneDataset
from llama_index.core.evaluation.retrieval.base import RetrievalEvalMode
# Wrap the list of dicts in an EmbeddingQAFinetuneDataset, which has
# the 'mode' attribute that aevaluate_dataset reads
items = llamaindex_testset
llamaindex_testset = EmbeddingQAFinetuneDataset(
    queries={item['query']: item['query'] for item in items},
    # corpus (doc ID -> text) is filled in on the assumption that the model requires it
    corpus={
        doc_id: text
        for item in items
        for doc_id, text in zip(item['expected_ids'], item['expected_texts'])
    },
    relevant_docs={item['query']: item['expected_ids'] for item in items},
    mode=RetrievalEvalMode.TEXT # or RetrievalEvalMode.IMAGE, depending on your dataset
)
# Now run the evaluation
eval_results = await retriever_evaluator.aevaluate_dataset(llamaindex_testset)
Make sure that llamaindex_testset is properly initialized and has the mode attribute set to either RetrievalEvalMode.TEXT or RetrievalEvalMode.IMAGE [1][2].
Question Validation
Question
When evaluating a RAG retrieval service using the llama-index evaluation method, I encountered the following issues:
In the first step, we generate question-context pairs using the code:
However, the returned eval_result is an EmbeddingQAFinetuneDataset object containing 'queries', 'corpus', and 'relevant_docs'. I don't see any 'context' in this eval_result. How is the evaluation being performed using only the 'queries'? Is there any way to generate ground_truth along with context?
When using RetrieverEvaluator, I noticed it supports only 'hit_rate' and 'mrr'. Attempting to use 'precision' and 'recall' results in the error: ValueError: Invalid metric name: precision. Could you confirm which version of llama-index supports all metrics, including precision and recall? For reference, I am using the following versions: