Closed · peguerosdc closed this 2 months ago
🤖
The issue you're experiencing seems to be due to the `BatchEvalRunner` not properly handling exceptions that occur during the evaluation process. This can be fixed by adding a try/except block around the evaluation calls in the `aevaluate_response_strs`, `aevaluate_responses`, and `aevaluate_queries` methods.
Here is the updated code for the `aevaluate_response_strs` method:
```python
async def aevaluate_response_strs(
    self,
    queries: Optional[List[str]] = None,
    response_strs: Optional[List[str]] = None,
    contexts_list: Optional[List[List[str]]] = None,
    **eval_kwargs_lists: List,
) -> Dict[str, List[EvaluationResult]]:
    """Evaluate query, response pairs.

    This evaluates queries, responses, contexts as string inputs.
    Can supply additional kwargs to the evaluator in eval_kwargs_lists.

    Args:
        queries (Optional[List[str]]): List of query strings. Defaults to None.
        response_strs (Optional[List[str]]): List of response strings.
            Defaults to None.
        contexts_list (Optional[List[List[str]]]): List of context lists.
            Defaults to None.
        **eval_kwargs_lists (Dict[str, Any]): Dict of lists of kwargs to
            pass to evaluator. Defaults to None.
    """
    queries, response_strs, contexts_list = self._validate_and_clean_inputs(
        queries, response_strs, contexts_list
    )
    for k in eval_kwargs_lists:
        v = eval_kwargs_lists[k]
        if not isinstance(v, list):
            raise ValueError(
                f"Each value in eval_kwargs must be a list. Got {k}: {v}"
            )
        eval_kwargs_lists[k] = self._validate_and_clean_inputs(v)[0]

    # run evaluations
    eval_jobs = []
    for idx, query in enumerate(cast(List[str], queries)):
        response_str = cast(List, response_strs)[idx]
        contexts = cast(List, contexts_list)[idx]
        eval_kwargs = self._get_eval_kwargs(eval_kwargs_lists, idx)
        for name, evaluator in self.evaluators.items():
            eval_jobs.append(
                eval_worker(
                    self.semaphore,
                    evaluator,
                    name,
                    query=query,
                    response_str=response_str,
                    contexts=contexts,
                    eval_kwargs=eval_kwargs,
                )
            )
    try:
        results = await self.asyncio_mod.gather(*eval_jobs)
    except Exception as e:
        print(f"An error occurred during evaluation: {e}")
        results = []

    # format results
    return self._format_results(results)
```
You would need to add similar try/except blocks to the `aevaluate_responses` and `aevaluate_queries` methods as well. This will ensure that any exceptions that occur during the evaluation process are caught and handled, preventing the loss of all results.
You can find the original implementation of the `BatchEvalRunner` class in the LlamaIndex repository.
Please let me know if this solution works for you or if you need further assistance.
@dosu-bot if I follow your approach, all the results would still be lost, because you are catching the exception around the whole execution rather than at every step. Instead, what you need to do is catch the exception in the `aevaluate` method of each evaluator's class, like in here. A better solution would be to catch it directly in the `BaseEvaluator` class so that it is applied effortlessly to every evaluator.
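In the meantime, the per-evaluator variant of this idea can be applied without patching the library by wrapping an existing evaluator. The sketch below is only an illustration of that approach: `SafeEvaluator` is a hypothetical name, and it assumes the 0.10.x `BaseEvaluator`/`PromptMixin` interface (an abstract `aevaluate` plus the `_get_prompts`/`_get_prompt_modules`/`_update_prompts` hooks).

```python
from typing import Any, Optional, Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult


class SafeEvaluator(BaseEvaluator):
    """Hypothetical wrapper that delegates to an inner evaluator and turns
    exceptions into a non-passing EvaluationResult instead of raising."""

    def __init__(self, evaluator: BaseEvaluator) -> None:
        self._evaluator = evaluator

    # Minimal PromptMixin plumbing: delegate to the wrapped evaluator.
    def _get_prompts(self) -> dict:
        return self._evaluator._get_prompts()

    def _get_prompt_modules(self) -> dict:
        return self._evaluator._get_prompt_modules()

    def _update_prompts(self, prompts: dict) -> None:
        self._evaluator._update_prompts(prompts)

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        try:
            return await self._evaluator.aevaluate(
                query=query, response=response, contexts=contexts, **kwargs
            )
        except Exception as exc:  # e.g. JSON parsing failures in GuidelineEvaluator
            return EvaluationResult(
                query=query,
                response=response,
                passing=False,
                feedback=f"Evaluation failed: {exc}",
            )
```

Each evaluator passed to `BatchEvalRunner` can then be wrapped, e.g. `{"guideline": SafeEvaluator(GuidelineEvaluator(...))}`, so a single flaky metric no longer aborts the whole batch.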
🤖
Thank you for bringing this issue to our attention. It seems like a good idea to add exception handling to the `aevaluate` method in the `BaseEvaluator` class to prevent the loss of all results when a metric fails to compute. Here's how you can modify the `aevaluate` method in the `BaseEvaluator` class:
```python
from abc import ABC, abstractmethod
from typing import Any, Optional, Sequence

from llama_index.core.evaluation import EvaluationResult


class BaseEvaluator(ABC):
    @abstractmethod
    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        try:
            # Your evaluation logic here
            raise NotImplementedError
        except Exception as e:
            return EvaluationResult(
                invalid_result=True,
                invalid_reason=str(e),
            )
```
In this code, if an exception occurs during the evaluation, it will be caught and an `EvaluationResult` object will be returned with `invalid_result` set to `True` and `invalid_reason` set to the exception message. This way, you can prevent the loss of all results when a metric fails to compute.
Please note that you should replace `# Your evaluation logic here` with your actual evaluation logic. The `raise NotImplementedError` is just a placeholder and should be removed.
This change should be made in the `BaseEvaluator` class in the `llama_index/core/evaluation/base.py` file in your local copy of the LlamaIndex library. After making this change, you should be able to catch exceptions in the `aevaluate` method of each evaluator's class, preventing the loss of all results when a metric fails to compute.
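If failures are surfaced as `invalid_result`/`invalid_reason`, downstream code can separate them from the good results instead of losing everything. A minimal sketch, assuming `eval_results` is the `Dict[str, List[EvaluationResult]]` returned by `BatchEvalRunner`:

```python
# eval_results: Dict[str, List[EvaluationResult]] returned by the runner
for metric_name, results in eval_results.items():
    valid = [r for r in results if not r.invalid_result]
    failed = [r for r in results if r.invalid_result]
    print(f"{metric_name}: {len(valid)} valid, {len(failed)} failed")
    for r in failed:
        print(f"  query={r.query!r} failed: {r.invalid_reason}")
```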
Please let me know if this solution works for you or if you need further assistance.
@dosu-bot you got the idea! But the problem with your implementation is that classes that extend `BaseEvaluator` will override the abstract method `aevaluate`, and the try/except logic would be lost. You can find an example by looking into `GuidelineEvaluator`.
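To spell out why that happens (a simplified, self-contained sketch, not the real `GuidelineEvaluator` code): once a subclass supplies its own `aevaluate`, the body of the abstract base method, including any try/except placed there, is never executed unless the subclass explicitly calls `super().aevaluate()`.

```python
import asyncio
from abc import ABC, abstractmethod


class Base(ABC):
    @abstractmethod
    async def aevaluate(self) -> str:
        # Any try/except written here is dead code for normal subclasses.
        try:
            raise NotImplementedError
        except Exception as e:
            return f"caught: {e}"


class GuidelineLike(Base):
    async def aevaluate(self) -> str:
        # Fully replaces the base body, so this exception (think: flaky
        # JSON parsing of an LLM response) propagates to the caller.
        raise ValueError("could not parse LLM output")


asyncio.run(GuidelineLike().aevaluate())  # raises ValueError, base try/except never runs
```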
🤖
Thank you for providing the detailed information. It seems like you're encountering an issue where an exception in the `aevaluate` method of a subclass of `BaseEvaluator` causes the loss of all results. This is indeed a problem if the exception is not handled properly.
To address this, you can modify the `BaseEvaluator` class to include a wrapper method around `aevaluate` that contains the try-except logic. This wrapper method will call the `aevaluate` method, which can be overridden in subclasses without affecting the try-except logic.
Here's how you can do it:
```python
class BaseEvaluator(PromptMixin):
    """Base Evaluator class."""

    # ...

    async def aevaluate_wrapper(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Wrapper for aevaluate method with try-except logic."""
        try:
            return await self.aevaluate(
                query=query,
                response=response,
                contexts=contexts,
                **kwargs,
            )
        except Exception as e:
            return EvaluationResult(
                invalid_result=True,
                invalid_reason=str(e),
            )

    @abstractmethod
    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        raise NotImplementedError

    # ...

    def evaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        return asyncio.run(
            self.aevaluate_wrapper(
                query=query,
                response=response,
                contexts=contexts,
                **kwargs,
            )
        )
```
In this code, the `aevaluate_wrapper` method includes a try-except block that catches any exceptions thrown by the `aevaluate` method. If an exception is caught, it returns an `EvaluationResult` with `invalid_result` set to `True` and `invalid_reason` set to the string representation of the exception. The `evaluate` method now calls `aevaluate_wrapper` instead of `aevaluate`, preserving the try-except logic even when `aevaluate` is overridden in subclasses.
This change should help to prevent the loss of all results when a metric fails to compute. Please let me know if you have any questions or need further assistance.
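As a sanity check of the pattern itself, here is a stripped-down, runnable sketch that stands on its own (it uses a simplified stand-in for `EvaluationResult` and a hypothetical `FailingEvaluator`, not the actual library classes):

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvaluationResult:
    # Simplified stand-in for the library's EvaluationResult.
    score: Optional[float] = None
    invalid_result: bool = False
    invalid_reason: Optional[str] = None


class BaseEvaluator(ABC):
    async def aevaluate_wrapper(self, **kwargs) -> EvaluationResult:
        # Centralized try/except: subclasses may raise freely.
        try:
            return await self.aevaluate(**kwargs)
        except Exception as e:
            return EvaluationResult(invalid_result=True, invalid_reason=str(e))

    @abstractmethod
    async def aevaluate(self, **kwargs) -> EvaluationResult:
        ...


class FailingEvaluator(BaseEvaluator):
    async def aevaluate(self, **kwargs) -> EvaluationResult:
        raise ValueError("could not parse LLM output as JSON")


result = asyncio.run(FailingEvaluator().aevaluate_wrapper(query="test"))
print(result.invalid_result, result.invalid_reason)
# True could not parse LLM output as JSON
```

Note that for the batch runner to benefit, its workers would also have to call `aevaluate_wrapper` (or `evaluate`) rather than `aevaluate` directly.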
caught this bug too. hopefully this can be fixed
@peguerosdc Did you manage to find a way to resolve this?
Bug Description
When running `BatchEvalRunner` on a large number of test cases and metrics, if one of them fails (I found it common to happen with `GuidelineEvaluator` due to its unreliable JSON parsing), the exception is not caught and you lose all the other results, resulting in a loss of time and money.
Version
0.10.4
Steps to Reproduce
It doesn't always fail, so you might need to run it a couple of times.
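For reference, a rough reproduction sketch (the data directory, queries, and guidelines below are placeholders, and the constructor arguments reflect my understanding of the 0.10.x API, not the exact script I ran):

```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.evaluation import BatchEvalRunner, GuidelineEvaluator

# Placeholder corpus and queries; any reasonably large set will do.
documents = SimpleDirectoryReader("./data").load_data()
query_engine = VectorStoreIndex.from_documents(documents).as_query_engine()
queries = [f"Question {i} about the corpus?" for i in range(50)]

runner = BatchEvalRunner(
    {
        "guideline": GuidelineEvaluator(
            guidelines="The response should be concise and fully answer the query."
        )
    },
    workers=8,
    show_progress=True,
)

# If GuidelineEvaluator fails to parse the LLM output for even one query,
# the exception propagates and every other result is lost.
eval_results = runner.evaluate_queries(query_engine, queries=queries)
```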
Relevant Logs/Tracebacks