run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai

[Bug]: BatchEvalRunner not catching exception when a metric fails to be computed #10764

Closed peguerosdc closed 2 months ago

peguerosdc commented 9 months ago

Bug Description

When running BatchEvalRunner on a large number of test cases and metrics, if one of them fails (I found this to be common with GuidelineEvaluator due to its unreliable JSON parsing), the exception is not caught and all the other results are lost, wasting both time and money.

Version

0.10.4

Steps to Reproduce

It doesn't always fail, so you might need to run it a couple of times.

from llama_index.core.evaluation import GuidelineEvaluator
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.llms.openai import OpenAI

llm = OpenAI("gpt-4")

runner = BatchEvalRunner(
    {
     "my_guideline": GuidelineEvaluator(llm=llm, guidelines="The response should fully answer the query.")
     },
    workers=2,
    show_progress=True,
)

eval_results = await runner.aevaluate_response_strs(
    queries=["Limite de credito\n?\nOi"],
    response_strs=['Olá! Para ajustar o limite de crédito disponível no seu cartão, você precisa seguir os seguintes passos:\n\n1. Clique na opção: **Cartão de Crédito**, na tela inicial do aplicativo;\n2. Selecione: **Meus Limites**;\n3. Clique para digitar o valor ou mova o marcador roxo até o valor desejado, dentro do limite total do cartão.\n\nLembrando que **não realizamos análise e liberação de limite de crédito nos canais de atendimento**.\n\nA soma do seu **Limite Disponível** mais o **Valor Antecipado** indicará o limite total que você possui no momento. Conforme você realizar novas compras, esse limite será consumido até acabar. Após isso, as compras seguintes consumirão de seu limite normal. \n\nEntre o período das 20h até as 6h, o limite de pagamento é de R$1.000,00 de acordo com a resolução número 142 do Bacen.\n\nCaso sua dúvida seja sobre antecipação de parcelas de financiamentos, você pode acessar o tópico “” no "Me Ajuda".\n\nVocê gostaria de ser transferido para um agente agora?'],
)

Relevant Logs/Tracebacks

{
    "name": "ValidationError",
    "message": "1 validation error for EvaluationData
__root__
  Expecting ',' delimiter: line 1 column 349 (char 348) (type=value_error.jsondecode; msg=Expecting ',' delimiter; doc={\"passing\": false, \"feedback\": \"The response is detailed and provides a step-by-step guide on how to adjust the credit limit, which is helpful. However, the response fails to fully answer the query as it does not clarify what 'Limite de credito' means. The response also includes a placeholder “” in the sentence 'você pode acessar o tópico “” no \"Me Ajuda\"', which should be replaced with relevant information. Lastly, the offer to transfer to an agent seems unnecessary as the query was not a request for a live agent.\"}; pos=348; lineno=1; colno=349)",
    "stack": "---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/main.py:539, in pydantic.main.BaseModel.parse_raw()

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/parse.py:37, in pydantic.parse.load_str_bytes()

File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/__init__.py:346, in loads(s, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    343 if (cls is None and object_hook is None and
    344         parse_int is None and parse_float is None and
    345         parse_constant is None and object_pairs_hook is None and not kw):
--> 346     return _default_decoder.decode(s)
    347 if cls is None:

File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/decoder.py:337, in JSONDecoder.decode(self, s, _w)
    333 \"\"\"Return the Python representation of ``s`` (a ``str`` instance
    334 containing a JSON document).
    335 
    336 \"\"\"
--> 337 obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    338 end = _w(s, end).end()

File ~/miniforge3/envs/project-evaluation/lib/python3.9/json/decoder.py:353, in JSONDecoder.raw_decode(self, s, idx)
    352 try:
--> 353     obj, end = self.scan_once(s, idx)
    354 except StopIteration as err:

JSONDecodeError: Expecting ',' delimiter: line 1 column 349 (char 348)

During handling of the above exception, another exception occurred:

ValidationError                           Traceback (most recent call last)
Cell In[8], line 15
      5 llm = OpenAI(\"gpt-4\")
      7 runner = BatchEvalRunner(
      8     {
      9      \"my_guideline\": GuidelineEvaluator(llm=llm, guidelines=\"The response should fully answer the query.\")
   (...)
     12     show_progress=True,
     13 )
---> 15 eval_results = await runner.aevaluate_response_strs(
     16     queries=[\"Limite de credito\
?\
Oi\"],
     17     response_strs=['Olá! Para ajustar o limite de crédito disponível no seu cartão, você precisa seguir os seguintes passos:\
\
1. Clique na opção: **Cartão de Crédito**, na tela inicial do aplicativo;\
2. Selecione: **Meus Limites**;\
3. Clique para digitar o valor ou mova o marcador roxo até o valor desejado, dentro do limite total do cartão.\
\
Lembrando que **não realizamos análise e liberação de limite de crédito nos canais de atendimento**.\
\
A soma do seu **Limite Disponível** mais o **Valor Antecipado** indicará o limite total que você possui no momento. Conforme você realizar novas compras, esse limite será consumido até acabar. Após isso, as compras seguintes consumirão de seu limite normal. \
\
Entre o período das 20h até as 6h, o limite de pagamento é de R$1.000,00 de acordo com a resolução número 142 do Bacen.\
\
Caso sua dúvida seja sobre antecipação de parcelas de financiamentos, você pode acessar o tópico “” no \"Me Ajuda\".\
\
Você gostaria de ser transferido para um agente agora?'],
     18 )

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/batch_runner.py:188, in BatchEvalRunner.aevaluate_response_strs(self, queries, response_strs, contexts_list, **eval_kwargs_lists)
    176     for name, evaluator in self.evaluators.items():
    177         eval_jobs.append(
    178             eval_worker(
    179                 self.semaphore,
   (...)
    186             )
    187         )
--> 188 results = await self.asyncio_mod.gather(*eval_jobs)
    190 # Format results
    191 return self._format_results(results)

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:79, in tqdm_asyncio.gather(cls, loop, timeout, total, *fs, **tqdm_kwargs)
     76     return i, await f
     78 ifs = [wrap_awaitable(i, f) for i, f in enumerate(fs)]
---> 79 res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
     80                                          total=total, **tqdm_kwargs)]
     81 return [i for _, i in sorted(res)]

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:79, in <listcomp>(.0)
     76     return i, await f
     78 ifs = [wrap_awaitable(i, f) for i, f in enumerate(fs)]
---> 79 res = [await f for f in cls.as_completed(ifs, loop=loop, timeout=timeout,
     80                                          total=total, **tqdm_kwargs)]
     81 return [i for _, i in sorted(res)]

File ~/miniforge3/envs/project-evaluation/lib/python3.9/asyncio/tasks.py:611, in as_completed.<locals>._wait_for_one()
    608 if f is None:
    609     # Dummy value from _on_timeout().
    610     raise exceptions.TimeoutError
--> 611 return f.result()

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/tqdm/asyncio.py:76, in tqdm_asyncio.gather.<locals>.wrap_awaitable(i, f)
     75 async def wrap_awaitable(i, f):
---> 76     return i, await f

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/batch_runner.py:43, in eval_worker(semaphore, evaluator, evaluator_name, query, response_str, contexts, eval_kwargs)
     39 eval_kwargs = eval_kwargs or {}
     40 async with semaphore:
     41     return (
     42         evaluator_name,
---> 43         await evaluator.aevaluate(
     44             query=query, response=response_str, contexts=contexts, **eval_kwargs
     45         ),
     46     )

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/evaluation/guideline.py:115, in GuidelineEvaluator.aevaluate(***failed resolving arguments***)
    107 await asyncio.sleep(sleep_time_in_seconds)
    109 eval_response = await self._llm.apredict(
    110     self._eval_template,
    111     query=query,
    112     response=response,
    113     guidelines=self._guidelines,
    114 )
--> 115 eval_data = self._output_parser.parse(eval_response)
    116 eval_data = cast(EvaluationData, eval_data)
    118 return EvaluationResult(
    119     query=query,
    120     response=response,
   (...)
    123     feedback=eval_data.feedback,
    124 )

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/llama_index/core/output_parsers/pydantic.py:62, in PydanticOutputParser.parse(self, text)
     60 \"\"\"Parse, validate, and correct errors programmatically.\"\"\"
     61 json_str = extract_json_str(text)
---> 62 return self._output_cls.parse_raw(json_str)

File ~/miniforge3/envs/project-evaluation/lib/python3.9/site-packages/pydantic/main.py:548, in pydantic.main.BaseModel.parse_raw()

ValidationError: 1 validation error for EvaluationData
__root__
  Expecting ',' delimiter: line 1 column 349 (char 348) (type=value_error.jsondecode; msg=Expecting ',' delimiter; doc={\"passing\": false, \"feedback\": \"The response is detailed and provides a step-by-step guide on how to adjust the credit limit, which is helpful. However, the response fails to fully answer the query as it does not clarify what 'Limite de credito' means. The response also includes a placeholder “” in the sentence 'você pode acessar o tópico “” no \"Me Ajuda\"', which should be replaced with relevant information. Lastly, the offer to transfer to an agent seems unnecessary as the query was not a request for a live agent.\"}; pos=348; lineno=1; colno=349)"
}
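
Until this is addressed in the library, one user-side workaround is to wrap each evaluator so that exceptions become invalid results instead of aborting the whole batch. A minimal sketch, assuming BatchEvalRunner only ever calls aevaluate on the evaluators it is given; the ExceptionSafeEvaluator name below is hypothetical, not part of llama_index:

from typing import Any, Optional, Sequence

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult


class ExceptionSafeEvaluator:
    """Hypothetical wrapper: delegates to an inner evaluator and converts any
    exception into an invalid EvaluationResult instead of crashing the batch."""

    def __init__(self, evaluator: BaseEvaluator) -> None:
        self._evaluator = evaluator

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        try:
            return await self._evaluator.aevaluate(
                query=query, response=response, contexts=contexts, **kwargs
            )
        except Exception as e:  # e.g. the ValidationError in the traceback above
            return EvaluationResult(
                query=query,
                response=response,
                invalid_result=True,
                invalid_reason=str(e),
            )

With this in place, the runner above could be built with ExceptionSafeEvaluator(GuidelineEvaluator(llm=llm, guidelines=...)) as the evaluator, so a single parsing failure shows up as an invalid result rather than a lost run.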
dosubot[bot] commented 9 months ago

🤖

The issue you're experiencing seems to be due to the BatchEvalRunner not properly handling exceptions that occur during the evaluation process. This can be fixed by adding a try/except block around the evaluation calls in the aevaluate_response_strs, aevaluate_responses, and aevaluate_queries methods.

Here is the updated code for the aevaluate_response_strs method:

async def aevaluate_response_strs(
    self,
    queries: Optional[List[str]] = None,
    response_strs: Optional[List[str]] = None,
    contexts_list: Optional[List[List[str]]] = None,
    **eval_kwargs_lists: List,
) -> Dict[str, List[EvaluationResult]]:
    """Evaluate query, response pairs.

    This evaluates queries, responses, contexts as string inputs.
    Can supply additional kwargs to the evaluator in eval_kwargs_lists.

    Args:
        queries (Optional[List[str]]): List of query strings. Defaults to None.
        response_strs (Optional[List[str]]): List of response strings.
            Defaults to None.
        contexts_list (Optional[List[List[str]]]): List of context lists.
            Defaults to None.
        **eval_kwargs_lists (Dict[str, Any]): Dict of lists of kwargs to
            pass to evaluator. Defaults to None.
    """
    queries, response_strs, contexts_list = self._validate_and_clean_inputs(
        queries, response_strs, contexts_list
    )
    for k in eval_kwargs_lists:
        v = eval_kwargs_lists[k]
        if not isinstance(v, list):
            raise ValueError(
                f"Each value in eval_kwargs must be a list. Got {k}: {v}"
            )
        eval_kwargs_lists[k] = self._validate_and_clean_inputs(v)[0]

    # run evaluations
    eval_jobs = []
    for idx, query in enumerate(cast(List[str], queries)):
        response_str = cast(List, response_strs)[idx]
        contexts = cast(List, contexts_list)[idx]
        eval_kwargs = self._get_eval_kwargs(eval_kwargs_lists, idx)
        for name, evaluator in self.evaluators.items():
            eval_jobs.append(
                eval_worker(
                    self.semaphore,
                    evaluator,
                    name,
                    query=query,
                    response_str=response_str,
                    contexts=contexts,
                    eval_kwargs=eval_kwargs,
                )
            )
    try:
        results = await self.asyncio_mod.gather(*eval_jobs)
    except Exception as e:
        print(f"An error occurred during evaluation: {e}")
        results = []

    # Format results
    return self._format_results(results)

You would need to add similar try/except blocks to the aevaluate_responses and aevaluate_queries methods as well. This will ensure that any exceptions that occur during the evaluation process are caught and handled, preventing the loss of all results.

You can find the original implementation of the BatchEvalRunner class in the LlamaIndex repository.

Please let me know if this solution works for you or if you need further assistance.

peguerosdc commented 9 months ago

@dosu-bot if I follow your approach, all the results would still be lost, because you are catching the exception around the whole execution rather than at every step.

Instead, what you need to do is catch the exception in the aevaluate method of each evaluator's class, like here.

A better solution would be to catch it directly in the BaseEvaluator class so that it is applied to every evaluator with no extra effort.
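
For illustration, "catching at every step" could also live in eval_worker inside batch_runner.py, so a single failing job yields an invalid result while the rest of the gather still completes. A hedged sketch of such a change, assuming the eval_worker signature shown in the traceback above (this is not the library's current code):

import asyncio
from typing import Any, Dict, Optional, Sequence, Tuple

from llama_index.core.evaluation import BaseEvaluator, EvaluationResult


async def eval_worker(
    semaphore: asyncio.Semaphore,
    evaluator: BaseEvaluator,
    evaluator_name: str,
    query: Optional[str] = None,
    response_str: Optional[str] = None,
    contexts: Optional[Sequence[str]] = None,
    eval_kwargs: Optional[Dict[str, Any]] = None,
) -> Tuple[str, EvaluationResult]:
    eval_kwargs = eval_kwargs or {}
    async with semaphore:
        try:
            return (
                evaluator_name,
                await evaluator.aevaluate(
                    query=query, response=response_str, contexts=contexts, **eval_kwargs
                ),
            )
        except Exception as e:
            # Surface the failure as an invalid result so the other jobs still count.
            return (
                evaluator_name,
                EvaluationResult(
                    query=query,
                    response=response_str,
                    invalid_result=True,
                    invalid_reason=str(e),
                ),
            )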

dosubot[bot] commented 9 months ago

🤖

Thank you for bringing this issue to our attention. It seems like a good idea to add exception handling to the aevaluate method in the BaseEvaluator class to prevent the loss of all results when a metric fails to compute. Here's how you can modify the aevaluate method in the BaseEvaluator class:

from abc import ABC, abstractmethod
from typing import Any, Optional, Sequence
from llama_index.core.evaluation import EvaluationResult

class BaseEvaluator(ABC):
    @abstractmethod
    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        try:
            # Your evaluation logic here
            raise NotImplementedError
        except Exception as e:
            return EvaluationResult(
                invalid_result=True,
                invalid_reason=str(e),
            )

In this code, if an exception occurs during the evaluation, it will be caught and an EvaluationResult object will be returned with invalid_result set to True and invalid_reason set to the exception message. This way, you can prevent the loss of all results when a metric fails to compute.

Please note that you should replace # Your evaluation logic here with your actual evaluation logic. The raise NotImplementedError is just a placeholder and should be removed.

This change should be made in the BaseEvaluator class in the llama_index/core/evaluation/base.py file in your local copy of the LlamaIndex library. After making this change, you should be able to catch exceptions in the aevaluate method of each evaluator's class, preventing the loss of all results when a metric fails to compute.

Please let me know if this solution works for you or if you need further assistance.

peguerosdc commented 9 months ago

@dosu-bot you got the idea! But the problem with your implementation is that classes extending BaseEvaluator will override the abstract aevaluate method, so the try-except logic would be lost. You can see an example of this by looking at GuidelineEvaluator.

dosubot[bot] commented 9 months ago

🤖

Thank you for providing the detailed information. It seems like you're encountering an issue where an exception in the aevaluate method of a subclass of BaseEvaluator causes the loss of all results. This is indeed a problem if the exception is not handled properly.

To address this, you can modify the BaseEvaluator class to include a wrapper method around aevaluate that includes the try-except logic. This wrapper method will call the aevaluate method, which can be overridden in subclasses without affecting the try-except logic.

Here's how you can do it:

class BaseEvaluator(PromptMixin):
    """Base Evaluator class."""

    # ...

    async def aevaluate_wrapper(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Wrapper for aevaluate method with try-except logic."""
        try:
            return await self.aevaluate(
                query=query,
                response=response,
                contexts=contexts,
                **kwargs,
            )
        except Exception as e:
            return EvaluationResult(
                invalid_result=True,
                invalid_reason=str(e),
            )

    @abstractmethod
    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        raise NotImplementedError

    # ...

    def evaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        """Run evaluation with query string, retrieved contexts,
        and generated response string.

        Subclasses can override this method to provide custom evaluation logic and
        take in additional arguments.
        """
        return asyncio.run(
            self.aevaluate_wrapper(
                query=query,
                response=response,
                contexts=contexts,
                **kwargs,
            )
        )

In this code, the aevaluate_wrapper method includes a try-except block that catches any exceptions thrown by the aevaluate method. If an exception is caught, it returns an EvaluationResult with invalid_result set to True and invalid_reason set to the string representation of the exception. The evaluate method now calls aevaluate_wrapper instead of aevaluate, preserving the try-except logic even when aevaluate is overridden in subclasses.

This change should help to prevent the loss of all results when a metric fails to compute. Please let me know if you have any questions or need further assistance.
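
Two follow-up notes on this sketch: BatchEvalRunner's eval_worker (shown in the traceback above) calls evaluator.aevaluate directly, so it would also need to go through aevaluate_wrapper for the batch runner to benefit. And once failures come back as invalid results instead of exceptions, the runner's output (a Dict[str, List[EvaluationResult]], per aevaluate_response_strs above) can be split into usable and failed entries. A small usage sketch, assuming the invalid_result and invalid_reason fields discussed in this thread are populated:

# Split BatchEvalRunner output into usable results and failures.
valid = {
    name: [r for r in results if not r.invalid_result]
    for name, results in eval_results.items()
}
failed = {
    name: [r.invalid_reason for r in results if r.invalid_result]
    for name, results in eval_results.items()
}
print({name: f"{len(valid[name])} ok / {len(failed[name])} failed" for name in eval_results})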

justinzyw commented 6 months ago

Caught this bug too. Hopefully this can be fixed.

chrispe2 commented 5 months ago

@peguerosdc Did you manage to find a way to resolve this?