run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: LLMs evaluated with LlamaIndex don't provide scores for certain metrics. #11567

Closed bastienpo closed 2 months ago

bastienpo commented 6 months ago

Bug Description

LlamaIndex crashes when evaluating some large language models on specific metrics (e.g., answer correctness). This happens because those models sometimes don't include a score in their outputs.

Currently, our temporary solution, a try-except block, prevents the crash, but a more robust approach may be required.

Implementing proper error handling for missing scores seems like a good approach, or providing some kind of feedback to developers about the missing score could also be valuable. I'm open to any other solutions you think could work in this case.
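
For concreteness, here is a minimal sketch of the kind of try-except workaround described above. It assumes the stock CorrectnessEvaluator (which falls back to Settings.llm); the query/response/reference values are made-up placeholders, not the author's data:

```python
# Minimal sketch of the try-except workaround described above (not the author's
# exact code). Assumes the stock CorrectnessEvaluator; the query/response/
# reference values are placeholders.
from llama_index.core.evaluation import CorrectnessEvaluator

evaluator = CorrectnessEvaluator()

try:
    result = evaluator.evaluate(
        query="What is the capital of France?",
        response="Paris.",
        reference="Paris is the capital of France.",
    )
    print(result.score, result.passing)
except ValueError:
    # The judge LLM returned output without a parsable score (e.g. score_str == ''),
    # so skip or log this sample instead of crashing the whole evaluation run.
    print("Evaluation skipped: no score found in model output.")
```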

Version

v0.10.1

Steps to Reproduce

This issue is difficult to reproduce because it's not always deterministic due to its reliance on a large language model (LLM).

Hyperparameter causing the problem:

Data used to reproduce the bug: Unfortunately, I cannot share the data due to confidentiality.

The error appears to be caused by the line `score_str = ''` (empty string):

https://github.com/run-llama/llama_index/blob/d97cec8ade50f3197f09cc380676c6e4f5288439/llama-index-core/llama_index/core/evaluation/eval_utils.py#L183
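
One possible direction, sketched below: a more forgiving parser that returns (None, None) instead of raising when no score line is present. This is an illustration, not the library's actual eval_utils.py code; the two-part output format assumed here (score on the first line, reasoning afterwards) mirrors the default correctness parser, and the function name is purely hypothetical. If your version of CorrectnessEvaluator exposes a parser_function hook, something like this could be passed in.

```python
# Illustrative sketch only, not the library's actual eval_utils.py code.
# Returns (None, None) instead of raising when the LLM output has no score line.
from typing import Optional, Tuple


def tolerant_score_parser(eval_response: str) -> Tuple[Optional[float], Optional[str]]:
    text = eval_response.strip()
    if not text:
        return None, None
    first_line, _, rest = text.partition("\n")
    try:
        score = float(first_line.strip())
    except ValueError:
        # No numeric score found; surface the raw output as feedback instead of crashing.
        return None, text
    return score, rest.strip()
```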

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 6 months ago

Hey @Skower! Great to see you back around here. Hope you're doing well!

Thank you for bringing this issue to our attention. It seems like the problem arises when LlamaIndex tries to evaluate models that do not provide scores in their outputs. Your temporary solution of using a try-except block is a good start, but as you mentioned, a more robust solution is needed.

Based on the context provided, one approach could be to modify the evaluator to handle cases where scores are not available. This can be achieved by implementing a custom evaluator that doesn't rely on scores for its logic.

Here's a conceptual modification to the MockEvaluator class to illustrate handling evaluations without relying on scores:

from typing import Any, Optional, Sequence

from llama_index.core.base.response.schema import Response
from llama_index.core.evaluation import BaseEvaluator
from llama_index.core.evaluation.base import EvaluationResult
from llama_index.core.prompts.mixin import PromptDictType

class CustomEvaluator(BaseEvaluator):
    def __init__(
        self,
        expected_answer: str,
        mock_passing: bool = True,
        mock_feedback: str = "test feedback",
    ) -> None:
        self.expected_answer = expected_answer
        self._mock_passing = mock_passing
        self._mock_feedback = mock_feedback

    def _get_prompts(self) -> PromptDictType:
        """Get prompts."""
        return {}

    def _update_prompts(self, prompts: PromptDictType) -> None:
        """Update prompts."""

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        # Example logic to determine passing based on matching an expected answer
        passing = response == self.expected_answer
        score = 1.0 if passing else 0.0  # Optionally assign scores based on correctness

        return EvaluationResult(
            query=query,
            contexts=contexts,
            response=response,
            passing=passing,
            score=score,
            feedback=self._mock_feedback if passing else "Incorrect answer",
        )

This modification introduces an expected_answer parameter to the evaluator and changes the evaluation logic to determine passing based on whether the response matches the expected answer. This approach does not rely on scores from the model outputs and can be further customized to handle more complex evaluation criteria.
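
A hypothetical usage sketch of the evaluator above (the query and expected answer are made up; calling the async method through asyncio.run is just one way to invoke it):

```python
import asyncio

# Hypothetical usage of the CustomEvaluator sketched above.
evaluator = CustomEvaluator(expected_answer="Paris")

result = asyncio.run(
    evaluator.aevaluate(
        query="What is the capital of France?",
        response="Paris",
    )
)
print(result.passing, result.score, result.feedback)  # True 1.0 test feedback
```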

You can find more information about the MockEvaluator class in the llama-index-core/tests/evaluation/test_base.py file.

Regarding the exact error message or exception that is thrown when LlamaIndex crashes due to missing scores in the model outputs, unfortunately, I wasn't able to find an answer within the repository. It would be helpful if you could provide more details about the error message or traceback you're encountering.

I hope this helps! If you have any further questions or need more clarification, please don't hesitate to ask.

cmishra commented 5 months ago

I threw in some retry logic around my .evaluate call, because I'm not quite familiar enough with LlamaIndex's abstractions to do much better at the moment.
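
For anyone doing the same, here is a rough sketch of what such a retry wrapper could look like (plain loop, no extra dependencies; the evaluator, inputs, and retry count are placeholders):

```python
# Rough sketch of retry logic around evaluator.evaluate, as described above.
# The evaluator, inputs, and max_retries value are placeholders.
def evaluate_with_retries(evaluator, query, response, reference, max_retries=3):
    last_error = None
    for _ in range(max_retries):
        try:
            return evaluator.evaluate(
                query=query, response=response, reference=reference
            )
        except ValueError as exc:  # e.g. empty score_str in the judge output
            last_error = exc
    # All attempts failed to produce a parsable score; re-raise the last error.
    raise last_error
```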

Are there any examples that show how to combine this with a Pydantic parser and the self-correcting query engines, so it can be done inside LlamaIndex's abstractions? After all, what is the correctness evaluator if not another query over two contexts (the answer and the reference)?