Closed bastienpo closed 2 months ago
Hey @Skower! Great to see you back around here. Hope you're doing well!
Thank you for bringing this issue to our attention. It seems the problem arises when LlamaIndex tries to evaluate models that do not provide scores in their outputs. Your temporary solution of using a try-except block is a good start, but as you mentioned, a more robust solution is needed.
Based on the context provided, one approach could be to modify the evaluator to handle cases where scores are not available. This can be achieved by implementing a custom evaluator that doesn't rely on scores for its logic.
Here's a conceptual modification to the `MockEvaluator` class to illustrate handling evaluations without relying on scores:
```python
from typing import Any, Optional, Sequence

from llama_index.core.evaluation import BaseEvaluator
from llama_index.core.evaluation.base import EvaluationResult
from llama_index.core.prompts.mixin import PromptDictType


class CustomEvaluator(BaseEvaluator):
    def __init__(
        self,
        expected_answer: str,
        mock_passing: bool = True,
        mock_feedback: str = "test feedback",
    ) -> None:
        self.expected_answer = expected_answer
        self._mock_passing = mock_passing
        self._mock_feedback = mock_feedback

    def _get_prompts(self) -> PromptDictType:
        """Get prompts."""
        return {}

    def _update_prompts(self, prompts: PromptDictType) -> None:
        """Update prompts."""

    async def aevaluate(
        self,
        query: Optional[str] = None,
        response: Optional[str] = None,
        contexts: Optional[Sequence[str]] = None,
        **kwargs: Any,
    ) -> EvaluationResult:
        # Determine passing by matching against an expected answer,
        # so no score needs to be parsed from the model output.
        passing = response == self.expected_answer
        score = 1.0 if passing else 0.0  # optionally assign a score based on correctness
        return EvaluationResult(
            query=query,
            contexts=contexts,
            response=response,
            passing=passing,
            score=score,
            feedback=self._mock_feedback if passing else "Incorrect answer",
        )
```
This modification introduces an `expected_answer` parameter to the evaluator and changes the evaluation logic to determine passing based on whether the response matches the expected answer. This approach does not rely on scores from the model outputs and can be further customized to handle more complex evaluation criteria.
You can find more information about the `MockEvaluator` class in the `llama-index-core/tests/evaluation/test_base.py` file.
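Stripped of the LlamaIndex base class, the pass/fail logic above reduces to a plain exact-match check. A minimal standalone sketch for illustration (the `SimpleResult` dataclass here is a hypothetical stand-in for `EvaluationResult`, not LlamaIndex API):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SimpleResult:
    # Hypothetical stand-in for llama_index's EvaluationResult.
    response: Optional[str]
    passing: bool
    score: float
    feedback: str


def evaluate_exact_match(response: Optional[str], expected_answer: str) -> SimpleResult:
    """Pass/fail by exact match, with no score parsed from model output."""
    passing = response == expected_answer
    return SimpleResult(
        response=response,
        passing=passing,
        score=1.0 if passing else 0.0,
        feedback="test feedback" if passing else "Incorrect answer",
    )
```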
Regarding the exact error message or exception thrown when LlamaIndex crashes due to missing scores in the model outputs, unfortunately I wasn't able to find an answer within the repository. It would be helpful if you could provide more details about the error message or traceback you're encountering.
I hope this helps! If you have any further questions or need more clarification, please don't hesitate to ask.
I threw in some retry logic around my `.evaluate` call because I'm not quite familiar enough with LlamaIndex's abstractions to do much better at the moment.
Any examples that could show how to combine this with a Pydantic parser and the self-correcting query engines, so it stays inside LlamaIndex's abstractions? After all, what is the correctness evaluator if not another query over two contexts (the answer and the reference)?
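For reference, the retry workaround mentioned above doesn't need anything LlamaIndex-specific. A rough sketch of the approach (the `evaluate_with_retry` helper is hypothetical; adjust the caught exception type to whatever your evaluator actually raises):

```python
import time


def evaluate_with_retry(evaluate_fn, *args, max_retries=3, delay=0.0, **kwargs):
    """Call an evaluator, retrying when score parsing fails.

    evaluate_fn is any callable, e.g. evaluator.evaluate; ValueError is
    what float('') raises when the model returns no score.
    """
    last_error = None
    for _ in range(max_retries):
        try:
            return evaluate_fn(*args, **kwargs)
        except ValueError as exc:
            last_error = exc
            if delay:
                time.sleep(delay)  # optional back-off between attempts
    raise RuntimeError(f"Evaluation failed after {max_retries} attempts") from last_error
```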
Bug Description
LlamaIndex crashes when evaluating some large language models for specific metrics (e.g., answer correctness). This happens because these models don't provide scores in their outputs.
Currently, our temporary solution using a try-except prevents crashes, but a more robust approach may be required.
Implementing proper error handling for missing scores seems like a good approach, and giving developers some kind of feedback about the missing score could also be valuable. I'm open to any other solutions you think could work in this case.
Version
v0.10.1
Steps to Reproduce
This issue is difficult to reproduce because it's not always deterministic due to its reliance on a large language model (LLM).
Hyperparameter causing the problem:
Data used to reproduce the bug: Unfortunately, I cannot share the data due to confidentiality.
The error appears to be caused by `score_str` being an empty string at this line:
https://github.com/run-llama/llama_index/blob/d97cec8ade50f3197f09cc380676c6e4f5288439/llama-index-core/llama_index/core/evaluation/eval_utils.py#L183
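For illustration, the crash can be avoided by treating an unparseable score as missing rather than letting `float('')` raise. A hedged sketch (this `parse_score` helper is hypothetical, not LlamaIndex API):

```python
from typing import Optional


def parse_score(score_str: str) -> Optional[float]:
    """Return the score as a float, or None when the model gave no score.

    float('') raises ValueError, which matches the crash described above.
    """
    try:
        return float(score_str)
    except ValueError:
        return None
```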
Relevant Logs/Tracebacks
No response