🤖
Based on the information provided, it seems like the `AnswerRelevancyEvaluator` and `ContextRelevancyEvaluator` are not producing the expected `feedback` format and not computing the `passing` field when passed to `EvaluationResult`. This could be due to the response object not containing the string "yes", or the string representation of the response object not being in the expected format.

The `feedback` field in the `AnswerRelevancyEvaluator` and `ContextRelevancyEvaluator` classes is expected to be a string that contains detailed feedback on the response according to the criteria questions in the evaluation template. The feedback should end with a final result in the format: `'[RESULT]' followed by the float number representing the total score assigned to the response`.
Here is the relevant code:
```python
import re
from typing import Optional, Tuple

from llama_index.core.prompts import PromptTemplate

DEFAULT_EVAL_TEMPLATE = PromptTemplate(
    ...
    "After your feedback provide a final result by strictly following this format: "
    "'[RESULT] followed by the float number representing the total score assigned to the response'\n\n"
    ...
)


def _default_parser_function(output_str: str) -> Tuple[Optional[float], Optional[str]]:
    # Pattern to match the feedback and the score:
    # any text ending with '[RESULT]' followed by a number
    pattern = r"([\s\S]+)(?:\[RESULT\]\s*)([\d.]+)"

    # Search for the pattern in the LLM output
    result = re.search(pattern, output_str)

    if result:
        # Extract feedback and score from the single expected match
        feedback, score = result.groups()
        score = float(score) if score is not None else score
        return score, feedback.strip()
    else:
        return None, None
```
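As a quick sanity check on the regex: `[\d.]+` stops matching at the slash, so an LLM output ending in `4/4` still yields a numeric score:

```python
output = "The context fully matches the query.\n[RESULT] 4/4"
score, feedback = _default_parser_function(output)
print(score)     # 4.0
print(feedback)  # "The context fully matches the query."
```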
This code defines the expected format of the `feedback` field and how it is parsed from the output string. If the `feedback` field in the response object does not match this format, it could explain why you are seeing `4/4` instead of `4`, and why the `passing` field is not being computed.
To fix this issue, you could try ensuring that the `feedback` field in the response object matches the expected format. If the issue persists, it would be helpful to have more information about the `AnswerRelevancyEvaluator` and `ContextRelevancyEvaluator` classes, as they are not mentioned in the provided context.
I hope this helps! If you have any more questions or need further clarification, feel free to ask.
Hi @peguerosdc,

Yes, unfortunately LLMs used as evaluators sometimes don't respect the output format. In that case, we simply mark that evaluation result as invalid and record the "output cannot be parsed" reason in `invalid_reason`. Not sure what we can do here, outside of perhaps prompting the evaluator again.
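A retry wrapper along these lines could work around occasional format violations (a minimal sketch, assuming `EvaluationResult.invalid_result` is set when parsing fails, as in recent 0.10.x releases):

```python
from llama_index.core.evaluation import BaseEvaluator, EvaluationResult


def evaluate_with_retries(
    evaluator: BaseEvaluator, max_retries: int = 3, **eval_kwargs
) -> EvaluationResult:
    """Re-prompt the evaluator while its output cannot be parsed."""
    result = evaluator.evaluate(**eval_kwargs)
    retries = 0
    while result.invalid_result and retries < max_retries:
        # The LLM ignored the output format; try again.
        result = evaluator.evaluate(**eval_kwargs)
        retries += 1
    return result
```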
Regarding the `passing` variable not being computed for `AnswerRelevancy` and `ContextRelevancy`: you're right. This variable merely transforms the numeric `score` into a binary pass-or-fail value (and if `score` is None, `passing` is None as well). We could set the threshold for passing here to be 50% of the respective rubrics for `AnswerRelevancy` and `ContextRelevancy`.
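That mapping would look something like the sketch below; the 50% cutoff is the proposal above, and the rubric maxima (2.0 for answer relevancy, 4.0 for context relevancy) are assumptions based on the default templates:

```python
from typing import Optional


def compute_passing(score: Optional[float], max_score: float) -> Optional[bool]:
    """Binary pass/fail at 50% of the rubric's maximum score."""
    if score is None:
        return None  # unparseable output: passing stays None
    return score >= 0.5 * max_score


compute_passing(4.0, max_score=4.0)   # True  (context relevancy, "4/4")
compute_passing(None, max_score=2.0)  # None  (invalid result)
```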
Curious: are you using `score` or `passing` more when using our evaluators?
Hi, @peguerosdc
I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. The issue concerns the incorrect computation of the `passing` field by the `AnswerRelevancyEvaluator` and `ContextRelevancyEvaluator`, as well as the inconsistent feedback format from the LLM (GPT-4) in version 0.10.4. There have been discussions around potential solutions and the handling of invalid evaluation results, but the issue remains unresolved.
Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!
Bug Description
I ran `AnswerRelevancyEvaluator` and `ContextRelevancyEvaluator` and found the following related issues:

- `feedback` by the LLM (at least by GPT-4) is not always in the desired format
- the `passing` field is not computed at all when passed to `EvaluationResult`

I know the first point is related to the unreliability of LLMs, but thought it was worth mentioning, as that makes the whole package unreliable as well.
Version
0.10.4
Steps to Reproduce
The following code:
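A minimal sketch of such a repro, assuming the standard `llama_index` 0.10.x evaluation API (the LLM, query, response, and context strings here are illustrative, not the original ones):

```python
from llama_index.core.evaluation import (
    AnswerRelevancyEvaluator,
    ContextRelevancyEvaluator,
)
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4")

query = "What is the capital of France?"
response = "The capital of France is Paris."
contexts = ["Paris is the capital and most populous city of France."]

# Context relevancy scores the retrieved contexts against the query
context_relevancy = ContextRelevancyEvaluator(llm=llm).evaluate(
    query=query, contexts=contexts
)
# Answer relevancy scores the generated response against the query
answer_relevancy = AnswerRelevancyEvaluator(llm=llm).evaluate(
    query=query, response=response
)

print(context_relevancy)
print(answer_relevancy)
```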
Produces the following response for `context_relevancy`:

Where:

- `feedback` contains the result as `4/4` and not as `4` as instructed in the prompt, which is not a big deal for now, as it is being parsed correctly by the regex.
- `passing = None` because it is not being computed at all.

Similarly, it produces the following response for `answer_relevancy`:

Where:

- `passing = None` because it is not being computed at all and instead it's "included" when computing the `score`.

Relevant Logs/Tracebacks
No response