run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Bug]: Some evaluators (at least AnswerRelevancy and ContextRelevancy) don't compute the score correctly and/or don't compute passing at all #10761

Closed: peguerosdc closed this issue 3 months ago

peguerosdc commented 7 months ago

Bug Description

I ran AnswerRelevancyEvaluator and ContextRelevancyEvaluator and found the following related issues:

  1. The LLM does not always respect the output format requested in the prompt (for example, it returns [RESULT] 4/4 instead of [RESULT] 4).
  2. passing is never computed; it is always left as None on the EvaluationResult.

I know the first point is related to the unreliability of LLMs, but I thought it was worth mentioning, as it makes the whole package unreliable as well.

Version

0.10.4

Steps to Reproduce

The following code:

from llama_index.core.evaluation import AnswerRelevancyEvaluator, ContextRelevancyEvaluator
from llama_index.core.evaluation import BatchEvalRunner
from llama_index.llms.openai import OpenAI

llm = OpenAI("gpt-4")

runner = BatchEvalRunner(
    {
     "answer_relevancy": AnswerRelevancyEvaluator(llm=llm),
     "context_relevancy": ContextRelevancyEvaluator(llm=llm),
     },
    workers=2,
)

# aevaluate_response_strs is async; run this inside an async context
# (e.g. a notebook that supports top-level await).
eval_results = await runner.aevaluate_response_strs(
    queries=["What are the airports in New York City?"],
    response_strs=["The airports in New York City include John F. Kennedy International Airport, Newark Liberty International Airport, LaGuardia Airport, Stewart International Airport, Long Island MacArthur Airport, Trenton-Mercer Airport, and Westchester County Airport."],
    contexts_list=[["Like the New York City Subway, the PATH operates 24 hours a day; meaning three of the six rapid transit systems in the world which operate on 24-hour schedules are wholly or partly in New York (the others are a portion of the Chicago 'L', the PATCO Speedline serving Philadelphia, and the Copenhagen Metro).Multibillion-dollar heavy rail transit projects under construction in New York City include the Second Avenue Subway, and the East Side Access project. ==== Buses ==== New York City's public bus fleet runs 24/7 and is the largest in North America. The Port Authority Bus Terminal, the main intercity bus terminal of the city, serves 7,000 buses and 200,000 commuters daily, making it the busiest bus station in the world. === Air === New York's airspace is the busiest in the United States and one of the world's busiest air transportation corridors. The three busiest airports in the New York metropolitan area include John F. Kennedy International Airport, Newark Liberty International Airport, and LaGuardia Airport; 130.5 million travelers used these three airports in 2016. JFK and Newark Liberty were the busiest and fourth busiest U.S. gateways for international air passengers, respectively, in 2012; as of 2011, JFK was the busiest airport for international passengers in North America.Plans have advanced to expand passenger volume at a fourth airport, Stewart International Airport near Newburgh, New York, by the Port Authority of New York and New Jersey. Plans were announced in July 2015 to entirely rebuild LaGuardia Airport in a multibillion-dollar project to replace its aging facilities. Other commercial airports in or serving the New York metropolitan area include Long Island MacArthur Airport, Trenton–Mercer Airport and Westchester County Airport. The primary general aviation airport serving the area is Teterboro Airport."]],
)

Produces the following response for context_relevancy:

EvaluationResult(
    query='What are the airports in New York City?',
    contexts=["Like the New York City Subway, [...]"],
    feedback="1. The retrieved context does match the subject matter of the user's query. It provides information about the airports in New York City, including John F. Kennedy International Airport, Newark Liberty International Airport, LaGuardia Airport, Stewart International Airport, Long Island MacArthur Airport, Trenton–Mercer Airport, and Westchester County Airport. It also mentions Teterboro Airport as the primary general aviation airport serving the area. \n2. The retrieved context can be used exclusively to provide a full answer to the user's query. It not only lists the airports in New York City but also provides additional information about their usage, plans for expansion, and their ranking in terms of passenger volume.\n\n[RESULT] 4/4",
    response=None, pairwise_source=None, invalid_result=False, invalid_reason=None,
    passing=None, score=1.0
)

Where:

  1. feedback contains the result as 4/4 rather than 4 as instructed in the prompt. This is not a big deal for now, since the regex still parses it correctly.
  2. passing = None because it is not being computed at all.

Similarly, it produces the following response for answer_relevancy:

EvaluationResult(
    query='What are the airports in New York City?',
    response='The airports in New York City include [...]',
    feedback="1. Yes, the provided response matches the subject matter of the user's query. The user asked for the airports in New York City and the response lists several airports in and around New York City.\n2. Yes, the provided response attempts to address the focus or perspective on the subject matter taken on by the user's query. The user wanted to know about the airports in New York City and the response provides a list of these airports.\n\n[RESULT] 2",
    pairwise_source=None, invalid_result=False, invalid_reason=None, contexts=None,
    score=1.0, passing=None
)

Where:

  1. passing = None because it is not being computed at all; instead, it is effectively "included" when computing the score (see the snippet below).
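
To make the impact concrete, here is a hypothetical snippet (assuming eval_results is the dict of result lists returned by BatchEvalRunner.aevaluate_response_strs above) showing that anything aggregating on passing breaks, because every result reports None:

# Hypothetical aggregation over the batch results from the snippet above.
for name, results in eval_results.items():
    passing_values = [r.passing for r in results]
    # Both evaluators currently yield [None], so a pass rate cannot be computed.
    print(name, passing_values)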

Relevant Logs/Tracebacks

No response

dosubot[bot] commented 7 months ago

🤖

Based on the information provided, it seems like the AnswerRelevancyEvaluator and ContextRelevancyEvaluator are not producing the expected feedback format and not computing the passing field when passed to EvaluationResult. This could be due to the response object not containing the string "yes" or the string representation of the response object not being in the expected format.

The feedback field in the AnswerRelevancyEvaluator and ContextRelevancyEvaluator classes is expected to be a string that contains detailed feedback on the response according to the criteria questions mentioned in the evaluation template. The feedback should end with a final result in the format: '[RESULT] followed by the float number representing the total score assigned to the response'.

Here is the relevant code:

DEFAULT_EVAL_TEMPLATE = PromptTemplate(
    ...
    "After your feedback provide a final result by strictly following this format: "
    "'[RESULT] followed by the float number representing the total score assigned to the response'\n\n"
    ...
)

import re
from typing import Optional, Tuple

def _default_parser_function(output_str: str) -> Tuple[Optional[float], Optional[str]]:
    # Pattern to match the feedback and response
    # This pattern looks for any text ending with '[RESULT]' followed by a number
    pattern = r"([\s\S]+)(?:\[RESULT\]\s*)([\d.]+)"

    # Using regex to find all matches
    result = re.search(pattern, output_str)

    # Check if any match is found
    if result:
        # Assuming there's only one match in the text, extract feedback and response
        feedback, score = result.groups()
        score = float(score) if score is not None else score
        return score, feedback.strip()
    else:
        return None, None

This code defines the expected format of the feedback field and how it is parsed from the output string. If the feedback field in the response object does not match this format, it could explain why you are seeing 4/4 instead of 4 and why the passing field is not being computed.
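
For what it's worth, a quick standalone check of the quoted regex (a minimal sketch using only the pattern above, not the evaluator itself) shows why the 4/4 output still parses: [\d.]+ stops at the /, so only the leading 4 is captured.

import re

pattern = r"([\s\S]+)(?:\[RESULT\]\s*)([\d.]+)"

# Simulated LLM output ending in the off-format "4/4" from the bug report.
output_str = "The retrieved context matches the query.\n\n[RESULT] 4/4"

feedback, score = re.search(pattern, output_str).groups()
print(score)         # "4" -- the "/4" suffix is simply ignored
print(float(score))  # 4.0, consistent with the normalized score=1.0 shown above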

To fix this issue, you could try ensuring that the feedback field in the response object matches the expected format. If the issue persists, it would be helpful to have more information about the AnswerRelevancyEvaluator and ContextRelevancyEvaluator classes, as they are not mentioned in the provided context.

I hope this helps! If you have any more questions or need further clarification, feel free to ask.

nerdai commented 7 months ago

Hi @peguerosdc,

Yes, unfortunately, with LLMs as evaluators, they sometimes won't respect the output format. In that case, we simply mark that evaluation result as invalid and set invalid_reason to indicate that the output could not be parsed. I'm not sure what we can do here beyond perhaps prompting the evaluator again.

Regarding the passing variable not being computed for AnswerRelevancy and ContextRelevancy: you're right. This variable merely transforms the numeric score into a binary pass/fail (and if score is None, passing is None as well). We could set the passing threshold here to 50% of the respective rubrics for AnswerRelevancy and ContextRelevancy.
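
As a rough illustration of that proposal (a minimal sketch, not the library's current behavior; the helper name and the 0.5 cutoff are assumptions based on the 50% suggestion above), passing could be derived from the normalized score like this:

from typing import Optional

def derive_passing(score: Optional[float], threshold: float = 0.5) -> Optional[bool]:
    # Hypothetical helper: turn the normalized score into a binary pass/fail.
    # A None score (e.g. unparseable output) leaves passing undefined.
    if score is None:
        return None
    return score >= threshold

print(derive_passing(1.0))   # True  (the context_relevancy result above)
print(derive_passing(0.25))  # False
print(derive_passing(None))  # None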

Curious: are you using score or passing more when using our evaluators?

dosubot[bot] commented 4 months ago

Hi, @peguerosdc

I'm helping the LlamaIndex team manage their backlog and am marking this issue as stale. The issue concerns the incorrect computation of the passing field by the AnswerRelevancyEvaluator and ContextRelevancyEvaluator, as well as the inconsistent feedback format from the LLM (GPT-4) in version 0.10.4. There have been discussions around potential solutions and handling of invalid evaluation results, but the current status of the issue is unresolved.

Could you please confirm if this issue is still relevant to the latest version of the LlamaIndex repository? If it is, please let the LlamaIndex team know by commenting on the issue. Otherwise, feel free to close the issue yourself, or it will be automatically closed in 7 days. Thank you!