run-llama / llama_index

LlamaIndex is a data framework for your LLM applications
https://docs.llamaindex.ai
MIT License

[Question]: How were the prompts and scoring logic defined for AnswerRelevancyEvaluator, ContextRelevancyEvaluator, and FaithfulnessEvaluator? #14432

Open thulard opened 1 week ago

thulard commented 1 week ago

Question Validation

Question

Hi,

I was wondering on what basis the prompts used by these metrics were defined. Was there any research that supports these specific prompts?

I'd also like to understand why the scoring systems are so different from one metric to the next.

Thank you

dosubot[bot] commented 1 week ago

Hey @thulard! I'm here to help you with any questions you have while waiting for a human maintainer. If you need assistance with bugs, questions, or becoming a contributor, feel free to reach out.

The prompts and scoring systems for the AnswerRelevancyEvaluator, ContextRelevancyEvaluator, and FaithfulnessEvaluator were designed to assess different aspects of generated responses using both standard LLMs and Multi-Modal LLMs. Here's a breakdown:

  1. Correctness (LLM-As-A-Judge): This evaluates the correctness of the response without considering the context, using a standard LLM like GPT-4. The scoring is on a scale from 1 to 5.
  2. Faithfulness (LMM-As-A-Judge): This evaluates whether the response is faithful to the provided context, requiring both images and text. It uses a Multi-Modal LLM (LMM) and provides a binary score (YES or NO).
  3. Relevancy (LMM-As-A-Judge): This evaluates the relevancy of the response to the provided context, also requiring both images and text. It uses a Multi-Modal LLM (LMM) and provides a binary score (YES or NO).

The prompts and scoring scales differ because each metric answers a different question: Correctness is graded against a reference answer, so it uses a finer-grained 1-to-5 rubric, while Faithfulness and Relevancy are checks against the supplied context that reduce naturally to a binary YES/NO judgment. Using the model type suited to each metric (text-only LLM vs. LMM) keeps the evaluation comprehensive.

Supporting evidence from the document:

### Correctness, Faithfulness, Relevancy

With the generated responses in hand (stored in a custom data object tailored for this ASL use case, namely `response_data`), we can now compute metrics for their evaluation:
- Correctness (LLM-As-A-Judge):
- Faithfulness (LMM-As-A-Judge):
- Relevancy (LMM-As-A-Judge):

To compute all three, we prompt another generative model to provide a score assessing its respective criterion. For Correctness, the context is not considered, so the judge can be a standard LLM. In contrast, computing Faithfulness and Relevancy requires passing in the context, meaning both the images and the text that were supplied to the RAG pipeline to generate the response in the first place. Because both image and text must be passed in, the judges for Faithfulness and Relevancy must be LMMs (Multi-Modal LLMs).

Code snippet showing the setup of the evaluators:

from llama_index.llms.openai import OpenAI
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
from llama_index.core.evaluation import CorrectnessEvaluator
from llama_index.core.evaluation.multi_modal import (
    MultiModalRelevancyEvaluator,
    MultiModalFaithfulnessEvaluator,
)

judges = {}

# Correctness ignores the context, so a standard (text-only) LLM can act as judge.
judges["correctness"] = CorrectnessEvaluator(
    llm=OpenAI(temperature=0, model="gpt-4"),
)

# Relevancy and Faithfulness need the image + text context, so the judge is an LMM.
judges["relevancy"] = MultiModalRelevancyEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4-vision-preview",
        max_new_tokens=300,
    )
)

judges["faithfulness"] = MultiModalFaithfulnessEvaluator(
    multi_modal_llm=OpenAIMultiModal(
        model="gpt-4-vision-preview",
        max_new_tokens=300,
    )
)
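For context, here is a minimal sketch of how these judges might be invoked on a single query/response pair. The example data and the keyword arguments for the multi-modal evaluators are illustrative assumptions, not code from the notebook, and may differ between llama-index versions:

# Illustrative inputs (hypothetical example data, not from the ASL dataset).
query = "How do you sign 'thank you' in ASL?"
generated_answer = "Touch your chin with a flat hand and move it forward."
reference_answer = "Place your fingertips on your chin and move the hand outward."
text_contexts = ["Thank you: flat hand at the chin, moving forward and down."]
image_paths = ["./asl_images/thank_you.png"]  # hypothetical path

# Correctness: judged by a text-only LLM on a 1-5 scale.
correctness_result = judges["correctness"].evaluate(
    query=query,
    response=generated_answer,
    reference=reference_answer,
)
print(correctness_result.score)    # float in [1, 5]
print(correctness_result.passing)  # True once the score clears the evaluator's threshold

# Faithfulness / Relevancy: judged by the LMM against both text and image context.
# The exact keyword arguments below are assumed; check your installed version.
faithfulness_result = judges["faithfulness"].evaluate(
    query=query,
    response=generated_answer,
    contexts=text_contexts,
    image_paths=image_paths,
)
print(faithfulness_result.passing)  # binary YES/NO surfaced as True/False
print(faithfulness_result.score)    # 1.0 or 0.0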

Correctness Evaluation Prompt

prometheus_correctness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query, a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given. 
            1. Write a detailed feedback that assesses the quality of the response strictly based on the given score rubric, not evaluating in general. 
            2. After writing a feedback, write a score that is either 1 or 2 or 3 or 4 or 5. You should refer to the score rubric. 
            3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (1 or 2 or 3 or 4 or 5)" 
            4. Please do not generate any other opening, closing, and explanations. 
            5. Only evaluate on common things between generated answer and reference answer. Don't evaluate on things which are present in reference answer but not in generated answer.

            ###The instruction to evaluate: Your task is to evaluate the generated answer and reference answer for the query: {query}

            ###Generate answer to evaluate: {generated_answer} 

            ###Reference Answer (Score 5): {reference_answer}

            ###Score Rubrics: 
            Score 1: If the generated answer is not relevant to the user query and reference answer.
            Score 2: If the generated answer is according to reference answer but not relevant to user query.
            Score 3: If the generated answer is relevant to the user query and reference answer but contains mistakes.
            Score 4: If the generated answer is relevant to the user query and has the exact same metrics as the reference answer, but it is not as concise.
            Score 5: If the generated answer is relevant to the user query and fully correct according to the reference answer.

            ###Feedback:"""
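The template asks the judge to emit its feedback followed by a `[RESULT]` marker and an integer from 1 to 5. As an illustration only (this is not the library's actual parser), the scoring logic amounts to extracting that integer:

import re
from typing import Optional

def parse_correctness_output(output: str) -> Optional[float]:
    """Pull the 1-5 score that follows the [RESULT] marker, if any."""
    match = re.search(r"\[RESULT\]\s*\(?([1-5])\)?", output)
    return float(match.group(1)) if match else None

# Example judge output and the score recovered from it.
raw = "Feedback: The answer matches the reference but omits one detail. [RESULT] 4"
score = parse_correctness_output(raw)          # -> 4.0
passing = score is not None and score >= 4.0   # 4.0 as a passing threshold is an assumption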

Faithfulness Evaluation Prompt

prometheus_faithfulness_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), an information, a context, and a score rubric representing evaluation criteria are given. 
            1. You are provided with evaluation task with the help of information, context information to give result based on score rubrics.
            2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general. 
            3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric. 
            4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)” 
            5. Please do not generate any other opening, closing, and explanations. 

        ###The instruction to evaluate: Your task is to evaluate if the given piece of information is supported by context.

        ###Information: {query_str} 

        ###Context: {context_str}

        ###Score Rubrics: 
        Score YES: If the given piece of information is supported by context.
        Score NO: If the given piece of information is not supported by context

        ###Feedback: """

Relevancy Evaluation Prompt

prometheus_relevancy_eval_prompt_template = """###Task Description: An instruction (might include an Input inside it), a query with response, context, and a score rubric representing evaluation criteria are given. 
            1. You are provided with evaluation task with the help of a query with response and context.
            2. Write a detailed feedback based on evaluation task and the given score rubric, not evaluating in general. 
            3. After writing a feedback, write a score that is YES or NO. You should refer to the score rubric. 
            4. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (YES or NO)” 
            5. Please do not generate any other opening, closing, and explanations. 

        ###The instruction to evaluate: Your task is to evaluate if the response for the query is in line with the context information provided.

        ###Query and Response: {query_str} 

        ###Context: {context_str}

        ###Score Rubrics: 
        Score YES: If the response for the query is in line with the context information provided.
        Score NO: If the response for the query is not in line with the context information provided.

        ###Feedback: """
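Faithfulness and Relevancy share this output format, so both reduce to a binary result. A hypothetical sketch of that mapping (illustrative, not the library's internal code):

import re

def parse_binary_output(output: str) -> tuple[bool, float]:
    """Map a '[RESULT] YES/NO' judge output to (passing, score)."""
    match = re.search(r"\[RESULT\]\s*\(?\s*(YES|NO)\s*\)?", output, re.IGNORECASE)
    passing = bool(match) and match.group(1).upper() == "YES"
    return passing, 1.0 if passing else 0.0

# Example: a judgement that the response is supported by the context.
raw = "Feedback: Every claim in the response appears in the context. [RESULT] YES"
passing, score = parse_binary_output(raw)   # -> (True, 1.0)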

[1][2]

To continue talking to Dosu, mention @dosu.