truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

Comprehensiveness not providing reasons #889

Closed rajib76 closed 9 months ago

rajib76 commented 9 months ago

openai_provider.comprehensiveness_with_cot_reasons(source=<source>, summary=<summary>)

If you execute the above code, the current CoT prompting does not capture the supporting evidence; it comes back empty. I am suggesting the solution below.
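
For reference, a minimal reproduction sketch (the import path and placeholder inputs are illustrative and may differ by trulens_eval version):

# Minimal reproduction sketch; import path and placeholder strings are
# illustrative, not taken from the issue.
from trulens_eval.feedback.provider.openai import OpenAI

openai_provider = OpenAI()
score, reasons = openai_provider.comprehensiveness_with_cot_reasons(
    source="<long source document>",
    summary="<candidate summary>",
)
print(score)    # a 0-1 comprehensiveness score
print(reasons)  # with the current CoT prompt, supporting evidence comes back empty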

In prompts.py, change the CoT prompt as below:

COT_REASONS_TEMPLATE = \
"""
Please provide the answer in JSON format using the template below.

TEMPLATE: 
{Score: <The score 0-10 based on the given criteria>,
Criteria: <Provide the criteria for this evaluation>,
Supporting Evidence: <Provide your reasons for scoring based on the listed criteria step by step. Tie it back to the evaluation being completed.>}
"""

change feedback/provider/base.py as below:

# Requires at module level: import ast; from typing import Dict, Optional, Tuple

def generate_score_and_reasons(
    self,
    system_prompt: str,
    user_prompt: Optional[str] = None,
    normalize: float = 10.0
) -> Tuple[float, Dict]:
    """
    Run the LLM and extract a score plus reasons from its response, which is
    expected to follow the JSON template containing "Score", "Criteria" and
    "Supporting Evidence".

    Args:
        system_prompt (str): A pre-formatted system prompt.
        user_prompt (Optional[str]): An optional user message to append.
        normalize (float): Divisor mapping the raw 0-10 score to a 0-1 scale.

    Returns:
        The score (float) on a 0-1 scale and reason metadata (dict) if available.
    """
    llm_messages = [{"role": "system", "content": system_prompt}]
    if user_prompt is not None:
        llm_messages.append({"role": "user", "content": user_prompt})

    response = self.endpoint.run_me(
        lambda: self._create_chat_completion(messages=llm_messages)
    )
    # Parse the structured (JSON-like) response instead of scanning lines.
    response = ast.literal_eval(response)
    score = float(response["Score"]) / normalize
    criteria = response["Criteria"]
    supporting_evidence = response["Supporting Evidence"]

    # (The previous implementation scanned the free-form response line by line
    # for "Score", "Criteria" and "Supporting Evidence"; with the JSON template
    # above, that string parsing is no longer needed.)

    reasons = {
        'reason': [
            {'Criteria': str(criteria)},
            {'Supporting Evidence': str(supporting_evidence)}
        ]
    }
    return score, reasons
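
With that change, a call would roughly look like this (the provider variable and values are illustrative):

# Illustrative call and result shape under the proposed change; the values
# are made up for demonstration.
score, reasons = provider.generate_score_and_reasons(system_prompt=system_prompt)
# score   -> 0.7  (raw 0-10 score divided by normalize)
# reasons -> {'reason': [{'Criteria': '...'},
#                        {'Supporting Evidence': '...'}]}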

The idea here is to use a JSON output parser. GPT-4 is trained to produce well-formed JSON output, so we can leverage that capability to make the extraction more deterministic and robust.
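
If we go that route, a small tolerant parser is a reasonable safeguard, since models sometimes emit Python-style dicts instead of strict JSON. A minimal sketch (not part of trulens):

import ast
import json

def parse_llm_json(text: str) -> dict:
    """Parse a JSON-like LLM response.

    Tries strict JSON first, then falls back to a Python-literal parse,
    which handles single-quoted keys/values that json.loads rejects.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return ast.literal_eval(text)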

joshreini1 commented 9 months ago

Thanks @rajib76 ! Improving the comprehensiveness prompting in this PR: https://github.com/truera/trulens/pull/901

Please give it a try if you get a chance!

joshreini1 commented 9 months ago

Hey @rajib76 - this is released in 0.23.0!

rajib76 commented 9 months ago

Thanks Josh, will try it now.
