truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

Outdated documentation for Groundedness #1149

Closed: Arseniy-II closed this issue 3 months ago

Arseniy-II commented 4 months ago

Bug Description: Here is the reference for the old use of Groundedness.

Expected behavior: According to the recent update, Groundedness should be used differently.
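
For illustration, a minimal sketch of what the updated pattern looks like (assuming a recent trulens_eval release where groundedness is exposed directly as a provider method such as groundedness_measure_with_cot_reasons; this is my reading of the update, not copied from the docs):

import numpy as np
from trulens_eval import Feedback
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Groundedness is called as a provider method instead of via a separate Groundedness class
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on_input()   # the source text
    .on_output()  # the statement(s) checked against the source
    .aggregate(np.mean)
)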

dosubot[bot] commented 4 months ago

To address the outdated documentation for Groundedness, we need to update it to reflect the recent changes introduced in the pull request Updated Groundedness Usage. Below is the updated documentation content that should replace the old references:


Groundedness Benchmark

Overview

The Groundedness functionality evaluates the consistency of summarized responses with their source texts. This is done using feedback function providers such as OpenAI (GPT-3.5-turbo and GPT-4) and Huggingface.

Updated Usage

The recent update introduces new methods and classes for Groundedness. Here is how you can use the updated Groundedness functionality:

Importing Required Modules and Functions

import numpy as np  # used below to average per-statement groundedness scores

from trulens_eval.feedback import GroundTruthAgreement, Groundedness
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import generate_summeval_groundedness_golden_set

Generating Test Cases

Tru().reset_database()  # start from a clean TruLens database

# Build the golden set from the SummEval-based test cases (first 100 examples)
test_cases_gen = generate_summeval_groundedness_golden_set("./datasets/summeval_test_100.json")
groundedness_golden_set = [next(test_cases_gen) for _ in range(100)]

Setting Environment Variables

import os
os.environ["OPENAI_API_KEY"] = "..."
os.environ["HUGGINGFACE_API_KEY"] = "..."

Benchmarking Groundedness Feedback Function Providers

Huggingface Provider

from trulens_eval.feedback.provider.hugs import Huggingface

huggingface_provider = Huggingface()
groundedness_hug = Groundedness(groundedness_provider=huggingface_provider)
f_groundedness_hug = (
    Feedback(groundedness_hug.groundedness_measure, name="Groundedness Huggingface")
    .on_input().on_output()
    .aggregate(groundedness_hug.grounded_statements_aggregator)
)

def wrapped_groundedness_hug(input, output):
    # Average the per-statement groundedness scores into a single value
    return np.mean(list(f_groundedness_hug(input, output)[0].values()))

OpenAI GPT-3.5-turbo Provider

from trulens_eval.feedback.provider import OpenAI

groundedness_openai = Groundedness(groundedness_provider=OpenAI(model_engine="gpt-3.5-turbo"))
f_groundedness_openai = (
    Feedback(groundedness_openai.groundedness_measure, name="Groundedness OpenAI GPT-3.5")
    .on_input().on_output()
    .aggregate(groundedness_openai.grounded_statements_aggregator)
)

def wrapped_groundedness_openai(input, output):
    # Use the aggregated score for the full document
    return f_groundedness_openai(input, output)[0]['full_doc_score']

OpenAI GPT-4 Provider

groundedness_openai_gpt4 = Groundedness(groundedness_provider=OpenAI(model_engine="gpt-4"))
f_groundedness_openai_gpt4 = (
    Feedback(groundedness_openai_gpt4.groundedness_measure, name="Groundedness OpenAI GPT-4")
    .on_input().on_output()
    .aggregate(groundedness_openai_gpt4.grounded_statements_aggregator)
)

def wrapped_groundedness_openai_gpt4(input, output):
    # Use the aggregated score for the full document
    return f_groundedness_openai_gpt4(input, output)[0]['full_doc_score']

Creating Feedback Object for Mean Absolute Error

ground_truth = GroundTruthAgreement(groundedness_golden_set)
f_mae = (
    Feedback(ground_truth.mae, name="Mean Absolute Error")
    .on(Select.Record.calls[0].args.args[0])
    .on(Select.Record.calls[0].args.args[1])
    .on_output()
)

Creating TruBasicApp Instances

tru_wrapped_groundedness_hug = TruBasicApp(wrapped_groundedness_hug, app_id="groundedness huggingface", feedbacks=[f_mae])
tru_wrapped_groundedness_openai = TruBasicApp(wrapped_groundedness_openai, app_id="groundedness openai gpt-3.5", feedbacks=[f_mae])
tru_wrapped_groundedness_openai_gpt4 = TruBasicApp(wrapped_groundedness_openai_gpt4, app_id="groundedness openai gpt-4", feedbacks=[f_mae])

Running the Groundedness Evaluation

# Run each wrapped groundedness function over the golden set, recording results with TruLens
for i in range(len(groundedness_golden_set)):
    source = groundedness_golden_set[i]["query"]
    response = groundedness_golden_set[i]["response"]
    with tru_wrapped_groundedness_hug as recording:
        tru_wrapped_groundedness_hug.app(source, response)
    with tru_wrapped_groundedness_openai as recording:
        tru_wrapped_groundedness_openai.app(source, response)
    with tru_wrapped_groundedness_openai_gpt4 as recording:
        tru_wrapped_groundedness_openai_gpt4.app(source, response)

Getting the Leaderboard

Tru().get_leaderboard(app_ids=[]).sort_values(by="Mean Absolute Error")

Additional Information

The trulens_eval/feedback/prompts.py file contains the templates and prompts used to evaluate groundedness and other feedback metrics. These templates and prompts are part of the feedback mechanisms that assess the groundedness and other quality metrics of responses.
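
To check which groundedness-related templates a given release actually ships, the prompts module can be inspected directly (a minimal sketch; it only assumes the file above is importable as trulens_eval.feedback.prompts):

from trulens_eval.feedback import prompts

# Print the names of groundedness-related prompt templates defined in this release
print([name for name in dir(prompts) if "GROUNDEDNESS" in name.upper()])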


This updated documentation should replace the old references to ensure users are guided correctly on how to use the new Groundedness functionality.

Sources

- [trulens_eval/trulens_eval/tests/groundedness_benchmark.ipynb](https://github.com/truera/trulens/blob/main/trulens_eval/trulens_eval/tests/groundedness_benchmark.ipynb)
- [trulens_eval/trulens_eval/tests/groundedness_smoke_tests.ipynb](https://github.com/truera/trulens/blob/main/trulens_eval/trulens_eval/tests/groundedness_smoke_tests.ipynb)
- [trulens_eval/trulens_eval/feedback/prompts.py](https://github.com/truera/trulens/blob/main/trulens_eval/trulens_eval/feedback/prompts.py)
- [trulens_eval/trulens_eval/feedback/provider/base.py](https://github.com/truera/trulens/blob/main/trulens_eval/trulens_eval/feedback/provider/base.py)
- [trulens_eval/trulens_eval/feedback/v2/feedback.py](https://github.com/truera/trulens/blob/main/trulens_eval/trulens_eval/feedback/v2/feedback.py)

joshreini1 commented 3 months ago

Fixed by https://github.com/truera/trulens/pull/1161