truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

[FEAT] Running on existing data #1215

Open funkjo opened 3 weeks ago

funkjo commented 3 weeks ago

Feature Description: An easier way to obtain feedback results when running feedbacks through the virtual recorder and virtual app.

Reason: I am following the documentation here, and after running

for record in data:
    virtual_recorder.add_record(record)

I am unable to figure out how to obtain the results of the question relevance feedback function.

Importance of Feature: The value this unlocks is the ability to easily run feedback functions on large amounts of data after it has been logged, rather than at runtime of the LLM application.

sfc-gh-jreini commented 3 weeks ago

Hey @funkjo - you can wait for computation and then display feedback results using the following:

for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)

Does that help?
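Note that rec there is a single record; to check every record you added, you can loop over all of them. A minimal sketch, assuming data is the list of VirtualRecord objects you passed to add_record:

for rec in data:
    for feedback, feedback_result in rec.wait_for_feedback_results().items():
        # Blocks until the feedback for this record has been computed.
        print(feedback.name, feedback_result.result)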

funkjo commented 2 weeks ago

@sfc-gh-jreini thanks for the comment. I'm not seeing any results when trying that out. The code runs but prints nothing.

Here is the full code FYI:

import pandas as pd

data = {
    'prompt': ['Where is Germany?', 'What is the capital of France?'],
    'response': ['Germany is in Europe', 'The capital of France is Paris'],
    'context': ['Germany is a country located in Europe.', 'France is a country in Europe and its capital is Paris.']
}
df = pd.DataFrame(data)
df.head()

from trulens_eval import Select
from trulens_eval.tru_virtual import VirtualApp
from trulens_eval.tru_virtual import VirtualRecord

virtual_app = dict(
    llm=dict(
        modelname="some llm component model name"
    ),
    template="information about the template I used in my app",
    debug="all of these fields are completely optional"
)

virtual_app = VirtualApp(virtual_app) # can start with the prior dictionary
virtual_app[Select.RecordCalls.llm.maxtokens] = 1024

retriever_component = Select.RecordCalls.retriever
virtual_app[retriever_component] = "this is the retriever component"

context_call = retriever_component.get_context

data_dict = df.to_dict('records')

data = []

for record in data_dict:
    rec = VirtualRecord(
        main_input=record['prompt'],
        main_output=record['response'],
        calls=
            {
                context_call: dict(
                    args=[record['prompt']],
                    rets=[record['context']]
                )
            }
        )
    data.append(rec)

from trulens_eval.feedback.provider import OpenAI
from trulens_eval.feedback.feedback import Feedback
from trulens_eval.feedback.provider.openai import AzureOpenAI as fAzureOpenAI
from azure.identity import EnvironmentCredential
from dotenv import load_dotenv
load_dotenv()
import os

# Get an Azure AD token so an API key is not needed in the environment
def getToken():
    """Gets an Azure AD token via EnvironmentCredential."""
    credential = EnvironmentCredential()
    token = credential.get_token("get token url")
    print("token acquired")
    return token

fopenai = fAzureOpenAI(
                        api_key=getToken().token,
                        azure_deployment=os.environ["CHATGPT_DEPLOYMENT"],
                        azure_endpoint=os.environ["OPENAI_URL"],
                        deployment_name=os.environ["CHATGPT_MODEL"],
                        api_version=os.environ["OPENAI_API_VERSION"]
)

# Select context to be used in feedback. We select the return values of the
# virtual `get_context` call in the virtual `retriever` component. Names are
# arbitrary except for `rets`.
context = context_call.rets[:]

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(fopenai.qs_relevance)
    .on_input()
    .on(context)
)

from trulens_eval.tru_virtual import TruVirtual

virtual_recorder = TruVirtual(
    app_id="a virtual app",
    app=virtual_app,
    feedbacks=[f_context_relevance]
)

temp = []

for record in data:
    result = virtual_recorder.add_record(record=record, feedback_mode=f_context_relevance)
    temp.append(result)

for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)
    print('hello')

funkjo commented 2 weeks ago

@sfc-gh-jreini when I print the output of add_record that I saved in a temp list, I see the following:

[VirtualRecord(record_id='record_hash_77668b896f8e4bbb87eb94da85fa8a2e', app_id='a virtual app', cost=Cost(n_requests=0, n_successful_requests=0, n_classes=0, n_tokens=0, n_stream_chunks=0, n_prompt_tokens=0, n_completion_tokens=0, cost=0.0), perf=Perf(start_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 202525), end_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 202526)), ts=datetime.datetime(2024, 6, 18, 13, 4, 55, 202525), tags='', meta=None, main_input='Where is Germany?', main_output='Germany is in Europe', main_error=None, calls=[RecordAppCall(call_id='85f88090-6962-4db7-9bb1-b97bc3c51ecd', stack=[RecordAppCallMethod(path=Lens(), method=Method(obj=Obj(cls=trulens_eval.tru_virtual.VirtualApp, id=0, init_bindings=None), name='root')), RecordAppCallMethod(path=Lens().app.retriever, method=Method(obj=Obj(cls=trulens_eval.tru_virtual.VirtualApp, id=0, init_bindings=None), name='get_context'))], args=['Where is Germany?'], rets=['Germany is a country located in Europe.'], error=None, perf=Perf(start_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 202525), end_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 202526)), pid=0, tid=0), RecordAppCall(call_id='b123973a-4b42-4c96-85c4-65b01eeefec3', stack=[RecordAppCallMethod(path=Lens(), method=Method(obj=Obj(cls=trulens_eval.tru_virtual.VirtualApp, id=0, init_bindings=None), name='root'))], args=['Where is Germany?'], rets=['Germany is in Europe'], error=None, perf=Perf(start_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 202525), end_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 202526)), pid=0, tid=0)], feedback_and_future_results=None, feedback_results=None),
 VirtualRecord(record_id='record_hash_3a6ffcbb0f012c7572f6d463f409a9ca', app_id='a virtual app', cost=Cost(n_requests=0, n_successful_requests=0, n_classes=0, n_tokens=0, n_stream_chunks=0, n_prompt_tokens=0, n_completion_tokens=0, cost=0.0), perf=Perf(start_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 208946), end_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 208947)), ts=datetime.datetime(2024, 6, 18, 13, 4, 55, 208946), tags='', meta=None, main_input='What is the capital of France?', main_output='The capital of France is Paris', main_error=None, calls=[RecordAppCall(call_id='66de71b8-2931-4f3f-8d07-808b9b81263d', stack=[RecordAppCallMethod(path=Lens(), method=Method(obj=Obj(cls=trulens_eval.tru_virtual.VirtualApp, id=0, init_bindings=None), name='root')), RecordAppCallMethod(path=Lens().app.retriever, method=Method(obj=Obj(cls=trulens_eval.tru_virtual.VirtualApp, id=0, init_bindings=None), name='get_context'))], args=['What is the capital of France?'], rets=['France is a country in Europe and its capital is Paris.'], error=None, perf=Perf(start_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 208946), end_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 208947)), pid=0, tid=0), RecordAppCall(call_id='6db08e44-694e-4fcd-b4a6-49c15bf1b898', stack=[RecordAppCallMethod(path=Lens(), method=Method(obj=Obj(cls=trulens_eval.tru_virtual.VirtualApp, id=0, init_bindings=None), name='root'))], args=['What is the capital of France?'], rets=['The capital of France is Paris'], error=None, perf=Perf(start_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 208946), end_time=datetime.datetime(2024, 6, 18, 13, 4, 55, 208947)), pid=0, tid=0)], feedback_and_future_results=None, feedback_results=None)]

At the very end, feedback_results=None.

sfc-gh-jreini commented 2 weeks ago

I think the issue here might be your setting of feedback mode. The feedback mode should be set when you set up the recorder, and f_context_relevance is not a valid value for this argument:

for record in data:
    result = virtual_recorder.add_record(record=record, feedback_mode=f_context_relevance)
    temp.append(result)

To run feedbacks immediately when the record is added, this argument can be left out altogether.

for record in data:
    result = virtual_recorder.add_record(record=record)
    temp.append(result)

To run feedbacks in deferred mode, you can add it to the recorder setup.

virtual_recorder = TruVirtual(
    app_id="a virtual app",
    app=virtual_app,
    feedbacks=[f_context_relevance],
    feedback_mode="deferred"  # run the feedbacks later via the evaluator
)

for record in data:
    result = virtual_recorder.add_record(record=record)
    temp.append(result)

from trulens_eval import Tru

tru = Tru()
tru.start_evaluator()

Read more about feedback mode
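Once the deferred evaluator has processed the records, the computed values can also be read back out of the database as a dataframe. A sketch reusing the tru instance above and the "a virtual app" app_id (get_records_and_feedback returns the records dataframe together with the feedback column names):

records_df, feedback_cols = tru.get_records_and_feedback(app_ids=["a virtual app"])

# Each feedback column holds that feedback function's result for the record.
print(records_df[feedback_cols])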

funkjo commented 2 weeks ago

I made the suggested change and removed the feedback_mode parameter in add_record.

for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)
    print('hello')

output:

qs_relevance None
hello

rec.wait_for_feedback_results().items() has the following output:

dict_items([(FeedbackDefinition(qs_relevance,
    selectors={'question': Lens().__record__.main_input, 'context': Lens().__record__.app.retriever.get_context.rets[:]},
    if_exists=None
), qs_relevance (FeedbackResultStatus.FAILED) = None
)])
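One way to see why it failed is to print the stored error alongside the status; a sketch, assuming the FeedbackResult objects expose status and error fields:

for feedback, feedback_result in rec.wait_for_feedback_results().items():
    # When the status is FAILED, error should hold the failure reason.
    print(feedback.name, feedback_result.status, feedback_result.error)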
sfc-gh-jreini commented 1 week ago

Thanks @funkjo - was able to replicate this. Will work on a solution this week.

funkjo commented 1 week ago

Thanks @sfc-gh-jreini, any update on this?