truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

Generate TruLens eval report of generated Question answer context pairs. #846

Closed · prasad4fun closed this issue 7 months ago

prasad4fun commented 7 months ago

I have used the Mixtral and Mistral models on 1,000 sampled questions. Using a RAG pipeline built with LlamaIndex, I have retrieved contexts and generated answers for these 1k questions. Is it possible to evaluate the RAG Triad metrics on this existing data frame of question, context, and answer columns, using GPT-3.5 as the evaluator?

joshreini1 commented 7 months ago

Hey @prasad4fun! Definitely. Check out these docs for running on existing data.

kmr666 commented 2 months ago

I already have the questions and the contexts retrieved by RAG. Can I directly use TruLens for evaluation? Where can I find the documentation? The documentation link you provided above is no longer valid.

sfc-gh-jreini commented 2 months ago

Hi @kmr666 - those docs have moved here

kmr666 commented 2 months ago

Hello, thank you very much for your reply and help. I have another question for you. I already have a database with multiple entries of question, context, and answer. However, my context is given as a nested array, such as [[], [], []]. Do I need to flatten the nested context into a list of strings, or is there another method? Could you tell me how to input my own data? Thank you very much.

sfc-gh-jreini commented 2 months ago

Not sure if I am getting your data right; please let me know and share a sample data frame if not.

You can load data like this:

import pandas as pd

# Each row holds a prompt, the generated response, and the list of retrieved
# contexts for that prompt.
data = {
    'prompt': ['Where is Germany?', 'What is the capital of France?'],
    'response': ['Germany is in Europe', 'The capital of France is Paris'],
    'context': [
        ['Germany is a country located in Europe',
         'Germany lies between the Baltic and North Sea to the North and Alps to the South'],
        ['France is a country in Europe and its capital is Paris.',
         'Paris is the capital of France']
    ]
}
df = pd.DataFrame(data)
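
If your context column is nested one level deeper per row (e.g. [['a', 'b'], ['c']]), you can first flatten each row into a single list of strings; a minimal sketch:

# Flatten one level of nesting so each row's context is a flat list of strings.
df['context'] = df['context'].apply(
    lambda nested: [chunk for sub in nested for chunk in sub]
)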

And turn it into TruLens virtual records like this:

from trulens_eval import Select
from trulens_eval.tru_virtual import VirtualRecord

# Selector for the (virtual) retriever component's get_context call; the
# component and method names are arbitrary but should match your virtual app.
context_call = Select.RecordCalls.retriever.get_context

data_dict = df.to_dict('records')

data = []

for record in data_dict:
    rec = VirtualRecord(
        main_input=record['prompt'],
        main_output=record['response'],
        calls={
            context_call: dict(
                args=[record['prompt']],
                rets=record['context']
            )
        }
    )
    data.append(rec)

from trulens_eval.tru_virtual import TruVirtual, VirtualApp

# A minimal virtual app; you can optionally add metadata describing your app.
virtual_app = VirtualApp()

virtual_recorder = TruVirtual(
    app_id="a virtual app",
    app=virtual_app)

for record in data:
    virtual_recorder.add_record(record)

Once you've done this, you can then see your app represented in the TruLens dashboard.
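
To open it, a minimal sketch (assuming a default trulens_eval setup):

from trulens_eval import Tru

# Launch the local TruLens dashboard; the records added above appear under
# the "a virtual app" app.
tru = Tru()
tru.run_dashboard()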

[Screenshot: the virtual app represented in the TruLens dashboard]

And you can see both retrieved contexts by clicking on the get_context span.

[Screenshot: retrieved contexts shown under the get_context span]

The beauty of this approach is that it can be set up to mirror any arbitrary app structure. Let me know if you have more questions on this.
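
For instance, a sketch of a virtual app with some structure; the component names and metadata here are illustrative, not required:

from trulens_eval import Select
from trulens_eval.tru_virtual import VirtualApp

# Any dict can describe your app; all fields are optional.
virtual_app = VirtualApp(dict(
    llm=dict(modelname="mixtral"),
    template="notes about the prompt template used in the app",
))

# Components can also be registered under selectors, e.g. the retriever whose
# get_context call carries the retrieved contexts.
virtual_app[Select.RecordCalls.retriever] = "retriever component"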

kmr666 commented 2 months ago

Hello, thank you very much for your answers and help. My data is stored in a CSV file with three columns: question, context, and answer. The content is Chinese text, and the data format is as follows:

data_samples = {
    'question': ['xxx', 'xxx'],
    'response': ['xxx', 'xxx'],
    'contexts': [['xxx'], ['xxx']],
}

Here is a data instance:

[Screenshot: a sample data instance]

Can you tell me if it's possible to use TruLens to evaluate an existing dataset of this type? Thank you very much, I look forward to your reply!

sfc-gh-jreini commented 2 months ago

TruLens would be a good fit for this because of its wide model support. I would recommend exploring LLMs with good Chinese language support to use as feedback providers.

If you encounter issues because the prompts are in English, you can also use custom feedback functions and translate the feedback prompts to Chinese.
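
For example, a minimal sketch of a custom feedback function with a Chinese prompt; here my_chinese_llm is a stand-in for whatever client you use to call a Chinese-capable model (it is not part of TruLens):

from trulens_eval import Feedback

def context_relevance_zh(question: str, context: str) -> float:
    # Ask the model for a 0-10 relevance score, with instructions in Chinese.
    prompt = (
        "请评估以下上下文与问题的相关性，只输出一个0到10之间的整数。\n"
        f"问题：{question}\n上下文：{context}"
    )
    # my_chinese_llm is a hypothetical stand-in for your own model client.
    return float(my_chinese_llm(prompt)) / 10.0

f_context_relevance_zh = (
    Feedback(context_relevance_zh)
    .on_input()
    .on(context_call.rets[:])  # retrieved contexts from the virtual retriever call
)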

kmr666 commented 1 month ago

Thank you for your help. I have another problem I hope you can help with. When I evaluate my own data, the dashboard does not display the Trace details, and I cannot see the corresponding context information that I added. [Screenshot: dashboard missing Trace details] My code is as follows:

# data_dict, data, context_call, and virtual_app are defined as in your example above.
for record in data_dict:
    rec = VirtualRecord(
        main_input=record['prompt'],
        main_output=record['response'],
        calls={
            context_call: dict(
                args=[record['prompt']],
                rets=record['context']
            )
        }
    )
    data.append(rec)
from trulens_eval import Feedback
from trulens_eval.feedback.provider.openai import OpenAI as fOpenAI

fopenai = fOpenAI()  # OpenAI provider used as the evaluator

# Select the retrieved contexts from the virtual retriever call.
context = context_call.rets[:]

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(fopenai.context_relevance)
    .on_input()
    .on(context)
)

from trulens_eval.tru_virtual import TruVirtual

virtual_recorder = TruVirtual(
    app_id="a virtual app",
    app=virtual_app,
    feedbacks=[f_context_relevance]
)

for rec in data:
    virtual_recorder.add_record(rec)
# Retrieve feedback results. You can either browse the dashboard or retrieve the
# results from the record after it has been `add_record`ed.

for rec in data:
    print(rec.main_input, "-->", rec.main_output)

    for feedback, feedback_result in rec.wait_for_feedback_results().items():
        print("\t", feedback.name, feedback_result.result)

    print('finish')

The response in my own dataset is currently null. How can I see the full data, including Trace details, as you showed above?