tascheidt opened 10 months ago
For TruLens, we need to set up a feedback function for each metric:
```python
from trulens_eval import OpenAI as fOpenAI

provider = fOpenAI()
```
Answer relevance:

```python
from trulens_eval import Feedback

f_qa_relevance = Feedback(
    provider.relevance_with_cot_reasons,
    name="Answer Relevance"
).on_input_output()
```
Context relevance (note: this one also needs `numpy` for the aggregation step):

```python
import numpy as np

from trulens_eval import Feedback, TruLlama

context_selection = TruLlama.select_source_nodes().node.text

f_qs_relevance = (
    Feedback(provider.qs_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)
```
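The `.aggregate(np.mean)` step matters because context relevance is scored once per retrieved chunk, and those per-chunk scores have to be collapsed into one record-level score. A minimal sketch of that aggregation, using made-up scores (the values here are hypothetical, not from the course):

```python
import numpy as np

# Hypothetical per-chunk context relevance scores, one per retrieved node
chunk_scores = [0.9, 0.4, 0.7]

# .aggregate(np.mean) reduces them to a single record-level score
record_score = float(np.mean(chunk_scores))
print(record_score)
```

Using the mean (rather than, say, `np.max`) penalizes retrievals that include irrelevant chunks even when one chunk is a strong match.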
Groundedness:

```python
from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=provider)

f_groundedness = (
    Feedback(
        grounded.groundedness_measure_with_cot_reasons,
        name="Groundedness"
    )
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)
```
Reporting from https://learn.deeplearning.ai/building-evaluating-advanced-rag/lesson/3/rag-triad-of-metrics:
```python
from trulens_eval import TruLlama
from trulens_eval import FeedbackMode

tru_recorder = TruLlama(
    sentence_window_engine,
    app_id="App_1",
    feedbacks=[
        f_qa_relevance,
        f_qs_relevance,
        f_groundedness
    ]
)
```
```python
eval_questions = []
with open('eval_questions.txt', 'r') as file:
    for line in file:
        item = line.strip()
        eval_questions.append(item)

for question in eval_questions:
    with tru_recorder as recording:
        sentence_window_engine.query(question)
```
```python
from trulens_eval import Tru

tru = Tru()  # the course notebook creates this earlier; repeated here so the snippet stands alone

records, feedback = tru.get_records_and_feedback(app_ids=[])
records.head()

tru.get_leaderboard(app_ids=[])
tru.run_dashboard()
```
Can try with PhaseLLM or TruLens per the DeepLearning.AI course.
The RAG triad connects Query, Response, and Context: Answer Relevance scores Query → Response, Context Relevance scores Query → Context, and Groundedness scores Context → Response.
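The mapping from the three feedback functions above to the triad's edges can be sketched as plain data (names match the selectors used earlier):

```python
# Each feedback function scores one edge of the RAG triad
rag_triad = {
    "Answer Relevance": ("Query", "Response"),   # f_qa_relevance: .on_input_output()
    "Context Relevance": ("Query", "Context"),   # f_qs_relevance: .on_input().on(context_selection)
    "Groundedness": ("Context", "Response"),     # f_groundedness: .on(context_selection).on_output()
}

for metric, (src, dst) in rag_triad.items():
    print(f"{metric}: {src} -> {dst}")
```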