truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

Groundedness evaluating sentences that don't exist in the LLM response #1059

Closed elizabethbmj closed 5 months ago

elizabethbmj commented 6 months ago

Bug Description Groundedness is low, even when it looks like it should be reasonable. To investigate this, I've been looking at how groundedness is calculated. It seems that the STATEMENTs being evaluated are not all in the actual response given. Here is my response: """For a patient presenting with acute heart failure, especially in cases of pulmonary edema or cardiogenic shock, initial management steps may include:

  1. Ensuring maintenance of adequate oxygenation.
  2. Maintaining a patent airway.
  3. Implementing a low salt diet.
  4. Restricting daily fluid intake.
  5. Controlling precipitating factors such as pain and agitation.
  6. Considering venous thromboembolism prophylaxis for all patients.
  7. Administering intravenous iron supplementation to patients with acute heart failure and reduced ejection fraction who are iron deficient.
  8. In cases of persistent cardiogenic shock despite inotropic therapy, temporary mechanical circulatory support (MCS) devices like extracorporeal membrane oxygenation or intra-aortic balloon pump may be considered.

Remember to individualize the management based on the patient's specific clinical presentation and needs. """

Here are the statements being evaluated: """STATEMENT 0: Statement Sentence: For a patient presenting with acute heart failure, especially in cases of pulmonary edema or cardiogenic shock, initial management steps may include: Supporting Evidence: NOTHING FOUND Score: 0

STATEMENT 1: Statement Sentence: Ensuring maintenance of adequate oxygenation. Supporting Evidence: NOTHING FOUND Score: 0

STATEMENT 2: Statement Sentence: The treatment algorithm for Acute heart failure includes recommendations for supportive care and venous thromboembolism prophylaxis. Supporting Evidence: NOTHING FOUND Score: 0

STATEMENT 3: Statement Sentence: Maintaining a patent airway. Supporting Evidence: NOTHING FOUND Score: 0

STATEMENT 4: Statement Sentence: The treatment algorithm for acute heart failure includes supportive care such as maintenance of adequate oxygenation, patent airways, low salt diet, and restriction of daily fluid intake. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 5: Statement Sentence: Implementing a low salt diet. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 6: Statement Sentence: The treatment algorithm for acute heart failure includes supportive care such as maintenance of oxygenation, low salt diet, and fluid intake restriction. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 7: Statement Sentence: Restricting daily fluid intake. Supporting Evidence: NOTHING FOUND Score: 0

STATEMENT 8: Statement Sentence: The treatment algorithm for acute heart failure includes supportive care such as maintenance of adequate oxygenation, patent airways, low salt diet, and restriction of daily fluid intake. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 9: Statement Sentence: Controlling precipitating factors such as pain and agitation. Supporting Evidence: NOTHING FOUND Score: 0

STATEMENT 10: Statement Sentence: The treatment algorithm for acute heart failure includes supportive care such as maintenance of adequate oxygenation, low salt diet, and restriction of daily fluid intake. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 11: Statement Sentence: Considering venous thromboembolism prophylaxis for all patients. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 12: Statement Sentence: The treatment algorithm for acute heart failure includes recommendations for venous thromboembolism prophylaxis in all patients. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 13: Statement Sentence: Administering intravenous iron supplementation to patients with acute heart failure and reduced ejection fraction who are iron deficient. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 14: Statement Sentence: The treatment algorithm for acute heart failure includes supportive care such as maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 25: Timeframe: acute. Patient group: ('hypertensive crisis', ''). Treatment: SUPPORTIVE CAREContinued supportive care includes maintenance of adequate oxygenation, patent airways, a low salt diet, and restriction of daily fluid intake.Precipitating factors such as pain and agitation should also be controlled.Venous thromboembolism prophylaxis is recommended in all patients.Patients with acute heart failure with reduced ejection fraction who are iron deficient should receive intravenous iron supplementation. Score: 10

STATEMENT 15: Statement Sentence: In cases of persistent cardiogenic shock despite inotropic therapy, temporary mechanical circulatory support (MCS) devices like extracorporeal membrane oxygenation or intra-aortic balloon pump may be considered. Supporting Evidence: The treatment_algorithm of Acute heart failure, part 19: Timeframe: acute. Patient group: ('hypotensive (systolic BP <90 mmHg)', ''). Treatment: TEMPORARY MECHANICAL CIRCULATORY SUPPORT (MCS)MCS devices (e.g., extracorporeal membrane oxygenation or intra-aortic balloon pump) should be considered in patients with persistent cardiogenic shock despite inotropic therapy. Score: 10

STATEMENT 16: Statement Sentence: Remember to individualize the management based on the patient's specific clinical presentation and needs. Supporting Evidence: NOTHING FOUND Score: 0 """
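To see how the per-statement scores listed above could add up to a low overall value, here is a minimal sketch. It assumes the overall score is simply the mean of the statement scores rescaled to the 0-1 range; `mean_normalized` is a hypothetical helper for illustration, not a TruLens function, and the library's actual aggregation may differ.

```python
# Per-statement scores copied from STATEMENT 0 through STATEMENT 16 above.
scores = [0, 0, 0, 0, 10, 10, 10, 0, 10, 0, 10, 10, 10, 10, 10, 10, 0]


def mean_normalized(statement_scores, max_score=10):
    """Average the scores and rescale to 0-1 (assumed aggregation, see lead-in)."""
    return sum(statement_scores) / (len(statement_scores) * max_score)


overall = mean_normalized(scores)
# Indices of the statements for which no supporting evidence was found.
unsupported = [i for i, s in enumerate(scores) if s == 0]

print(f"overall: {overall:.3f}")
print(f"statements scored 0: {unsupported}")
```

Under that assumption, every statement scored 0 pulls the average down by the same amount, whether or not it is really part of the response.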

There is a mismatch between the actual LLM response and the statements evaluated. Also, it's pretty random whether it finds supporting information or not. For instance, STATEMENT 6 is almost a repeat of the previous statements and gets a score of 10, whereas the previous statements get 0.
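A quick way to confirm the mismatch is to check which evaluated statements do not occur verbatim in the response. The sketch below uses only the standard library; `normalize` and `statements_not_in_response` are hypothetical helper names, and the inputs are short excerpts from the response and statements quoted above.

```python
import re


def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not cause false mismatches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def statements_not_in_response(response, statements):
    """Return the statements that do not appear verbatim in the response."""
    haystack = normalize(response)
    return [s for s in statements if normalize(s) not in haystack]


# Excerpt of the LLM response (items 1 to 3 of the list).
response = """
  1. Ensuring maintenance of adequate oxygenation.
  2. Maintaining a patent airway.
  3. Implementing a low salt diet.
"""

# Statement sentences of STATEMENT 1, 2 and 3.
statements = [
    "Ensuring maintenance of adequate oxygenation.",
    "The treatment algorithm for Acute heart failure includes recommendations "
    "for supportive care and venous thromboembolism prophylaxis.",
    "Maintaining a patent airway.",
]

missing = statements_not_in_response(response, statements)
for s in missing:
    print("not in response:", s)
```

On these excerpts only STATEMENT 2 is flagged, which matches the observation that some evaluated sentences were never part of the response.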

To Reproduce This is how I'm running it:

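# Imports assume the pre-1.0 trulens_eval package layout, LangChain and NumPy.
import numpy as np
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from trulens_eval import Feedback, Tru, TruChain
from trulens_eval.app import App
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.openai import OpenAI

# system_prompt, retriever_linked_chunks, llm and sub_questions are defined elsewhere (not shown).
tru = Tru()

def format_docs(docs):
    # Join the retrieved chunks into a single context string.
    return "\n\n".join(doc.page_content for doc in docs)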

prompt = ChatPromptTemplate.from_messages([
    ("system", system_prompt),
    ("human", "{question}"),
    ("assistant", "Here is the context: {context}")
])

chain = (
    {"context": retriever_linked_chunks | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

openai = OpenAI()

context = App.select_context(chain)

grounded = Groundedness(groundedness_provider=OpenAI())

f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(context.collect()) # collect context chunks into a list
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

f_qa_relevance = Feedback(openai.relevance_with_cot_reasons).on_input_output()

f_context_relevance = (
    Feedback(openai.qs_relevance_with_cot_reasons)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

tru_recorder = TruChain(chain,
    app_id='sub_questions',
    feedbacks=[f_qa_relevance, f_context_relevance, f_groundedness])

for i in sub_questions:
    with tru_recorder as recording:
        llm_response = chain.invoke(f"{i}")

# Only the recording from the last loop iteration is still in scope here.
rec = recording.get()
recs = recording.records

for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)

sub_q, feedback = tru.get_records_and_feedback(app_ids=['sub_questions'])

sub_q.to_csv('sub_question.csv', index=False)

tru.run_dashboard()

Expected behavior I would expect more consistency between the actual LLM response and the statements evaluated. I would also expect more consistent scores for text that appears in the source documents.

Relevant Logs/Tracebacks no tracebacks - just an observation that I haven't understood yet.

Environment:

dosubot[bot] commented 6 months ago

Hello @elizabethbmj! 👋 I'm Dosu, a bot here to help you solve bugs, answer questions, and guide you on how to become a contributor while you're waiting for a human maintainer. Think of me as your first point of contact for any issues or questions you might have. I'll be taking a look at your issue regarding the inconsistency in groundedness evaluation and will provide you with a full answer shortly. Stay tuned!

joshreini1 commented 6 months ago

Hey @elizabethbmj - since 0.18.1 we stopped relying on an LLM for the initial statement extraction, which improves the reliability of this step in measuring groundedness. Can you upgrade and try again?

pip install -U trulens_eval
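To illustrate why extracting statements without an LLM keeps them tied to the response, here is a minimal rule-based splitter. This is a sketch only and not the TruLens implementation: `split_into_statements` is a hypothetical helper, and the regular expressions are assumptions chosen for this example.

```python
import re


def split_into_statements(response):
    """Split a response into candidate statements without calling an LLM."""
    statements = []
    for line in response.splitlines():
        # Drop list markers such as "1." or "-" at the start of a line.
        line = re.sub(r"^\s*(?:\d+\.|[-*])\s+", "", line).strip()
        if not line:
            continue
        # Split after '.', '!' or '?' when followed by whitespace.
        statements.extend(p for p in re.split(r"(?<=[.!?])\s+", line) if p)
    return statements


response = (
    "For a patient presenting with acute heart failure, especially in cases of "
    "pulmonary edema or cardiogenic shock, initial management steps may include:\n"
    "\n"
    "  1. Ensuring maintenance of adequate oxygenation.\n"
    "  2. Maintaining a patent airway.\n"
    "\n"
    "Remember to individualize the management based on the patient's specific "
    "clinical presentation and needs."
)

statements = split_into_statements(response)
for i, s in enumerate(statements):
    print(f"STATEMENT {i}: {s}")
```

Because each piece is cut directly out of the response text, a splitter like this cannot produce a sentence that the response does not contain.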

dosubot[bot] commented 6 months ago

Based on the conversation and the actions taken, it seems there was no specific action executed that required a direct output from me, Dosu. However, based on the conversation flow and the information provided by the maintainer, joshreini1, here's a synthesized response that could be helpful:


Hello @elizabethbmj,

Thank you for bringing this issue to our attention. It appears that the inconsistency you've encountered in the groundedness evaluation might be related to the version of TruLens you are using. As mentioned by joshreini1, since version 0.18.1, there have been significant improvements in the process, particularly with the initial statement extraction phase, which now no longer relies on an LLM. This change aims to enhance the reliability of the groundedness measurement.

To potentially resolve the issues you're experiencing, we recommend upgrading to the latest version of TruLens. You can do this by running the following command in your environment:

pip install -U trulens_eval

After upgrading, please try running your evaluation again to see if the consistency in groundedness evaluation improves. If you continue to encounter issues or have any more questions, feel free to reach out. We're here to help!

Best regards, Dosu

elizabethbmj commented 6 months ago

Thanks @joshreini1, that is very helpful. I upgraded to 0.27.1, and the sentence structure is now preserved between the LLM's response and the STATEMENTs being evaluated.

I have noticed there is still a little inconsistency in the evaluation, though. See the example below:

STATEMENT 0: Statement Sentence: When considering treating a moderately depressed child with SSRIs, it is important to note that SSRIs have been associated with a greater incidence of adverse effects, including serious ones like suicidality. Supporting Evidence: NOTHING FOUND Score: 0

STATEMENT 3: Statement Sentence: It is also crucial to set realistic expectations for the child and caregiver as not all children with depression will respond to SSRIs. Supporting Evidence: The treatment_algorithm of Depression in children, part 7.2: treated with ssris reported a greater incidence of adverse effects than those treated with placebo, including serious adverse effects such as suicidality, emphasizing the need for a careful and individualized harm - benefit analysis prior to treatment. when discussing the use of an ssri for treatment of depression, it is important to set realistic expectations for the child and caregiver. not all children with depression will respond to an ssri. Score: 10

The supporting evidence for STATEMENT 0 is contained in the supporting evidence for STATEMENT 3, so I'm not sure why the evaluation would say nothing was found.

joshreini1 commented 5 months ago

Thanks @elizabethbmj - this could be related to the clause in STATEMENT 0: "When considering treating a moderately depressed child with SSRIs" as the supporting evidence for Statement 3 doesn't mention specific considerations for a child with moderate depression.

That being said, this may be further improved by a recent change to the groundedness prompt that no longer requires an exact match for supporting evidence (https://github.com/truera/trulens/pull/1112). To try, feel free to install from main or wait until the next trulens-eval release. Thanks!
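The point about the opening clause of STATEMENT 0 can be checked roughly by listing the words in the statement that never occur in the evidence quoted for STATEMENT 3. This is only a word-overlap sketch using the standard library; it is not how the groundedness prompt judges support, and `words` is a hypothetical helper.

```python
import re


def words(text):
    """Return the set of lowercased alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


# Statement sentence of STATEMENT 0, copied from the comment above.
statement = (
    "When considering treating a moderately depressed child with SSRIs, it is "
    "important to note that SSRIs have been associated with a greater incidence "
    "of adverse effects, including serious ones like suicidality."
)

# Supporting evidence reported for STATEMENT 3, copied from the comment above.
evidence = (
    "The treatment_algorithm of Depression in children, part 7.2: treated with "
    "ssris reported a greater incidence of adverse effects than those treated "
    "with placebo, including serious adverse effects such as suicidality, "
    "emphasizing the need for a careful and individualized harm - benefit "
    "analysis prior to treatment. when discussing the use of an ssri for "
    "treatment of depression, it is important to set realistic expectations for "
    "the child and caregiver. not all children with depression will respond to "
    "an ssri."
)

# Words that appear in the statement but nowhere in the evidence.
missing = sorted(words(statement) - words(evidence))
print(missing)
```

Words such as "moderately" and "depressed" have no counterpart in the evidence, which fits the explanation that a judge requiring an exact match could report nothing found.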