truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

[BUG] Bedrock provider always gives: "Groundedness per statement in source: 0%" #1150

Open Arseniy-II opened 3 months ago

Arseniy-II commented 3 months ago

Bug Description

I'm following the LangChain quickstart, but I want to use Bedrock both as the LLM and as the evaluation provider. I get answer relevance and context relevance, but groundedness doesn't appear on the dashboard, and the progress output stays at "Groundedness per statement in source: 0%| ".

To Reproduce

Follow the quickstart with a few tweaks:

  1. Instead of llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0), use ChatBedrock (from langchain_aws import ChatBedrock).
  2. Instead of embeddings = OpenAIEmbeddings(), use BedrockEmbeddings (from langchain_community.embeddings import BedrockEmbeddings).
  3. Instead of provider = OpenAI(), use Bedrock (from trulens_eval.feedback.provider import Bedrock).

Expected behaviour

Groundedness should be present on the dashboard and should be higher than 0 for the question used in the quickstart.

Relevant Logs/Tracebacks

✅ In Groundedness, input source will be set to __record__.app.first.steps__.context.first.get_relevant_documents.rets.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In context_relevance_with_cot_reasons, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In context_relevance_with_cot_reasons, input context will be set to __record__.app.first.steps__.context.first.get_relevant_documents.rets .
Starting dashboard ...
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/Arseniy_Ivanov/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
Groundedness per statement in source:   0%|                                        | 0/3 [00:00<?, ?it/s]

Environment:

dosubot[bot] commented 3 months ago

The issue with the Bedrock provider always giving "Groundedness per statement in source: 0%" is likely due to the feedback function for groundedness not being correctly set up or the context not being properly collected and utilized.

Steps to Resolve

  1. Ensure Context Collection: Verify that the context is correctly collected and passed to the feedback function. Ensure that the context.collect() method is working as expected and that the context chunks are being properly utilized in the feedback function.

  2. Check Response Handling: Ensure that the response handling logic is correctly implemented and that the necessary metrics are being extracted and utilized.

  3. Verify Supporting Evidence: Ensure that the response from the model includes the required fields and that the parsing logic correctly extracts and processes this information.
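
As a rough illustration of step 3, the sketch below checks that a model response actually contains a score before using it. The "Score: <0-10>" line format, the pattern, and the function name are assumptions made for this example; the real prompt template and parser in trulens_eval may differ.

```python
import re
from typing import List

# Hypothetical format: a line such as "Score: 7" with an integer from 0 to 10.
SCORE_PATTERN = re.compile(r"Score:\s*(\d{1,2})\b")


def extract_normalized_scores(response_text: str) -> List[float]:
    """Return every 0-10 score found in the text, rescaled to 0.0-1.0.

    Raises ValueError instead of returning an empty result, so that an
    empty or malformed model response is visible to the caller.
    """
    if not response_text.strip():
        raise ValueError("Model response is empty; nothing to score.")
    scores = [int(match) for match in SCORE_PATTERN.findall(response_text)]
    if not scores:
        raise ValueError("No 'Score: <n>' field found in the model response.")
    for score in scores:
        if not 0 <= score <= 10:
            raise ValueError(f"Score {score} is outside the 0-10 range.")
    return [score / 10 for score in scores]
```

Raising on an empty response, rather than returning an empty list, keeps a provider-side problem from showing up only as a missing metric.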

Example Code Snippet

Here is an example of how you might modify the code to ensure that the context is correctly collected and passed to the feedback function:

from trulens_eval.app import App
from trulens_eval.feedback import Feedback
from trulens_eval.feedback.provider import Bedrock

# Initialize the Bedrock provider
provider = Bedrock()

# Select the retrieved context from the app.
# rag_chain is assumed to be the chain built earlier in the quickstart.
context = App.select_context(rag_chain)

# Define the groundedness feedback function.
# context.collect() is a selector (a path into the record), not the chunks
# themselves, so it cannot be checked for emptiness at definition time.
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())  # Collect the context chunks into a single list
    .on_output()
)

Additional Debugging

To further debug the issue, you can add print statements or logging to verify that the context is being collected and passed correctly:

# context.collect() returns a selector (a path into the record), not a list,
# so print it to confirm it points at the retriever output
context_selector = context.collect()
print("Context selector:", context_selector)

# Attach the selector to the feedback function
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context_selector)
    .on_output()
)

By ensuring that the context is correctly collected and passed to the feedback function, you should be able to resolve the issue with the groundedness evaluation and obtain a score higher than 0%.

#### Details

**Sources**

- [trulens_eval/examples/quickstart/langchain_quickstart.ipynb](https://github.com/truera/trulens/blob/main/trulens_eval/examples/quickstart/langchain_quickstart.ipynb)
- [trulens_eval/trulens_eval/feedback/provider/endpoint/bedrock.py](https://github.com/truera/trulens/blob/main/trulens_eval/trulens_eval/feedback/provider/endpoint/bedrock.py)
- [trulens_eval/trulens_eval/feedback/provider/bedrock.py](https://github.com/truera/trulens/blob/main/trulens_eval/trulens_eval/feedback/provider/bedrock.py)

Arseniy-II commented 3 months ago

I collected the context exactly like that, and it looks like it is collected properly.

Arseniy-II commented 3 months ago

So far I have figured out that the generate_score_and_reasons methods fail silently. It would be very helpful to get an error back when these methods fail.
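
One generic way to surface failures like this is sketched below, assuming the evaluations run as callables on a thread pool. The names run_feedbacks and feedback_debug are hypothetical and are not part of trulens_eval; the point is only that each exception is logged and returned instead of being dropped.

```python
import logging
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional, Tuple

logger = logging.getLogger("feedback_debug")

Outcome = Tuple[Optional[float], Optional[BaseException]]


def run_feedbacks(feedbacks: Dict[str, Callable[[], float]]) -> Dict[str, Outcome]:
    """Run each callable on a thread pool and report failures explicitly.

    Returns a mapping of name -> (score, error); exactly one of the two
    is None for every entry.
    """
    results: Dict[str, Outcome] = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = {name: pool.submit(fn) for name, fn in feedbacks.items()}
        for name, future in futures.items():
            error = future.exception()  # Blocks until the task has finished
            if error is not None:
                logger.error("Feedback %r failed: %r", name, error)
                results[name] = (None, error)
            else:
                results[name] = (future.result(), None)
    return results
```

With this shape, a groundedness call that raises would show up as a logged error and a (None, error) entry rather than as a progress bar stuck at 0%.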

Arseniy-II commented 3 months ago

In bedrock.py, in the method _create_chat_completion:

For amazon.titan-text-express-v1, json.loads(response.get('body').read()) returns an empty string.

For 'anthropic.claude-3-sonnet-20240229-v1:0', json.loads(response.get('body').read()) returns an empty string as well.
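
When inspecting a response like this, it helps to read the body exactly once and check the raw text before parsing it. The sketch below uses io.BytesIO as a stand-in for the streaming body (an assumption for illustration; the real object comes from the Bedrock client), and read_body_once is a hypothetical helper name.

```python
import io
import json


def read_body_once(response: dict) -> dict:
    """Read a file-like response body a single time and parse it as JSON.

    A stream is exhausted after the first read(), so the raw bytes are
    kept in a local variable instead of calling read() again.
    """
    raw = response["body"].read()
    text = raw.decode("utf-8")
    if not text.strip():
        raise ValueError("Response body is empty.")
    return json.loads(text)


# Stand-in response; the payload shape is only an example.
fake_response = {"body": io.BytesIO(b'{"results": [{"outputText": "Score: 8"}]}')}
parsed = read_body_once(fake_response)
print(parsed["results"][0]["outputText"])
```

Calling read() a second time on the same stand-in stream returns empty bytes, which is one way a debugging session can end up looking at an empty result.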

I believe this is because the request body is formatted incorrectly. It currently looks like this:

body = json.dumps(
    {
        "prompt": f"\n\nHuman:{messages_str}\n\nAssistant:",
        "temperature": 0,
        "top_p": 1,
        "max_tokens_to_sample": 4095
    }
)

But it should be like this:

bodyNew = json.dumps(
    {
        "system": "Respond as a helpful assistant",
        "messages": [{"role": "user", "content": f"\n\nHuman:{messages_str}\n\nAssistant:"}],
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 4095,
    }
)
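
Since the two models mentioned above expect different request shapes, one option is to choose the body by model ID. The sketch below is a minimal illustration, not the trulens_eval implementation: build_bedrock_body and extract_text are hypothetical names, and the field names follow the Bedrock request and response formats for the Anthropic Messages API and Titan Text as I understand them, so they should be checked against the current AWS documentation. To my understanding, the Messages format does not need the Human:/Assistant: markers, so the prompt is passed as plain user content.

```python
import json


def build_bedrock_body(model_id: str, prompt: str, max_tokens: int = 1024) -> str:
    """Return a JSON request body shaped for the given model family."""
    if model_id.startswith("anthropic.claude-3"):
        payload = {
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": max_tokens,
            "temperature": 0,
            "messages": [{"role": "user", "content": prompt}],
        }
    elif model_id.startswith("amazon.titan-text"):
        payload = {
            "inputText": prompt,
            "textGenerationConfig": {
                "maxTokenCount": max_tokens,
                "temperature": 0,
                "topP": 1,
            },
        }
    else:
        raise ValueError(f"No request format known for model {model_id!r}")
    return json.dumps(payload)


def extract_text(model_id: str, response_body: dict) -> str:
    """Pull the generated text out of a parsed response body."""
    if model_id.startswith("anthropic.claude-3"):
        return response_body["content"][0]["text"]
    if model_id.startswith("amazon.titan-text"):
        return response_body["results"][0]["outputText"]
    raise ValueError(f"No response format known for model {model_id!r}")
```

Raising on an unknown model ID, instead of falling back to a default body, avoids sending a request in a format the model does not accept.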

joshreini1 commented 2 months ago

Thanks @Arseniy-II - will take a look