truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

[BUG][NEED HELP] New astream is slow and has hard to understand Records for feedback-functions #1495

Closed tomdgr closed 1 month ago

tomdgr commented 1 month ago

Bug Description

Issue number 1:

Is there a way to figure out how to correctly pass on_input(), on() and on_output() to the feedback functions in general? For .invoke, Select.Record.app.middle[0].invoke.rets got the context, on_input() got the main input, and on_output() got the response. However, these selectors are not the same for chain.astream:


f_groundedness = (
    Feedback(feedback_groundedness_function, name="Groundedness")
    .on(Select.Record.app.middle[0].invoke.rets)
    .on_output()
    .aggregate(np.mean)
)
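
For reference, on_input() and on_output() are shorthand for selecting the main record input and output (Select.RecordInput and Select.RecordOutput); only the argument passed to .on(...) has to track the chain's call structure, which is what changes between invoke and astream. A minimal sketch of the equivalence, reusing the feedback_cot_function defined in the reproduction code below and assuming the trulens 1.x import layout:

from trulens.core import Feedback, Select

# Equivalent to .on_input().on_output(): explicit main input/output selectors.
f_answer_relevance = (
    Feedback(feedback_cot_function, name="Answer Relevance COT")
    .on(Select.RecordInput)    # same as .on_input()
    .on(Select.RecordOutput)   # same as .on_output()
)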

This may differ between LangChain chains, so here is mine:


async def get_chain(conversation: Conversation, prompt_type: str = "RESPONSE") -> AsyncIterator[str]:
    """
    Asynchronously creates a language model chain and streams the generated responses.

    Args:
        conversation (Conversation): The current conversation context containing messages and other details.
        prompt_type (str): The type of prompt to use, either "RESPONSE" or another supported prompt type. Defaults to "RESPONSE".

    Returns:
        RunnableSequence: The final chain object that can be used for streaming responses.
    """
    hyper_params = initialize_hyperparameters()
    retriever = initialize_retriever(
        hyper_params=hyper_params,
        conversation=conversation
    )
    prompt = construct_prompt(
        conversation=conversation,
        prompt_type=prompt_type,
        hyper_params=hyper_params
    )
    llm = initialize_llm(
        hyper_params=hyper_params
    )

    # context chain
    context_chain = RunnableLambda(lambda x: retriever.get_relevant_documents(x["conversation"]))

    # query chain
    query_chain = RunnableLambda(lambda x: create_query(x["conversation"]))

    # main chain
    final_chain = (
        RunnableParallel(
            {
                "context": context_chain,
                "query": query_chain,
                "history": itemgetter("history"),
            }
        )
        .assign(response=prompt | llm)
        .pick(["response", "context"])
    )

    return final_chain

which results in the following output from chain = await get_chain(conversation=conversation, prompt_type="RESPONSE"):

{
  context: RunnableLambda(lambda x: retriever.get_relevant_documents(x['conversation'])),
  query: RunnableLambda(lambda x: create_query(x['conversation'])),
  history: RunnableLambda(itemgetter('history'))
}
| RunnableAssign(mapper={
    response: PromptTemplate(input_variables=['context', 'history', 'query'], template='"""Answer the question\n  <query>\n{query}\n</query>\n based only on the following context:\n<context>\n{context}\n</context>\n and the following conversation history you previously had:\n<history>\n  {history}\n  </history>\n- Provide a highly detailed, well-organized and comprehensive answer."\n- Include relevant criteria, obligations, procedures, and limitations to ensure clarity and completeness.\n- Highlight the importance of these provisions for parties if there are any.\n- References sources to ensuring clarity.\n- Ensures that all relevant aspects of the question are covered comprehensively.\n- You can also use general knowledge if it is helpful.\n- Answer in the same language as the user quesiton.\n- Add the file url you use from the context\n"""\n')
              | AzureChatOpenAI(verbose=True, callbacks=<langchain_core.callbacks.manager.CallbackManager object at 0x0000029DD45D2110>, client=<openai.resources.chat.completions.Completions object at 0x0000029DD45D2B90>, async_client=<openai.resources.chat.completions.AsyncCompletions object at 0x0000029DD450A2D0>, root_client=<openai.lib.azure.AzureOpenAI object at 0x0000029DD3966CD0>, root_async_client=<openai.lib.azure.AsyncAzureOpenAI object at 0x0000029DD454F010>, temperature=0.3, streaming=True, deployment_name='gpt4o', openai_api_version='2023-12-01-preview')
  })
| RunnablePick(keys=['response', 'context'])

So the main question is: how do I figure out how to properly define my feedback functions for a given LangChain chain? The examples provided in examples/notebooks/ were too simple, and I could not find anything covering this in the documentation.
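
As a starting point for discovering valid selectors (a sketch, assuming the trulens 1.x API), the recorder can print the components it actually instrumented, and TruChain.select_context can try to resolve the retriever output path automatically; the printed paths are the ones that .on(Select.Record...) has to match:

from trulens.apps.langchain import TruChain

# Wrap the chain and list every component/method TruLens instrumented.
recorder = TruChain(app=chain, app_name="selector-inspection")
recorder.print_instrumented()

# Ask TruLens to resolve the retriever/context output path for this chain.
context_lens = TruChain.select_context(chain)
print(context_lens)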

Issue number 2:

To Reproduce

import sys
import os
sys.path.append(os.path.abspath(os.path.join('..', '..')))

import numpy as np

from app.utils.langchain_funcs import get_chain
from app.datastructures import Conversation, Message
from app.config import AZURE_OPENAI_VERSION  # assumed to live in app.config, as in the chain module below
from langchain_core.messages.ai import AIMessageChunk
from langchain.memory import ConversationBufferMemory

# trulens 1.x imports, added for completeness (adjust if your install differs)
from trulens.core import TruSession, Feedback, Select
from trulens.apps.langchain import TruChain
from trulens.providers.openai import AzureOpenAI
# Your existing AzureOpenAI provider setup
openai_provider = AzureOpenAI(
    deployment_name="gpt4o2",
    endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
    api_version=AZURE_OPENAI_VERSION
)

tru = TruSession()
tru.reset_database()
tru.start_dashboard()

def feedback_cot_function(input_data, output_data):
    cot_relevance = openai_provider.relevance_with_cot_reasons(input_data, output_data)
    return cot_relevance

f_feedback_cot = Feedback(feedback_cot_function, name="Answer Relevance COT").on_input().on_output()

def feedback_groundedness_function(context, response):
    groundedness = openai_provider.groundedness_measure_with_cot_reasons(context, response)
    return groundedness

f_groundedness = (
    Feedback(feedback_groundedness_function, name="Groundedness")
    .on(Select.Record.app.middle[0].invoke.rets)
    .on_output()
    .aggregate(np.mean)
)

f_context_relevance = (
    Feedback(openai_provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(Select.Record.app.middle[0].invoke.rets)
    .aggregate(np.mean)
)
# ------------------------------------------------------------------
# Context Relevance
# ------------------------------------------------------------------
f_context_relevance = (
    Feedback(openai_provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(Select.Record.app.middle[0].rets)
    .aggregate(np.mean)
)

# ------------------------------------------------------------------
# Main trigger
# ------------------------------------------------------------------
conversation = Conversation(
    title="Ny Samtale",
    id="conversation-1234",
    filename="sample_file.txt",
    indexName="livsvitenskap",
    promptType="juridisk ekspert",
    messages=[
        Message(role="user", content="Hei, hva er de juridiske implikasjonene for saken?")
    ]
)
chain = await get_chain(conversation=conversation,prompt_type="RESPONSE")

memory = ConversationBufferMemory(memory_key="history", return_messages=False)
# Populate memory with past conversation messages
for message in conversation.messages:
    if message.role == "user":
        memory.chat_memory.add_user_message(message.content)
    elif message.role == "assistant":
        memory.chat_memory.add_ai_message(message.content)
history = memory.load_memory_variables({})["history"]
bot_reply = ""

tru_recorder = TruChain(
    app_name = "testapp",
    app=chain,
    provider=openai_provider,
    feedbacks = [f_feedback_cot]
)
result=""
async with tru_recorder as recording:

    async for chunk in chain.astream({"conversation": conversation, "history": history}):
        if "response" in chunk:
            content = chunk["response"]

            # Ensure content is an AIMessageChunk or compatible structure
            if isinstance(content, AIMessageChunk):
                result += content.content  # Collect full bot reply
            else:
                raise Exception("Something went wrong in yield chunk in chain response, try opening a new chat")

Expected behavior: Groundedness, Context Relevance and Answer Relevance (all with CoT reasons) appearing in the dashboard, comparably fast to chain.invoke, when calling chain.astream instead.


Environment:

Python 3.11.8
trulens 1.0.1
trulens-apps-langchain 1.0.1
trulens-core main (on PR #1487)
trulens-dashboard 1.0.1
trulens_eval 1.0.1
trulens-feedback 1.0.1
trulens-providers-openai 1.0.1
langchain 0.2.16
langchain-cli 0.0.31
langchain-community 0.2.17
langchain-core 0.2.40
langchain-openai 0.1.25
langchain-text-splitters 0.2.4

Additional context The datastructures used can be found in my previous issue #1487

dosubot[bot] commented 1 month ago

Issue 1: Defining Feedback Functions for chain.astream

When using chain.astream, the structure of the data and the way you access it can differ from chain.invoke. Here’s how you can correctly define your feedback functions:

  1. Understanding the Data Structure:

    • For chain.invoke, you used Select.Record.app.middle[0].invoke.rets to get the context.
    • For chain.astream, you need to adapt the selectors to match the streaming data structure.
  2. Defining Feedback Functions:

    • Use on_input() to specify the main app input.
    • Use on_output() to specify the main app output.
    • Use on() to specify other parts of the app record.

Here’s an example of how you can define your feedback functions for chain.astream:

from trulens_eval import Feedback, Select
import numpy as np

# Define feedback functions
def feedback_groundedness_function(context, response):
    groundedness = openai_provider.groundedness_measure_with_cot_reasons(context, response)
    return groundedness

f_groundedness = (
    Feedback(feedback_groundedness_function, name="Groundedness")
    .on(Select.Record.app.middle[0].rets)  # Adjust selector for astream
    .on_output()
    .aggregate(np.mean)
)

f_context_relevance = (
    Feedback(openai_provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(Select.Record.app.middle[0].rets)  # Adjust selector for astream
    .aggregate(np.mean)
)

f_feedback_cot = Feedback(feedback_cot_function, name="Answer Relevance COT").on_input().on_output()

Issue 2: Slow astream Output

The astream method can be slower due to several potential bottlenecks:

  1. Network Latency: Streaming data can be affected by network latency.
  2. Token and Cost Tracking: Managing and processing chunks can introduce overhead.
  3. Garbage Collection: Memory leaks or inefficient garbage collection can slow down the process.
  4. Callback Handling: Inefficient management of callbacks can introduce delays.
  5. Model Configuration: Ensure the model and its configuration are optimized for streaming.
  6. Latency in LLM API Calls: Delays in API calls can impact performance.
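
Before tuning any of these, it can help to measure where the time actually goes. A minimal timing sketch (plain Python, independent of TruLens) that reports time-to-first-chunk and total wall time for astream, which can then be compared against a plain invoke call and against the same run recorded under TruChain:

import time

async def profile_astream(chain, inputs):
    # Report time-to-first-chunk and total wall time for chain.astream.
    start = time.perf_counter()
    first_chunk_at = None
    async for _ in chain.astream(inputs):
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter() - start
    total = time.perf_counter() - start
    return first_chunk_at, total

# Usage inside an async context, e.g.:
#   first, total = await profile_astream(chain, {"conversation": conversation, "history": history})
#   print(f"first chunk after {first:.2f}s, finished after {total:.2f}s")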

To mitigate these issues, consider the following setup:

Example Usage with TruChain

Here’s an example of how to use TruChain with the defined feedback functions and chain.astream:

from trulens_eval import TruChain, TruSession
from app.utils.langchain_funcs import get_chain
from app.datastructures import Conversation, Message
from langchain.memory import ConversationBufferMemory
from langchain_core.messages.ai import AIMessageChunk

# Initialize TruSession
tru = TruSession()
tru.reset_database()
tru.start_dashboard()

# Define conversation
conversation = Conversation(
    title="Ny Samtale",
    id="conversation-1234",
    filename="sample_file.txt",
    indexName="livsvitenskap",
    promptType="juridisk ekspert",
    messages=[Message(role="user", content="Hei, hva er de juridiske implikasjonene for saken?")]
)

# Get chain
chain = await get_chain(conversation=conversation, prompt_type="RESPONSE")

# Initialize memory
memory = ConversationBufferMemory(memory_key="history", return_messages=False)
for message in conversation.messages:
    if message.role == "user":
        memory.chat_memory.add_user_message(message.content)
    elif message.role == "assistant":
        memory.chat_memory.add_ai_message(message.content)
history = memory.load_memory_variables({})["history"]

# Initialize TruChain with feedback functions
tru_recorder = TruChain(
    app_name="testapp",
    app=chain,
    provider=openai_provider,
    feedbacks=[f_feedback_cot, f_groundedness, f_context_relevance]
)

# Stream responses
result = ""
async with tru_recorder as recording:
    async for chunk in chain.astream({"conversation": conversation, "history": history}):
        if "response" in chunk:
            content = chunk["response"]
            if isinstance(content, AIMessageChunk):
                result += content.content
            else:
                raise Exception("Something went wrong in yield chunk in chain response, try opening a new chat")

This setup should help you correctly define and use feedback functions with chain.astream and address the performance issues.
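
After the stream completes, the captured record and feedback results can also be inspected directly (a sketch, assuming the trulens 1.x session API) instead of waiting on the dashboard:

# After the recording context exits, fetch the record for this astream call.
record = recording.get()

# Feedback is computed in the background; wait for results and print them.
for feedback_def, feedback_result in record.wait_for_feedback_results().items():
    print(feedback_def.name, feedback_result.result)

# Aggregate scores per app, without the dashboard.
print(tru.get_leaderboard())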


tomdgr commented 1 month ago

Still need this to be looked at.

sfc-gh-jreini commented 1 month ago

@tomdgr see you marked this as completed. Can you share what the resolution was? Thanks!

tomdgr commented 1 month ago

Hi @sfc-gh-jreini. There is no solution yet, but I have come a lot further.

Here is a general update on what has happened:

  1. I had to replace all trulens package folders correctly (still not quite sure I have done it 100% correctly).
  2. I had to rewrite the LangChain chain to reduce complexity. This got Answer Relevance working.
  3. It is still very slow, and it is still hard to figure out RecordCalls. Here are some of the issues I am facing:
    STDERR
    2024-09-24 15:01:10.262 Uncaught app exception
    Traceback (most recent call last):
    File "C:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\streamlit\runtime\scriptrunner\exec_code.py", line 88, in exec_func_with_error_handling
    result = func()
             ^^^^^^
    File "C:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\streamlit\runtime\scriptrunner\script_runner.py", line 590, in code_to_exec
    exec(code, module.__dict__)
    File "C:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\dashboard\Leaderboard.py", line 839, in <module>
    render_leaderboard(app_name)
    File "C:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\dashboard\Leaderboard.py", line 814, in render_leaderboard
    _render_grid_tab(
    File "C:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\dashboard\Leaderboard.py", line 494, in _render_grid_tab
    grid_data.event_data
    AttributeError: 'AgGridReturn' object has no attribute 'event_data'

image

  4. The context part of my code, after rewriting the chain, is my main research area at the moment. I think I want to try something like https://www.trulens.org/trulens/evaluation/feedback_selectors/selecting_components/ to fix it. However, that requires a further rewrite of my chain.

Here is the new chain:

from langchain_core.output_parsers.string import StrOutputParser
from langchain_openai import AzureChatOpenAI
from app.config import (AZURE_OPENAI_API_KEY,
                      AZURE_OPENAI_ENDPOINT,
                      AZURE_OPENAI_VERSION,
                      AZURE_SEARCH_ENDPOINT,
                      AZURE_SEARCH_KEY
                      )
async def get_chain(conversation: Conversation, prompt_type: str = "RESPONSE") -> AsyncIterator[str]:
    """
    Asynchronously creates a language model chain and streams the generated responses.

    Args:
        conversation (Conversation): The current conversation context containing messages and other details.
        prompt_type (str): The type of prompt to use, either "RESPONSE" or another supported prompt type. Defaults to "RESPONSE".

    Returns:
        RunnableSequence: The final chain object that can be used for streaming responses.
    """
    hyper_params = initialize_hyperparameters()
    retriever = initialize_retriever(
        hyper_params=hyper_params,
        conversation=conversation
    )
    prompt = construct_prompt(
        conversation=conversation,
        prompt_type=prompt_type,
        hyper_params=hyper_params
    )
    llm = AzureChatOpenAI(
        deployment_name=hyper_params.get("deployment_name"),
        openai_api_key=AZURE_OPENAI_API_KEY,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_version=hyper_params.get("azure_openai_version"),
        temperature=hyper_params.get("temperature"),
        streaming=hyper_params.get("streaming")
    )

    # context chain
    context_chain = RunnableLambda(lambda x: retriever.get_relevant_documents(x["conversation"]))

    # query chain
    query_chain = RunnableLambda(lambda x: create_query(x["conversation"]))

    # Create an instance of StrOutputParser
    str_parser = StrOutputParser()

    # main chain
    parallel_chain = (
        RunnableParallel(
            {
                "context": context_chain,
                "query": query_chain,
                "history": itemgetter("history"),
            }
        )

    )
    final_chain = (
        parallel_chain
        | prompt 
        | llm 
        | str_parser
    )
    return final_chain
sfc-gh-chu commented 1 month ago

Hi @tomdgr regarding the aggrid issue, can you verify you're using streamlit-aggrid>=1.0.5 which should resolve that issue? It's possible while reinstalling trulens there may have been a dependency version mismatch.

tomdgr commented 1 month ago

Hi @sfc-gh-chu. I managed to solve that by updating the package. However, the Groundedness feedback now fails with the following:

c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\feedback\llm_provider.py:289: UserWarning: No supporting evidence provided. Returning score only.
  warnings.warn(
Error removing trivial statements: unterminated string literal (detected at line 21) (<string>, line 21). Proceeding with all statements.
Multiple valid rating values found in the string: I'm sorry, I can only provide a score from 0 to 3 based on the overlap of information. Please provide a specific statement for evaluation.
Multiple valid rating values found in the string: I'm sorry, I can only provide a score from 0 to 3. Please provide a specific statement for evaluation.
Multiple valid rating values found in the string: I'm sorry, I can only provide a score from 0 to 3 based on the criteria.
Multiple valid rating values found in the string: I'm sorry, I can only provide a score from 0 to 3 based on the criteria.
Multiple valid rating values found in the string: I'm sorry, I can only provide a score from 0 to 3. Please provide a specific statement for evaluation.
Multiple valid rating values found in the string: I'm sorry, I can only provide a score from 0 to 3.
Multiple valid rating values found in the string: I'm sorry, I can only provide a score from 0 to 3 based on the overlap of information. Please provide a specific statement for evaluation.
Multiple valid rating values found in the string: I'm sorry, I can only provide a score from 0 to 3.
Feedback Function exception caught: Traceback (most recent call last):
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\core\feedback\feedback.py", line 895, in run
    mod_base_endpoint.Endpoint.track_all_costs_tally(
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\core\feedback\endpoint.py", line 516, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\core\feedback\endpoint.py", line 489, in track_all_costs
    return Endpoint._track_costs(
           ^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\core\feedback\endpoint.py", line 581, in _track_costs
    result: T = __func(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\tomg\AppData\Local\Temp\ipykernel_24300\1291823463.py", line 40, in feedback_groundedness_function
    groundedness = openai_provider.groundedness_measure_with_cot_reasons(context, response)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\feedback\llm_provider.py", line 1584, in groundedness_measure_with_cot_reasons
    results.append(future.result())
                   ^^^^^^^^^^^^^^^
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 449, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\_base.py", line 401, in __get_result
    raise self._exception
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\concurrent\futures\thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\core\utils\python.py", line 633, in _future_target_wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\feedback\llm_provider.py", line 1566, in evaluate_hypothesis
    score, reason = self.generate_score_and_reasons(
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\feedback\llm_provider.py", line 282, in generate_score_and_reasons
    re_configured_rating(
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\feedback\generated.py", line 73, in re_configured_rating
    raise ParseError("int or float number", s, pattern=PATTERN_NUMBER)
trulens.feedback.generated.ParseError: Tried to find int or float number using pattern ([+-]?[0-9]+\.[0-9]*|[1-9][0-9]*|0) in
  I'm sorry, I can't assist with that.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "c:\Users\tomg\AppData\Local\Programs\Python\Python311\Lib\site-packages\trulens\core\feedback\feedback.py", line 910, in run
    raise RuntimeError(
RuntimeError: Evaluation of Groundedness failed on inputs: 
{'context': 'some string'

This is the only thing I have not been able to fix yet. Here is the feedback definition I am using:


def feedback_groundedness_function(context, response):
    groundedness = openai_provider.groundedness_measure_with_cot_reasons(context, response)
    return groundedness

f_groundedness = (
    Feedback(feedback_groundedness_function, name="Groundedness")
    .on(Select.Record.app.middle[0].astream.args.input.context)
    .on_output()
)

I am currently using:

retriever = AzureAISearchRetriever(
        service_name = AZURE_SEARCH_ENDPOINT,
        api_key = AZURE_SEARCH_KEY,
        index_name = "livsvitenskap",
        top_k=top_k,
        content_key = "content"
    )
context = TruChain.select_context(chain)
context
# gives output:
Lens().__record__.app.first.steps__.context.first.invoke.rets[:].page_content
# however this is not in Records

image
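
For reference, the pattern documented for selectors from select_context (a sketch, assuming the trulens 1.x API and the simplified chain above) is to feed the returned lens back into the feedback definitions, using .collect() for groundedness so all retrieved chunks are passed as one argument; whether the path shows up still depends on the retriever call actually being instrumented during astream:

from trulens.core import Feedback
import numpy as np

context = TruChain.select_context(chain)

f_groundedness = (
    Feedback(openai_provider.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context.collect())   # all retrieved page_content values as one list
    .on_output()
)

f_context_relevance = (
    Feedback(openai_provider.context_relevance_with_cot_reasons, name="Context Relevance")
    .on_input()
    .on(context)             # evaluated per retrieved chunk
    .aggregate(np.mean)
)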

sfc-gh-dhuang commented 1 month ago

@tomdgr Hi Tom,

could you try updating the groundedness feedback definition by setting the parameter use_sent_tokenize to True, as below?

def feedback_groundedness_function(context, response):
    groundedness = openai_provider.groundedness_measure_with_cot_reasons(context, response, use_sent_tokenize=True)
    return groundedness


Alternatively, the latest TruLens source on the main branch should work without the argument use_sent_tokenize set explicitly to True.

My guess based on the stack trace is that our groundedness feedback function was not able to use the context as the source and generate proper feedback scores, because the previously released version of TruLens used an LLM to perform sentence splitting, which could lead to hallucination (this happens as an intermediate step in groundedness_measure_with_cot_reasons).

Let me know if this works, and I'm happy to dive deeper if not. Thanks!

tomdgr commented 1 month ago

@sfc-gh-dhuang Thank you for the response. I will take a look at this and update you ASAP.

sfc-gh-jreini commented 1 month ago

Hey @tomdgr - any success?

sfc-gh-pdharmana commented 1 month ago

Closing this for now. @tomdgr, please reopen if you need more help.