truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

[BUG] Inconsistent context relevance displayed status #1254

Open kaoru-will opened 6 days ago

kaoru-will commented 6 days ago

Bug Description The Context Relevance feedback function I am trying to use gives inconsistent results. Sometimes it produces a successful feedback, but most of the time it produces a failed feedback.

To Reproduce My code:

def get_prebuilt_trulens_recorder(query_engine, app_id):

    context = App.select_context(query_engine)

    f_answer_relevance = Feedback(
        provider.relevance_with_cot_reasons, name= "Answer_Relevance"
    ).on_input().on_output()

    # Define a groundedness feedback function
    f_groundedness = (
        Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
        .on(context.collect())
        .on_output()
    )
    # Question/statement relevance between question and each context chunk.

    f_context_relevance = (
        Feedback(provider.context_relevance_with_cot_reasons, name="Context_Relevance")
        .on_input()
        .on(context)
        .aggregate(np.mean)
    )

    f_context_relevance("what is the capital of France?","France is a unitary, semi-presidential republic with its capital in Paris, it's largest city and main cultural and commercial center.")

    # print(f_context_relevance)

    feedbacks = [f_answer_relevance, f_context_relevance, f_groundedness]

    tru_recorder = TruLlama(
        query_engine,
        app_id=app_id,
        feedbacks=feedbacks,
    )

    return tru_recorder

tru_recorder = get_prebuilt_trulens_recorder(indexes[collection_id],
                                             app_id="Test")

with tru_recorder as recording:
    window_response = indexes[collection_id].query(user_query)
    print(str(window_response))

recs = recording.get()
print("input: ", recs.main_input)
print("output: ", recs.main_output)
trulens_dictionary['input'] = recs.main_input
trulens_dictionary['output'] = recs.main_output
for feedback, feedback_result in recs.wait_for_feedback_results().items():
    # print(feedback.name, feedback_result.result)
    # print(feedback.name, feedback_result.calls)
    trulens_dictionary[feedback.name] = feedback_result.result
    trulens_dictionary[feedback.name + "_calls"] = json.dumps([ob.__dict__ for ob in feedback_result.calls])
    # trulens_dictionary[feedback.name + "_error"] = feedback_result.error
    # trulens_dictionary[feedback.name + "_def"] = feedback
    # if feedback.name == "Context_Relevance":
    print(feedback)
    print(feedback_result)

Expected behavior I have noted that feedbacks don't immediately provide their results, which is why I used wait_for_feedback_results so that I can iterate over each feedback result and get the results I expect.
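
For reference, here is roughly how the failed results could be surfaced while iterating (a minimal sketch on my side; I am assuming the FeedbackResult object exposes the status and error attributes that show up in the printed output below and in my commented-out code above):

for feedback, feedback_result in recs.wait_for_feedback_results().items():
    # The FeedbackResultStatus (DONE / FAILED) is what appears in the printed results below.
    print(feedback.name, feedback_result.status, feedback_result.result)
    if feedback_result.result is None:
        # Assumption: feedback_result.error holds the failure details, the same
        # attribute I commented out in the loop above.
        print(feedback.name, "error:", feedback_result.error)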

2024-06-25 18:23:24.443 DEBUG    Looking via __find_tracker; found <frame at 0x0000023AD1FEE510, file 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\trulens_eval\\feedback\\provider\\endpoint\\base.py', line 574, code _track_costs>
2024-06-25 18:23:24.444 DEBUG    Handling callback_class: <class 'trulens_eval.feedback.provider.endpoint.openai.OpenAICallback'>.
2024-06-25 18:23:24.444 DEBUG    Handling endpoint openai.
2024-06-25 18:23:24.444 DEBUG    Handling openai instrumented call to func: <function Completions.create at 0x0000023AB9BFF1C0>,
        bindings: <BoundArguments (self=<openai.resources.chat.completions.Completions object at 0x0000023AC22CD870>, messages=[{'role': 'system', 'content': 'You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.\n        Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\n        A few additional scoring guidelines:\n\n        - Long CONTEXTS should score equally well as short CONTEXTS.\n\n        - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.\n\n        - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.\n\n        - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n        - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n        - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n        - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n        - Never elaborate.'}, {'role': 'user', 'content': 'QUESTION: how do owls fly\n\n        CONTEXT: All owls have the same general appearance, which is characterized by a flat face with a small hooked <a href="https://www.britannica.com/science/beak" class="md-crosslink autoxref" data-show-preview="true">beak</a> and large, forward-facing eyes.  The tail is short and the wings are rounded.  Like the diurnal <a href="https://www.britannica.com/animal/bird-of-prey" class="md-crosslink" data-show-preview="true">birds of prey</a> (order <a href="https://www.britannica.com/animal/falconiform" class="md-crosslink" data-show-preview="true">Falconiformes</a>), they have large feet with sharp talons. \n        \n        \nPlease answer using the entire template below.\n\nTEMPLATE: \nScore: <The score 0-10 based on the given criteria>\nCriteria: <Provide the criteria for this evaluation>\nSupporting Evidence: <Provide your reasons for scoring based on the listed criteria step by step. Tie it back to the evaluation being completed.>\n '}], model='gpt-3.5-turbo', seed=123, temperature=0.0)>,
        response: ChatCompletion(id='chatcmpl-9dxcYKpeZwzCRVvD8ehQnA1odRWU0', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='TEMPLATE: \nScore: 4\nCriteria: The context provides some relevant information about the physical characteristics of owls, but does not directly address how owls fly.\nSupporting Evidence: The context describes the general appearance of owls, including their flat face, small hooked beak, large forward-facing eyes, short tail, rounded wings, and large feet with sharp talons. While this information gives insight into the physical attributes of owls, it does not directly explain how owls fly. Therefore, the relevance is limited to the description of owl characteristics rather than their flying mechanism.', role='assistant', function_call=None, tool_calls=None))], created=1719311002, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=118, prompt_tokens=501, total_tokens=619))
2024-06-25 18:23:24.450 INFO     ✅ feedback result Context_Relevance DONE feedback_result_hash_0cba358c039a060e8668d5a6cb078daa
FeedbackDefinition(Answer_Relevance,
        selectors={'prompt': Lens().__record__.main_input, 'response': Lens().__record__.main_output},
        if_exists=None
)
Answer_Relevance (FeedbackResultStatus.DONE) = 0.9
  prompt = how do owls fly
  response = Owls fly by using sustained flight over grassland to catch rodents, dropping into the grass from sustained flight. Woodland owls secure prey by dropping from perches at the edges of forest openings. Additionally, some owls like the Southeast Asian hawk owl sally from a perch to take flying insects, while the whiskered owl takes flying insects in foliage.
  ret = 0.9
  meta =
    {'reason': 'Criteria: The response provides relevant information on how owls '
               'fly, covering various aspects of their hunting techniques.\n'
               'Supporting Evidence: The response details different hunting '
               'techniques used by owls, such as sustained flight over grassland, '
               'dropping from perches at the edges of forest openings, and '
               'sallying from perches to catch flying insects. This information '
               'directly addresses how owls fly and hunt for prey, demonstrating a '
               'comprehensive understanding of their flight behavior. Overall, the '
               'response is highly relevant to the prompt, earning a high score of '
               '9.'}

FeedbackDefinition(Context_Relevance,
        selectors={'question': Lens().__record__.main_input, 'context': Lens().__record__.app.query.rets.source_nodes[:].node.text},
        if_exists=None
)
Context_Relevance (FeedbackResultStatus.DONE) = 0.25
  question = how do owls fly
  context = Specialized forms of <span id="ref849947"></span><a href="https://www.britannica.com/science/feeding-behaviour" class="md-crosslink" data-show-preview="true">feeding behaviour</a> have been observed in some owls.  The <span id="ref115755"></span><a href="https://www.britannica.com/animal/elf-owl" class="md-crosslink" data-show-preview="true">elf owl</a> (<em>Micrathene whitneyi</em>), for instance, has been seen hovering before blossoms, where it scares insects into flight with its wings and then catches them with its beak.  A <span id="ref115756"></span><a href="https://www.britannica.com/animal/bay-owl" class="md-crosslink" data-show-preview="true">bay owl</a> (<em>Phodilus badius</em>) has been documented stationing itself within a cave to catch bats as they issued forth at dusk.
  ret = 0.2
  meta =
    {'reason': 'Criteria: The context does not directly address how owls fly.\n'
               'Supporting Evidence: The context discusses specialized feeding '
               'behavior observed in some owls, such as the elf owl hovering '
               'before blossoms to catch insects and the bay owl stationing itself '
               'in a cave to catch bats. While this information provides insight '
               'into how owls hunt for food, it does not address how owls fly. '
               'Therefore, the relevance to the question is minimal.'}
  question = how do owls fly
  context = In general the type of prey taken is <a class="md-dictionary-link md-dictionary-tt-off eb" data-term="dictated" href="https://www.britannica.com/dictionary/dictated" data-type="EB">dictated</a> by the size of the owl and by the relative abundance of potential prey.  Owls that hunt over grassland, such as the barn owl and short-eared owl, hunt by sustained flight, dropping into the grass to catch rodents.  Many woodland owls secure prey by dropping from perches at the edges of forest openings.
  ret = 0.2
  meta =
    {'reason': 'Criteria: The context provides some information about how owls '
               'hunt for prey, but does not directly address how owls fly.\n'
               'Supporting Evidence: The context discusses how owls hunt for prey '
               'by dropping into grass or from perches, but it does not '
               'specifically address how owls fly in terms of their flight '
               'mechanics or abilities. Therefore, the information provided is '
               'only somewhat relevant to the question of how owls fly.'}
  question = how do owls fly
  context = The <span id="ref115753"></span><a href="https://www.britannica.com/animal/Oriental-hawk-owl" class="md-crosslink">Southeast Asian hawk owl</a> (<em>Ninox scutulata</em>) sallies from a perch to take flying insects.  The <span id="ref115754"></span>whiskered owl (<em>Otus trichopsis</em>) takes flying insects in foliage.  <a href="https://www.britannica.com/animal/fish-owl" class="md-crosslink" data-show-preview="true">Fish owl</a>s (<em>Ketupa</em> and <em>Scotopelia</em>) are adapted for taking live <a href="https://www.britannica.com/animal/fish" class="md-crosslink autoxref" data-show-preview="true">fish</a> but also eat other animals.
  ret = 0.2
  meta =
    {'reason': 'Criteria: The context provides some information about different '
               'types of owls and their hunting behaviors, but does not directly '
               'address how owls fly.\n'
               'Supporting Evidence: The context mentions Southeast Asian hawk '
               'owls sallying from a perch to take flying insects, whiskered owls '
               'taking flying insects in foliage, and fish owls being adapted for '
               'taking live fish. While this information gives insight into the '
               'hunting behaviors of different owl species, it does not directly '
               'address the mechanics of how owls fly. Therefore, the relevance to '
               'the question "how do owls fly" is limited.'}
  question = how do owls fly
  context = All owls have the same general appearance, which is characterized by a flat face with a small hooked <a href="https://www.britannica.com/science/beak" class="md-crosslink autoxref" data-show-preview="true">beak</a> and large, forward-facing eyes.  The tail is short and the wings are rounded.  Like the diurnal <a href="https://www.britannica.com/animal/bird-of-prey" class="md-crosslink" data-show-preview="true">birds of prey</a> (order <a href="https://www.britannica.com/animal/falconiform" class="md-crosslink" data-show-preview="true">Falconiformes</a>), they have large feet with sharp talons.
  ret = 0.4
  meta =
    {'reason': 'Criteria: The context provides some relevant information about the '
               'physical characteristics of owls, but does not directly address '
               'how owls fly.\n'
               'Supporting Evidence: The context describes the general appearance '
               'of owls, including their flat face, small hooked beak, large '
               'forward-facing eyes, short tail, rounded wings, and large feet '
               'with sharp talons. While this information gives insight into the '
               'physical attributes of owls, it does not directly explain how owls '
               'fly. Therefore, the relevance is limited to the description of owl '
               'characteristics rather than their flying mechanism.'}

FeedbackDefinition(Groundedness,
        selectors={'source': Lens().__record__.app.query.rets.source_nodes[:].node.text.collect(), 'statement': Lens().__record__.main_output},   
        if_exists=None
)
Groundedness (FeedbackResultStatus.DONE) = 1.0
  source = ['Specialized forms of <span id="ref849947"></span><a href="https://www.britannica.com/science/feeding-behaviour" class="md-crosslink" data-show-preview="true">feeding behaviour</a> have been observed in some owls.  The <span id="ref115755"></span><a href="https://www.britannica.com/animal/elf-owl" class="md-crosslink" data-show-preview="true">elf owl</a> (<em>Micrathene whitneyi</em>), for instance, has been seen hovering before blossoms, where it scares insects into flight with its wings and then catches them with its beak.  A <span id="ref115756"></span><a href="https://www.britannica.com/animal/bay-owl" class="md-crosslink" data-show-preview="true">bay owl</a> (<em>Phodilus badius</em>) has been documented stationing itself within a cave to catch bats as they issued forth at dusk. ', 'In general the type of prey taken is <a class="md-dictionary-linkem>Ketupa</em> and <em>Scotopelia</em>) are adapted for taking live <a href="https://www.britannica.com/animal/fish" class="md-crosslink autoxref" data-show-preview="true">fish</a> but also eat other animals. ', 'All owls have the same general appearance, which is characterized by a flat face with a small hooked <a href="https://www.britannica.com/science/beak" class="md-crosslink autoxref" data-show-preview="true">beak</a> and large, forward-facing eyes.  The tail is short and the wings are rounded.  Like the diurnal <a href="https://www.britannica.com/animal/bird-of-prey" class="md-crosslink" data-show-preview="true">birds of prey</a> (order <a href="https://www.britannica.com/animal/falconiform" class="md-crosslink" data-show-preview="true">Falconiformes</a>), they have large feet with sharp talons. ']
  statement = Owls fly by using sustained flight over grassland to catch rodents, dropping into the grass from sustained flight. Woodland owls secure prey by dropping from perches at the edges of forest openings. Additionally, some owls like the Southeast Asian hawk owl sally from a perch to take flying insects, while the whiskered owl takes flying insects in foliage.
  ret = 1.0
  meta =
    {'reasons': 'STATEMENT 0:\n'
                'Criteria: Owls fly by using sustained flight over grassland to '
                'catch rodents, dropping into the grass from sustained flight.\n'
                'Supporting Evidence: The source mentions that owls that hunt over '
                'grassland, such as the barn owl and short-eared owl, hunt by '
                'sustained flight, dropping into the grass to catch rodents.\n'
                'Score: 10\n'
                'STATEMENT 1:\n'
                'Criteria: Woodland owls secure prey by dropping from perches at '
                'the edges of forest openings.\n'
                'Supporting Evidence: The source mentions that many woodland owls '
                'secure prey by dropping from perches at the edges of forest '
                'openings.\n'
                'Score: 10\n'
                'STATEMENT 2:\n'
                'Criteria: Additionally, some owls like the Southeast Asian hawk '
                'owl sally from a perch to take flying insects,\n'
                'while the whiskered owl takes flying insects in foliage.\n'
                'Supporting Evidence: The Southeast Asian hawk owl (Ninox '
                'scutulata) is mentioned in the source as sallying from a perch to '
                'take flying insects.\n'
                'Score: 10\n'
                '\n'
                'The whiskered owl (Otus trichopsis) is mentioned in the source as '
                'taking flying insects in foliage.\n'
                'Score: 10\n'}

These are the logs I get when it passes. It only passes sometimes, when I freshly run my code.

Relevant Logs/Tracebacks

2024-06-25 17:35:30.837 DEBUG    Looking via __find_tracker; found <frame at 0x0000020B9A873300, file 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\trulens_eval\\feedback\\provider\\endpoint\\base.py', line 574, code _track_costs>    
2024-06-25 17:35:30.838 DEBUG    Handling callback_class: <class 'trulens_eval.feedback.provider.endpoint.openai.OpenAICallback'>.      
2024-06-25 17:35:30.838 DEBUG    Handling endpoint openai.
2024-06-25 17:35:30.838 DEBUG    Handling openai instrumented call to func: <function Completions.create at 0x0000020B81E470A0>,        
        bindings: <BoundArguments (self=<openai.resources.chat.completions.Completions object at 0x0000020B8A55D8A0>, messages=[{'role': 'system', 'content': 'You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.\n        Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\n        A few additional scoring guidelines:\n\n        - Long CONTEXTS should score equally well as short CONTEXTS.\n\n        - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.\n\n        - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.\n\n        - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n        - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n        - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n        - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n        - Never elaborate.'}, {'role': 'user', 'content': 'QUESTION: how do owls fly\n\n        CONTEXT: The <span id="ref115753"></span><a href="https://www.britannica.com/animal/Oriental-hawk-owl" class="md-crosslink">Southeast Asian hawk owl</a> (<em>Ninox scutulata</em>) sallies from a perch to take flying insects.  The <span id="ref115754"></span>whiskered owl (<em>Otus trichopsis</em>) takes flying insects in foliage.  <a href="https://www.britannica.com/animal/fish-owl" class="md-crosslink" data-show-preview="true">Fish owl</a>s (<em>Ketupa</em> and <em>Scotopelia</em>) are adapted for taking live <a href="https://www.britannica.com/animal/fish" class="md-crosslink autoxref" data-show-preview="true">fish</a> but also eat other animals. \n        \n        \nPlease answer using the entire template below.\n\nTEMPLATE: \nScore: <The score 0-10 based on the given criteria>\nCriteria: <Provide the criteria for this evaluation>\nSupporting Evidence: <Provide your reasons for scoring based on the listed criteria step by step. Tie it back to the evaluation being completed.>\n '}], model='gpt-3.5-turbo', seed=123, temperature=0.0)>,
        response: ChatCompletion(id='chatcmpl-9dwsD6PGKTATXLSAkvtsV6tjC5QUm', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='TEMPLATE: \nScore: 2\nCriteria: The context provides information about different species of owls and their hunting habits, but does not directly address how owls fly.\nSupporting Evidence: The context mentions Southeast Asian hawk owls, whiskered owls, and fish owls, detailing their hunting behaviors and prey preferences. While this information is interesting and relevant to owls, it does not directly answer the question of how owls fly. Therefore, the relevance score is low.', role='assistant', function_call=None, tool_calls=None))], created=1719308129, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=98, prompt_tokens=538, total_tokens=636))
2024-06-25 17:35:30.841 DEBUG    load_ssl_context verify=True cert=None trust_env=True http2=False
2024-06-25 17:35:30.843 DEBUG    load_verify_locations cafile='C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\certifi\\cacert.pem'
2024-06-25 17:35:30.852 DEBUG    module <module 'openai.resources' from 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\openai\\resources\\__init__.py'> already instrumented for create
2024-06-25 17:35:30.853 DEBUG    module <module 'openai.resources.chat' from 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\openai\\resources\\chat\\__init__.py'> already instrumented for create
2024-06-25 17:35:30.861 DEBUG    no frames found
2024-06-25 17:35:30.863 DEBUG    Calling instrumented method <function Completions.create at 0x0000020B81E470A0> of type <class 'function'>, iscoroutinefunction=False, isasyncgeneratorfunction=False
2024-06-25 17:35:30.867 DEBUG    Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'system', 'content': 'You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.\n        Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\n        A few additional scoring guidelines:\n\n        - Long CONTEXTS should score equally well as short CONTEXTS.\n\n        - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.\n\n        - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.\n\n        - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n        - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n        - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n        - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n        - Never elaborate.'}, {'role': 'user', 'content': 'QUESTION: how do owls fly\n\n        CONTEXT: All owls have the same general appearance, which is characterized by a flat face with a small hooked <a href="https://www.britannica.com/science/beak" class="md-crosslink autoxref" data-show-preview="true">beak</a> and large, forward-facing eyes.  The tail is short and the wings are rounded.  Like the diurnal <a href="https://www.britannica.com/animal/bird-of-prey" class="md-crosslink" data-show-preview="true">birds of prey</a> (order <a href="https://www.britannica.com/animal/falconiform" class="md-crosslink" data-show-preview="true">Falconiformes</a>), they have large feet with sharp talons. \n        \n        \nPlease answer using the entire template below.\n\nTEMPLATE: \nScore: <The score 0-10 based on the given criteria>\nCriteria: <Provide the criteria for this evaluation>\nSupporting Evidence: <Provide your reasons for scoring based on the listed criteria step by step. Tie it back to the evaluation being completed.>\n '}], 'model': 'gpt-3.5-turbo', 'seed': 123, 'temperature': 0.0}}
2024-06-25 17:35:30.870 DEBUG    send_request_headers.started request=<Request [b'POST']>
2024-06-25 17:35:30.871 DEBUG    send_request_headers.complete
2024-06-25 17:35:30.871 DEBUG    send_request_body.started request=<Request [b'POST']>
2024-06-25 17:35:30.872 DEBUG    send_request_body.complete
2024-06-25 17:35:30.872 DEBUG    receive_response_headers.started request=<Request [b'POST']>
2024-06-25 17:35:31.751 DEBUG    receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 25 Jun 2024 09:35:32 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'openai-organization', b'dirawong'), (b'openai-processing-ms', b'545'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=31536000; includeSubDomains'), (b'x-ratelimit-limit-requests', b'10000'), (b'x-ratelimit-limit-tokens', b'60000'), (b'x-ratelimit-remaining-requests', b'9991'), (b'x-ratelimit-remaining-tokens', b'59472'), (b'x-ratelimit-reset-requests', b'1m14.515s'), (b'x-ratelimit-reset-tokens', b'528ms'), (b'x-request-id', b'c0541fcfa087338b1eef3e307ea07e11'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'8993f7cc2c3a045c-HKG'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
2024-06-25 17:35:31.755 INFO     HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-25 17:35:31.755 DEBUG    receive_response_body.started request=<Request [b'POST']>
2024-06-25 17:35:31.756 DEBUG    receive_response_body.complete
2024-06-25 17:35:31.757 DEBUG    response_closed.started
2024-06-25 17:35:31.757 DEBUG    response_closed.complete
2024-06-25 17:35:31.757 DEBUG    HTTP Request: POST https://api.openai.com/v1/chat/completions "200 OK"
2024-06-25 17:35:31.763 DEBUG    Looking via __find_tracker; found <frame at 0x0000020B8CC49F60, file 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\trulens_eval\\feedback\\provider\\endpoint\\base.py', line 574, code _track_costs>    
2024-06-25 17:35:31.764 DEBUG    Handling callback_class: <class 'trulens_eval.feedback.provider.endpoint.openai.OpenAICallback'>.      
2024-06-25 17:35:31.764 DEBUG    Handling endpoint openai.
2024-06-25 17:35:31.764 DEBUG    Handling openai instrumented call to func: <function Completions.create at 0x0000020B81E470A0>,        
        bindings: <BoundArguments (self=<openai.resources.chat.completions.Completions object at 0x0000020B8A55D8A0>, messages=[{'role': 'system', 'content': 'You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.\n        Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\n        A few additional scoring guidelines:\n\n        - Long CONTEXTS should score equally well as short CONTEXTS.\n\n        - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.\n\n        - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.\n\n        - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n        - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n        - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n        - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n        - Never elaborate.'}, {'role': 'user', 'content': 'QUESTION: how do owls fly\n\n        CONTEXT: All owls have the same general appearance, which is characterized by a flat face with a small hooked <a href="https://www.britannica.com/science/beak" class="md-crosslink autoxref" data-show-preview="true">beak</a> and large, forward-facing eyes.  The tail is short and the wings are rounded.  Like the diurnal <a href="https://www.britannica.com/animal/bird-of-prey" class="md-crosslink" data-show-preview="true">birds of prey</a> (order <a href="https://www.britannica.com/animal/falconiform" class="md-crosslink" data-show-preview="true">Falconiformes</a>), they have large feet with sharp talons. \n        \n        \nPlease answer using the entire template below.\n\nTEMPLATE: \nScore: <The score 0-10 based on the given criteria>\nCriteria: <Provide the criteria for this evaluation>\nSupporting Evidence: <Provide your reasons for scoring based on the listed criteria step by step. Tie it back to the evaluation being completed.>\n '}], model='gpt-3.5-turbo', seed=123, temperature=0.0)>,
        response: ChatCompletion(id='chatcmpl-9dwsFaBFGQ9Re1ED40tOS8t1faoaz', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='TEMPLATE: \nScore: \nCriteria: Relevance of the context to the question asked.\nSupporting Evidence: ', role='assistant', function_call=None, tool_calls=None))], created=1719308131, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=23, prompt_tokens=501, total_tokens=524))
2024-06-25 17:35:31.767 WARNING  Feedback Function exception caught: Traceback (most recent call last):
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\feedback.py", line 865, in run   
    result_and_meta, part_cost = mod_base_endpoint.Endpoint.track_all_costs_tally(
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\endpoint\base.py", line 496, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\endpoint\base.py", line 477, in track_all_costs
    return Endpoint._track_costs(
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\endpoint\base.py", line 574, in _track_costs
    result: T = __func(*args, **kwargs)
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\base.py", line 349, in context_relevance_with_cot_reasons
    return self.generate_score_and_reasons(
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\base.py", line 208, in generate_score_and_reasons
    score = mod_generated_utils.re_0_10_rating(line) / normalize
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\utils\generated.py", line 70, in re_0_10_rating
    raise ParseError("int or float number", s, pattern=PATTERN_INTEGER)
trulens_eval.utils.generated.ParseError: Tried to find int or float number using pattern ([+-]?[1-9][0-9]*|0) in
  Score:

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\feedback.py", line 871, in run   
    raise RuntimeError(
RuntimeError: Evaluation of Context_Relevance failed on inputs:
{'context': 'All owls have the same general appearance, which is characterized '
            'by a flat face with a small hooked.

2024-06-25 17:35:31.772 INFO     🛑 feedback result Context_Relevance FAILED feedback_result_hash_7caf50a44a0b57d2814f2dd52bd03a1e      
FeedbackDefinition(Answer_Relevance,
        selectors={'prompt': Lens().__record__.main_input, 'response': Lens().__record__.main_output},
        if_exists=None
)
Answer_Relevance (FeedbackResultStatus.DONE) = 0.8
  prompt = how do owls fly
  response = Owls fly by using sustained flight over grassland to catch rodents, dropping into the grass from the air. Woodland owls secure prey by dropping from perches at the edges of forest openings. Additionally, some owls like the Southeast Asian hawk owl sally from a perch to catch flying insects, while others like the whiskered owl take flying insects in foliage.
  ret = 0.8
  meta =
    {'reason': 'Criteria: The response provides relevant information about how '
               'owls fly, covering different hunting techniques and environments.\n'
               'Supporting Evidence: The response discusses various hunting '
               'techniques of owls, such as sustained flight over grassland, '
               'dropping from perches at the edges of forest openings, and '
               'sallying from a perch to catch flying insects. It also mentions '
               'specific owl species like the Southeast Asian hawk owl and the '
               'whiskered owl, showcasing a diverse range of hunting behaviors. '
               'This information directly addresses how owls fly and hunt, making '
               'it highly relevant to the prompt. Overall, the response is '
               'detailed and informative, earning a high score of 8.'}

FeedbackDefinition(Context_Relevance,
        selectors={'question': Lens().__record__.main_input, 'context': Lens().__record__.app.query.rets.source_nodes[:].node.text},    
        if_exists=None
)
Context_Relevance (FeedbackResultStatus.FAILED) = None

FeedbackDefinition(Groundedness,
        selectors={'source': Lens().__record__.app.query.rets.source_nodes[:].node.text.collect(), 'statement': Lens().__record__.main_output},
        if_exists=None
                'the whiskered owl take flying insects in foliage.\n'
                'Supporting Evidence: The source mentions that the Southeast Asian '
                'hawk owl sallies from a perch to take flying insects, and the '
                'whiskered owl takes flying insects in foliage.\n'
                'Score: 10\n'}

Environment:

Additional context Is there a way to check whether we're getting the context properly? Is this an issue where the context is not yet set when the feedback runs?

These are the calls that I'm getting when the feedback passes:

[{"args": {"question": "how do owls fly", "context": "Specialized forms of <span id=\"ref849947\"></span><a href=\"https://www.britannica.com/science/feeding-behaviour\" class=\"md-crosslink\" data-show-preview=\"true\">feeding behaviour</a> have been observed in some owls.  The <span id=\"ref115755\"></span><a href=\"https://www.britannica.com/animal/elf-owl\" class=\"md-crosslink\" data-show-preview=\"true\">elf owl</a> (<em>Micrathene whitneyi</em>), for instance, has been seen hovering before blossoms, where it scares insects into flight with its wings and then catches them with its beak.  A <span id=\"ref115756\"></span><a href=\"https://www.britannica.com/animal/bay-owl\" class=\"md-crosslink\" data-show-preview=\"true\">bay owl</a> (<em>Phodilus badius</em>) has been documented stationing itself within a cave to catch bats as they issued forth at dusk. "}, "ret": 0.2, "meta": {"reason": "Criteria: The context provides information on the feeding behavior of owls but does not directly address how owls fly.\nSupporting Evidence: The context discusses specialized feeding behavior observed in some owls, such as the elf owl hovering before blossoms and the bay owl stationing itself within a cave to catch bats. While this information is interesting, it does not directly address how owls fly, which is the main focus of the question. Therefore, the relevance is limited, resulting in a low score."}}, {"args": {"question": "how do owls fly", "context": "In general the type of prey taken is <a class=\"md-dictionary-link md-dictionary-tt-off eb\" data-term=\"dictated\" href=\"https://www.britannica.com/dictionary/dictated\" data-type=\"EB\">dictated</a> by the size of the owl and by the relative abundance of potential prey.  Owls that hunt over grassland, such as the barn owl and short-eared owl, hunt by sustained flight, dropping into the grass to catch rodents.  Many woodland owls secure prey by dropping from perches at the edges of forest openings. "}, "ret": 0.2, "meta": {"reason": "Criteria: The context provides some information about how owls hunt for prey, but does not directly address how owls fly.\nSupporting Evidence: The context discusses how owls hunt for prey by dropping into grass or from perches, but it does not specifically address how owls fly in terms of their flight mechanics or abilities. Therefore, the information provided is only somewhat relevant to the question of how owls fly."}}, {"args": {"question": "how do owls fly", "context": "The <span id=\"ref115753\"></span><a href=\"https://www.britannica.com/animal/Oriental-hawk-owl\" class=\"md-crosslink\">Southeast Asian hawk owl</a> (<em>Ninox scutulata</em>) sallies from a perch to take flying insects.  The <span id=\"ref115754\"></span>whiskered owl (<em>Otus trichopsis</em>) takes flying insects in foliage.  <a href=\"https://www.britannica.com/animal/fish-owl\" class=\"md-crosslink\" data-show-preview=\"true\">Fish owl</a>s (<em>Ketupa</em> and <em>Scotopelia</em>) are adapted for taking live <a href=\"https://www.britannica.com/animal/fish\" class=\"md-crosslink autoxref\" data-show-preview=\"true\">fish</a> but also eat other animals. "}, "ret": 0.2, "meta": {"reason": "Criteria: The context provides information on different types of owls and their hunting behaviors, but does not directly address how owls fly.\nSupporting Evidence: The context mentions Southeast Asian hawk owls sallying from a perch to take flying insects, whiskered owls taking flying insects in foliage, and fish owls adapted for taking live fish. 
While this information is relevant to the hunting behaviors of owls, it does not directly address how owls fly. Therefore, the context is only partially relevant to the question of how owls fly."}}, {"args": {"question": "how do owls fly", "context": "All owls have the same general appearance, which is characterized by a flat face with a small hooked <a href=\"https://www.britannica.com/science/beak\" class=\"md-crosslink autoxref\" data-show-preview=\"true\">beak</a> and large, forward-facing eyes.  The tail is short and the wings are rounded.  Like the diurnal <a href=\"https://www.britannica.com/animal/bird-of-prey\" class=\"md-crosslink\" data-show-preview=\"true\">birds of prey</a> (order <a href=\"https://www.britannica.com/animal/falconiform\" class=\"md-crosslink\" data-show-preview=\"true\">Falconiformes</a>), they have large feet with sharp talons. "}, "ret": 0.4, "meta": {"reason": "Criteria: The context provides some relevant information about the physical characteristics of owls, but does not directly address how owls fly.\nSupporting Evidence: The context describes the appearance of owls, mentioning their flat face, small hooked beak, large forward-facing eyes, short tail, rounded wings, and large feet with sharp talons. While this information gives insight into the anatomy of owls, it does not specifically explain how owls fly. Therefore, the relevance is limited to the physical characteristics of owls rather than their flight mechanism."}}]
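
As an additional sanity check, the provider function could be called directly on one of the retrieved chunks, outside the recorder, to see whether the score parsing fails there as well (a minimal sketch; it assumes the same provider object as in my code above, and the chunk text is copied from the logs):

question = "how do owls fly"
chunk = (
    "Owls that hunt over grassland, such as the barn owl and short-eared owl, "
    "hunt by sustained flight, dropping into the grass to catch rodents."
)

# Same function used in f_context_relevance; the *_with_cot_reasons variants
# return a (score, reasons) pair rather than just a float.
score, reasons = provider.context_relevance_with_cot_reasons(question, chunk)
print(score)    # a float between 0.0 and 1.0
print(reasons)  # the chain-of-thought reasons
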
kaoru-will commented 6 days ago

Please help @joshreini1, thank you.

dosubot[bot] commented 6 days ago

Hey @kaoru-will! 👋 I'm here to help you with any bugs, questions, or contributions while you wait for a human maintainer. I'm Dosu, a bot designed to assist you with all things related to the repository.

To resolve the issue with the Context Relevance feedback function providing inconsistent answers and sometimes failing with a ParseError, you should ensure that the PromptTemplate instances are correctly defined and used. Here are some steps and best practices to follow:

  1. Ensure Correct PromptTemplate Usage: Make sure the PromptTemplate is correctly instantiated and used in your code. Here is the relevant part of the code for ContextRelevance:

    class ContextRelevance(Relevance, WithPrompt):
       system_prompt: ClassVar[PromptTemplate] = PromptTemplate.from_template(
           """You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.
           Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. 
    
           A few additional scoring guidelines:
    
           - Long CONTEXTS should score equally well as short CONTEXTS.
    
           - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.
    
           - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.
    
           - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.
    
           - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.
    
           - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.
    
           - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.
    
           - Never elaborate."""
       )
       user_prompt: ClassVar[PromptTemplate] = PromptTemplate.from_template(
           """QUESTION: {question}
    
           CONTEXT: {context}
    
           RELEVANCE: """
       )
  2. Check for ParseError: The ParseError might be due to incorrect formatting or missing placeholders in the template strings. Double-check the template strings for any syntax issues or missing placeholders (see the parsing sketch after this list).

  3. Use the context_relevance Method: You can use the context_relevance method from the Provider class to generate the relevance score. Here is an example of how to use it:

    def context_relevance(
       self, question: str, context: str, temperature: float = 0.0
    ) -> float:
       """
       Uses chat completion model. A function that completes a template to
       check the relevance of the context to the question.
    
       Args:
           question (str): A question being asked.
           context (str): Context related to the question.
    
       Returns:
           float: A value between 0.0 (not relevant) and 1.0 (relevant).
       """
    
       return self.generate_score(
           system_prompt=prompts.CONTEXT_RELEVANCE_SYSTEM,
           user_prompt=str.format(
               prompts.CONTEXT_RELEVANCE_USER,
               question=question,
               context=context
           ),
           temperature=temperature
       )
  4. Example Usage: Here is an example of how to initialize and use feedback functions, including context relevance:

    import numpy as np
    from trulens_eval import Feedback
    from trulens_eval.feedback.provider import OpenAI
    
    # Initialize provider class
    provider = OpenAI()
    
    # select context to be used in feedback. the location of context is app specific.
    from trulens_eval.app import App
    
    context = App.select_context(rag_chain)
    
    # Define a groundedness feedback function
    f_groundedness = (
       Feedback(provider.groundedness_measure_with_cot_reasons)
       .on(context.collect())  # collect context chunks into a list
       .on_output()
    )
    
    # Question/answer relevance between overall question and answer.
    f_answer_relevance = Feedback(provider.relevance).on_input_output()
    
    # Question/statement relevance between question and each context chunk.
    f_context_relevance = (
       Feedback(provider.context_relevance_with_cot_reasons)
       .on_input()
       .on(context)
       .aggregate(np.mean)
    )
    
    # Instrument chain for logging with TruLens
    from trulens_eval import TruChain
    
    tru_recorder = TruChain(
       rag_chain,
       app_id='Chain1_ChatApplication',
       feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness]
    )
    
    response, tru_record = tru_recorder.with_record(
       rag_chain.invoke, "What is Task Decomposition?"
    )
    
    # Retrieve records and feedback
    rec = tru_record.get()  # use .get if only one record
    
    for feedback, feedback_result in rec.wait_for_feedback_results().items():
       print(feedback.name, feedback_result.result)

By following these steps, you should be able to resolve the inconsistencies and errors with the Context Relevance feedback function [1][2][3].
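
Regarding step 2: in the traceback you posted, the ParseError is raised because the model's completion contains a "Score:" line with no number after it, so the integer pattern shown in the error message finds nothing to match. A standalone illustration of that pattern (copied from your traceback, not the trulens_eval implementation):

import re

# Pattern copied verbatim from the ParseError message in the traceback.
PATTERN_INTEGER = r"([+-]?[1-9][0-9]*|0)"

good = "Score: 4"  # completion from the passing run
bad = "Score: "    # completion from the failing run

print(re.search(PATTERN_INTEGER, good))  # matches '4'
print(re.search(PATTERN_INTEGER, bad))   # None -> trulens_eval raises ParseError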

@joshreini1, it seems like this issue requires a deeper look and your expertise would be highly valuable here.

To continue talking to Dosu, mention @dosu.

kaoru-will commented 6 days ago

from trulens_eval.feedback.provider.openai import OpenAI
from trulens_eval.app import App
from trulens_eval import Feedback, Select, TruLlama, FeedbackMode, feedback

import numpy as np

provider = OpenAI()

These are the modules I imported and the provider I used.

kaoru-will commented 4 days ago

Hi, any updates on this?

dosubot[bot] commented 4 days ago

To address the issue with the Context Relevance feedback function providing inconsistent answers and sometimes failing with a ParseError, you can follow these steps to ensure proper setup and usage:

  1. Ensure Correct PromptTemplate Usage: Make sure the PromptTemplate is correctly instantiated and used in your code. Here is the relevant part of the code for ContextRelevance:

    class ContextRelevance(Relevance, WithPrompt):
       system_prompt: ClassVar[PromptTemplate] = PromptTemplate.from_template(
           """You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.
           Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. 
    
           A few additional scoring guidelines:
    
           - Long CONTEXTS should score equally well as short CONTEXTS.
    
           - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.
    
           - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.
    
           - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.
    
           - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.
    
           - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.
    
           - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.
    
           - Never elaborate."""
       )
       user_prompt: ClassVar[PromptTemplate] = PromptTemplate.from_template(
           """QUESTION: {question}
    
           CONTEXT: {context}
    
           RELEVANCE: """
       )
  2. Check for ParseError: The ParseError might be due to incorrect formatting or missing placeholders in the template strings. Double-check the template strings for any syntax issues or missing placeholders.

  3. Use the context_relevance Method: You can use the context_relevance method from the Provider class to generate the relevance score. Here is an example of how to use it:

    def context_relevance(
       self, question: str, context: str, temperature: float = 0.0
    ) -> float:
       """
       Uses chat completion model. A function that completes a template to
       check the relevance of the context to the question.
    
       Args:
           question (str): A question being asked.
           context (str): Context related to the question.
    
       Returns:
           float: A value between 0.0 (not relevant) and 1.0 (relevant).
       """
    
       return self.generate_score(
           system_prompt=prompts.CONTEXT_RELEVANCE_SYSTEM,
           user_prompt=str.format(
               prompts.CONTEXT_RELEVANCE_USER,
               question=question,
               context=context
           ),
           temperature=temperature
       )
  4. Example Usage: Here is an example of how to initialize and use feedback functions, including context relevance:

    import numpy as np
    from trulens_eval import Feedback
    from trulens_eval.feedback.provider import OpenAI
    
    # Initialize provider class
    provider = OpenAI()
    
    # select context to be used in feedback. the location of context is app specific.
    from trulens_eval.app import App
    
    context = App.select_context(rag_chain)
    
    # Define a groundedness feedback function
    f_groundedness = (
       Feedback(provider.groundedness_measure_with_cot_reasons)
       .on(context.collect())  # collect context chunks into a list
       .on_output()
    )
    
    # Question/answer relevance between overall question and answer.
    f_answer_relevance = Feedback(provider.relevance).on_input_output()
    
    # Question/statement relevance between question and each context chunk.
    f_context_relevance = (
       Feedback(provider.context_relevance_with_cot_reasons)
       .on_input()
       .on(context)
       .aggregate(np.mean)
    )
    
    # Instrument chain for logging with TruLens
    from trulens_eval import TruChain
    
    tru_recorder = TruChain(
       rag_chain,
       app_id='Chain1_ChatApplication',
       feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness]
    )
    
    response, tru_record = tru_recorder.with_record(
       rag_chain.invoke, "What is Task Decomposition?"
    )
    
    # Retrieve records and feedback
    rec = tru_record.get()  # use .get if only one record
    
    for feedback, feedback_result in rec.wait_for_feedback_results().items():
       print(feedback.name, feedback_result.result)

By following these steps, you should be able to resolve the inconsistencies and errors with the Context Relevance feedback function [1][2].

To continue talking to Dosu, mention @dosu.