Closed kaoru-will closed 3 weeks ago
please help @joshreini1 thank you
Hey @kaoru-will! 👋 I'm here to help you with any bugs, questions, or contributions while you wait for a human maintainer. I'm Dosu, a bot designed to assist you with all things related to the repository.
To resolve the issue with the `Context Relevance` feedback function providing inconsistent answers and sometimes failing with a `ParseError`, you should ensure that the `PromptTemplate` instances are correctly defined and used. Here are some steps and best practices to follow:

1. Ensure correct `PromptTemplate` usage

Make sure the `PromptTemplate` is correctly instantiated and used in your code. Here is the relevant part of the code for `ContextRelevance`:
```python
class ContextRelevance(Relevance, WithPrompt):
    system_prompt: ClassVar[PromptTemplate] = PromptTemplate.from_template(
        """You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.
        Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant.

        A few additional scoring guidelines:

        - Long CONTEXTS should score equally well as short CONTEXTS.
        - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.
        - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.
        - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.
        - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.
        - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.
        - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.
        - Never elaborate."""
    )

    user_prompt: ClassVar[PromptTemplate] = PromptTemplate.from_template(
        """QUESTION: {question}

        CONTEXT: {context}

        RELEVANCE: """
    )
```
2. Check for ParseError

The `ParseError` might be due to incorrect formatting or missing placeholders in the template strings. Double-check the template strings for any syntax issues or missing placeholders.
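To make the failure concrete: the score parser expects an integer after "Score:", so a model reply that leaves the score blank will fail to parse. A minimal sketch of that check, using the regex pattern shown in the error message (this is illustrative, not the library's exact code):

```python
import re

# Pattern copied from the ParseError message reported in this issue
PATTERN_INTEGER = r"([+-]?[1-9][0-9]*|0)"

def parse_score_line(line: str) -> int:
    """Return the integer score found in a line such as 'Score: 7'."""
    match = re.search(PATTERN_INTEGER, line)
    if match is None:
        raise ValueError(
            f"Tried to find int or float number using pattern {PATTERN_INTEGER} in {line!r}"
        )
    return int(match.group(1))

print(parse_score_line("Score: 7"))  # -> 7
# parse_score_line("Score: ")       # raises, mirroring the ParseError in this issue
```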
3. Use the context_relevance method

You can use the `context_relevance` method from the `Provider` class to generate the relevance score. Here is its implementation:
```python
def context_relevance(
    self, question: str, context: str, temperature: float = 0.0
) -> float:
    """
    Uses chat completion model. A function that completes a template to
    check the relevance of the context to the question.

    Args:
        question (str): A question being asked.
        context (str): Context related to the question.

    Returns:
        float: A value between 0.0 (not relevant) and 1.0 (relevant).
    """
    return self.generate_score(
        system_prompt=prompts.CONTEXT_RELEVANCE_SYSTEM,
        user_prompt=str.format(
            prompts.CONTEXT_RELEVANCE_USER,
            question=question,
            context=context
        ),
        temperature=temperature
    )
```
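For a quick sanity check, you can also call this method directly on the provider (a minimal sketch, assuming an OpenAI provider is configured and an API key is available in your environment; the question and context strings are just illustrative):

```python
from trulens_eval.feedback.provider import OpenAI

provider = OpenAI()

# Directly score one (question, context) pair with the method shown above
score = provider.context_relevance(
    question="What is Task Decomposition?",
    context="Task decomposition breaks a complex task into smaller, manageable steps.",
)
print(score)  # expected: a float between 0.0 (not relevant) and 1.0 (relevant)
```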
4. Example usage

Here is an example of how to initialize and use feedback functions, including context relevance:
```python
import numpy as np

from trulens_eval import Feedback, TruChain
from trulens_eval.app import App
from trulens_eval.feedback.provider import OpenAI

# Initialize provider class
provider = OpenAI()

# Select the context to be used in feedback; the location of the context is app specific.
context = App.select_context(rag_chain)

# Define a groundedness feedback function
f_groundedness = (
    Feedback(provider.groundedness_measure_with_cot_reasons)
    .on(context.collect())  # collect context chunks into a list
    .on_output()
)

# Question/answer relevance between overall question and answer.
f_answer_relevance = Feedback(provider.relevance).on_input_output()

# Question/statement relevance between question and each context chunk.
f_context_relevance = (
    Feedback(provider.context_relevance_with_cot_reasons)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

# Instrument the chain for logging with TruLens
tru_recorder = TruChain(
    rag_chain,
    app_id='Chain1_ChatApplication',
    feedbacks=[f_answer_relevance, f_context_relevance, f_groundedness]
)

# Run the chain and capture a record of the call
response, tru_record = tru_recorder.with_record(
    rag_chain.invoke, "What is Task Decomposition?"
)

# Wait for the feedback results on that record and print them
for feedback, feedback_result in tru_record.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)
```
By following these steps, you should be able to resolve the inconsistencies and errors with the `Context Relevance` feedback function [1][2][3].
@joshreini1, it seems like this issue requires a deeper look and your expertise would be highly valuable here.
To continue talking to Dosu, mention @dosu.
```python
from trulens_eval.feedback.provider.openai import OpenAI
from trulens_eval.app import App
from trulens_eval import Feedback, Select, TruLlama, FeedbackMode, feedback
import numpy as np

provider = OpenAI()
```

These are the modules and the provider that I used.
hi any updates on this?
Hi any updates on this? Could really use some help....
I already changed my query engine setup to:

```python
from trulens_eval.feedback.provider.openai import OpenAI
from trulens_eval.app import App
from trulens_eval import Feedback, Select, TruLlama, FeedbackMode, feedback
import numpy as np

provider = OpenAI()

def get_prebuilt_trulens_recorder(query_engine, app_id):
    context = App.select_context(query_engine)

    f_answer_relevance = Feedback(
        provider.relevance_with_cot_reasons, name="Answer_Relevance"
    ).on_input().on_output()

    # Define a groundedness feedback function
    f_groundedness = (
        Feedback(provider.groundedness_measure_with_cot_reasons, name="Groundedness")
        .on(context.collect())
        .on_output()
    )

    # Question/statement relevance between question and each context chunk.
    f_context_relevance = (
        Feedback(provider.qs_relevance_with_cot_reasons, name="Context_Relevance")
        .on_input()
        .on(context)
        .aggregate(np.mean)
    )

    f_context_relevance(
        "what is the capital of France?",
        "France is a unitary, semi-presidential republic with its capital in Paris, it's largest city and main cultural and commercial center.",
    )
    # print(f_context_relevance)

    feedbacks = [f_answer_relevance, f_context_relevance, f_groundedness]
    tru_recorder = TruLlama(
        query_engine,
        app_id=app_id,
        feedbacks=feedbacks,
    )
    return tru_recorder
```
but the Context Relevance feedback is still giving inconsistent answers and failing randomly. Sometimes it gets the correct context, sometimes it doesn't. Can you help me understand how to go about this? Or would it be possible to create my own customized PromptTemplate?
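Something like this is what I have in mind, just my own untested sketch that reuses generate_score from the provider (the class, method, and prompt names here are mine, not a TruLens API):

```python
from trulens_eval.feedback.provider.openai import OpenAI

# My own prompt strings (hypothetical), trying to force a non-empty integer score
CUSTOM_SYSTEM = (
    "You are a RELEVANCE grader. Respond with ONLY a line of the form "
    "'Score: <integer 0-10>'. Never leave the score blank."
)
CUSTOM_USER = "QUESTION: {question}\n\nCONTEXT: {context}\n\nScore: "

class MyOpenAI(OpenAI):
    def my_context_relevance(self, question: str, context: str) -> float:
        # generate_score is the same helper that context_relevance uses internally
        return self.generate_score(
            system_prompt=CUSTOM_SYSTEM,
            user_prompt=CUSTOM_USER.format(question=question, context=context),
        )
```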
Hi! Sorry for the delay on this. What is the failure rate? 1/2, 1/10, 1/100?
Can you also share an example of the call when it fails?
It may also help to try a more powerful model such as gpt-4o to see if that reduces the error rate. It may be that some of the text in the context you are selecting is confusing to the Judge LLM.
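For example (a sketch; the provider argument name may differ slightly by trulens_eval version, so check the docs for your installed release):

```python
from trulens_eval.feedback.provider.openai import OpenAI

# Use a stronger judge model than the default gpt-3.5-turbo
provider = OpenAI(model_engine="gpt-4o")
```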
Hi @sfc-gh-jreini
2024-06-25 17:35:30.837 DEBUG Looking via __find_tracker; found <frame at 0x0000020B9A873300, file 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\trulens_eval\\feedback\\provider\\endpoint\\base.py', line 574, code _track_costs>
2024-06-25 17:35:30.838 DEBUG Handling callback_class: <class 'trulens_eval.feedback.provider.endpoint.openai.OpenAICallback'>.
2024-06-25 17:35:30.838 DEBUG Handling endpoint openai.
2024-06-25 17:35:30.838 DEBUG Handling openai instrumented call to func: <function Completions.create at 0x0000020B81E470A0>,
bindings: <BoundArguments (self=<openai.resources.chat.completions.Completions object at 0x0000020B8A55D8A0>, messages=[{'role': 'system', 'content': 'You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.\n Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\n A few additional scoring guidelines:\n\n - Long CONTEXTS should score equally well as short CONTEXTS.\n\n - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.\n\n - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.\n\n - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n - Never elaborate.'}, {'role': 'user', 'content': 'QUESTION: how do owls fly\n\n CONTEXT: The <span id="ref115753"></span><a href="https://www.britannica.com/animal/Oriental-hawk-owl" class="md-crosslink">Southeast Asian hawk owl</a> (<em>Ninox scutulata</em>) sallies from a perch to take flying insects. The <span id="ref115754"></span>whiskered owl (<em>Otus trichopsis</em>) takes flying insects in foliage. <a href="https://www.britannica.com/animal/fish-owl" class="md-crosslink" data-show-preview="true">Fish owl</a>s (<em>Ketupa</em> and <em>Scotopelia</em>) are adapted for taking live <a href="https://www.britannica.com/animal/fish" class="md-crosslink autoxref" data-show-preview="true">fish</a> but also eat other animals. \n \n \nPlease answer using the entire template below.\n\nTEMPLATE: \nScore: <The score 0-10 based on the given criteria>\nCriteria: <Provide the criteria for this evaluation>\nSupporting Evidence: <Provide your reasons for scoring based on the listed criteria step by step. Tie it back to the evaluation being completed.>\n '}], model='gpt-3.5-turbo', seed=123, temperature=0.0)>,
response: ChatCompletion(id='chatcmpl-9dwsD6PGKTATXLSAkvtsV6tjC5QUm', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='TEMPLATE: \nScore: 2\nCriteria: The context provides information about different species of owls and their hunting habits, but does not directly address how owls fly.\nSupporting Evidence: The context mentions Southeast Asian hawk owls, whiskered owls, and fish owls, detailing their hunting behaviors and prey preferences. While this information is interesting and relevant to owls, it does not directly answer the question of how owls fly. Therefore, the relevance score is low.', role='assistant', function_call=None, tool_calls=None))], created=1719308129, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=98, prompt_tokens=538, total_tokens=636))
2024-06-25 17:35:30.841 DEBUG load_ssl_context verify=True cert=None trust_env=True http2=False
2024-06-25 17:35:30.843 DEBUG load_verify_locations cafile='C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\certifi\\cacert.pem'
2024-06-25 17:35:30.852 DEBUG module <module 'openai.resources' from 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\openai\\resources\\__init__.py'> already instrumented for create
2024-06-25 17:35:30.853 DEBUG module <module 'openai.resources.chat' from 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\openai\\resources\\chat\\__init__.py'> already instrumented for create
2024-06-25 17:35:30.861 DEBUG no frames found
2024-06-25 17:35:30.863 DEBUG Calling instrumented method <function Completions.create at 0x0000020B81E470A0> of type <class 'function'>, iscoroutinefunction=False, isasyncgeneratorfunction=False
2024-06-25 17:35:30.867 DEBUG Request options: {'method': 'post', 'url': '/chat/completions', 'files': None, 'json_data': {'messages': [{'role': 'system', 'content': 'You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.\n Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\n A few additional scoring guidelines:\n\n - Long CONTEXTS should score equally well as short CONTEXTS.\n\n - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.\n\n - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.\n\n - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n - Never elaborate.'}, {'role': 'user', 'content': 'QUESTION: how do owls fly\n\n CONTEXT: All owls have the same general appearance, which is characterized by a flat face with a small hooked <a href="https://www.britannica.com/science/beak" class="md-crosslink autoxref" data-show-preview="true">beak</a> and large, forward-facing eyes. The tail is short and the wings are rounded. Like the diurnal <a href="https://www.britannica.com/animal/bird-of-prey" class="md-crosslink" data-show-preview="true">birds of prey</a> (order <a href="https://www.britannica.com/animal/falconiform" class="md-crosslink" data-show-preview="true">Falconiformes</a>), they have large feet with sharp talons. \n \n \nPlease answer using the entire template below.\n\nTEMPLATE: \nScore: <The score 0-10 based on the given criteria>\nCriteria: <Provide the criteria for this evaluation>\nSupporting Evidence: <Provide your reasons for scoring based on the listed criteria step by step. Tie it back to the evaluation being completed.>\n '}], 'model': 'gpt-3.5-turbo', 'seed': 123, 'temperature': 0.0}}
2024-06-25 17:35:30.870 DEBUG send_request_headers.started request=<Request [b'POST']>
2024-06-25 17:35:30.871 DEBUG send_request_headers.complete
2024-06-25 17:35:30.871 DEBUG send_request_body.started request=<Request [b'POST']>
2024-06-25 17:35:30.872 DEBUG send_request_body.complete
2024-06-25 17:35:30.872 DEBUG receive_response_headers.started request=<Request [b'POST']>
2024-06-25 17:35:31.751 DEBUG receive_response_headers.complete return_value=(b'HTTP/1.1', 200, b'OK', [(b'Date', b'Tue, 25 Jun 2024 09:35:32 GMT'), (b'Content-Type', b'application/json'), (b'Transfer-Encoding', b'chunked'), (b'Connection', b'keep-alive'), (b'openai-organization', b'dirawong'), (b'openai-processing-ms', b'545'), (b'openai-version', b'2020-10-01'), (b'strict-transport-security', b'max-age=31536000; includeSubDomains'), (b'x-ratelimit-limit-requests', b'10000'), (b'x-ratelimit-limit-tokens', b'60000'), (b'x-ratelimit-remaining-requests', b'9991'), (b'x-ratelimit-remaining-tokens', b'59472'), (b'x-ratelimit-reset-requests', b'1m14.515s'), (b'x-ratelimit-reset-tokens', b'528ms'), (b'x-request-id', b'c0541fcfa087338b1eef3e307ea07e11'), (b'CF-Cache-Status', b'DYNAMIC'), (b'Server', b'cloudflare'), (b'CF-RAY', b'8993f7cc2c3a045c-HKG'), (b'Content-Encoding', b'gzip'), (b'alt-svc', b'h3=":443"; ma=86400')])
2024-06-25 17:35:31.755 INFO HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
2024-06-25 17:35:31.755 DEBUG receive_response_body.started request=<Request [b'POST']>
2024-06-25 17:35:31.756 DEBUG receive_response_body.complete
2024-06-25 17:35:31.757 DEBUG response_closed.started
2024-06-25 17:35:31.757 DEBUG response_closed.complete
2024-06-25 17:35:31.757 DEBUG HTTP Request: POST https://api.openai.com/v1/chat/completions "200 OK"
2024-06-25 17:35:31.763 DEBUG Looking via __find_tracker; found <frame at 0x0000020B8CC49F60, file 'C:\\Users\\admin\\AppData\\Local\\Programs\\Python\\Python310\\lib\\site-packages\\trulens_eval\\feedback\\provider\\endpoint\\base.py', line 574, code _track_costs>
2024-06-25 17:35:31.764 DEBUG Handling callback_class: <class 'trulens_eval.feedback.provider.endpoint.openai.OpenAICallback'>.
2024-06-25 17:35:31.764 DEBUG Handling endpoint openai.
2024-06-25 17:35:31.764 DEBUG Handling openai instrumented call to func: <function Completions.create at 0x0000020B81E470A0>,
bindings: <BoundArguments (self=<openai.resources.chat.completions.Completions object at 0x0000020B8A55D8A0>, messages=[{'role': 'system', 'content': 'You are a RELEVANCE grader; providing the relevance of the given CONTEXT to the given QUESTION.\n Respond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\n A few additional scoring guidelines:\n\n - Long CONTEXTS should score equally well as short CONTEXTS.\n\n - RELEVANCE score should increase as the CONTEXTS provides more RELEVANT context to the QUESTION.\n\n - RELEVANCE score should increase as the CONTEXTS provides RELEVANT context to more parts of the QUESTION.\n\n - CONTEXT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n - CONTEXT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n - CONTEXT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n - CONTEXT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n - Never elaborate.'}, {'role': 'user', 'content': 'QUESTION: how do owls fly\n\n CONTEXT: All owls have the same general appearance, which is characterized by a flat face with a small hooked <a href="https://www.britannica.com/science/beak" class="md-crosslink autoxref" data-show-preview="true">beak</a> and large, forward-facing eyes. The tail is short and the wings are rounded. Like the diurnal <a href="https://www.britannica.com/animal/bird-of-prey" class="md-crosslink" data-show-preview="true">birds of prey</a> (order <a href="https://www.britannica.com/animal/falconiform" class="md-crosslink" data-show-preview="true">Falconiformes</a>), they have large feet with sharp talons. \n \n \nPlease answer using the entire template below.\n\nTEMPLATE: \nScore: <The score 0-10 based on the given criteria>\nCriteria: <Provide the criteria for this evaluation>\nSupporting Evidence: <Provide your reasons for scoring based on the listed criteria step by step. Tie it back to the evaluation being completed.>\n '}], model='gpt-3.5-turbo', seed=123, temperature=0.0)>,
response: ChatCompletion(id='chatcmpl-9dwsFaBFGQ9Re1ED40tOS8t1faoaz', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='TEMPLATE: \nScore: \nCriteria: Relevance of the context to the question asked.\nSupporting Evidence: ', role='assistant', function_call=None, tool_calls=None))], created=1719308131, model='gpt-3.5-turbo-0125', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=23, prompt_tokens=501, total_tokens=524))
2024-06-25 17:35:31.767 WARNING Feedback Function exception caught: Traceback (most recent call last):
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\feedback.py", line 865, in run
result_and_meta, part_cost = mod_base_endpoint.Endpoint.track_all_costs_tally(
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\endpoint\base.py", line 496, in track_all_costs_tally
result, cbs = Endpoint.track_all_costs(
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\endpoint\base.py", line 477, in track_all_costs
return Endpoint._track_costs(
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\endpoint\base.py", line 574, in _track_costs
result: T = __func(*args, **kwargs)
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\base.py", line 349, in context_relevance_with_cot_reasons
return self.generate_score_and_reasons(
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\provider\base.py", line 208, in generate_score_and_reasons
score = mod_generated_utils.re_0_10_rating(line) / normalize
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\utils\generated.py", line 70, in re_0_10_rating
raise ParseError("int or float number", s, pattern=PATTERN_INTEGER)
trulens_eval.utils.generated.ParseError: Tried to find int or float number using pattern ([+-]?[1-9][0-9]*|0) in
Score:
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\trulens_eval\feedback\feedback.py", line 871, in run
raise RuntimeError(
RuntimeError: Evaluation of Context_Relevance failed on inputs:
{'context': 'All owls have the same general appearance, which is characterized '
'by a flat face with a small hooked.
These are the logs I'm getting whenever context_relevance fails; please see the other logs above to compare. The failure rate is roughly 3/5: sometimes (about 2/5 of the time) it works and gives a score, but most of the time it runs into "trulens_eval.utils.generated.ParseError: Tried to find int or float number using pattern ([+-]?[1-9][0-9]*|0) in Score:".
Is the France test from your code representative of your problem, i.e. failing 3/5 times?
f_context_relevance("what is the capital of France?","France is a unitary, semi-presidential republic with its capital in Paris, it's largest city and main cultural and commercial center.")
Closing this for now. @kaoru-will pls reopen if you need help
Bug Description

The Context Relevance feedback function I am trying to use provides inconsistent answers. Sometimes it gives a successful feedback result, but most of the time it gives a failed one.

To Reproduce

My code:

Expected behavior

I have taken note that feedbacks don't immediately provide their respective results, which is why I used wait_for_feedback_results so I can iterate over each feedback result and get the results I expect.
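Roughly like this (my own sketch of the pattern; tru_recorder comes from the get_prebuilt_trulens_recorder helper in my code and query_engine is my LlamaIndex query engine):

```python
# Record one query so the feedback functions run against it
with tru_recorder as recording:
    response = query_engine.query("how do owls fly")

rec = recording.get()  # the single record captured in this recording

# Block until the feedback results for this record are computed, then print them
for feedback, feedback_result in rec.wait_for_feedback_results().items():
    print(feedback.name, feedback_result.result)
```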
These are the logs that I get if it passes. It would only pass sometimes when I freshly run my code.

Relevant Logs/Tracebacks

Environment:

Additional context

Is there a way to check whether we're getting the context properly? Is this an issue where the context is not yet set, but the feedback runs anyway?

These are the calls I'm getting if the feedback passes.