truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

[BUG] Use Anthropic Claude 3 Opus as Feedback Provider #992

Closed aabor closed 5 months ago

aabor commented 6 months ago

Bug Description

The TruLens recorder works only with OpenAI() models. When I try the Anthropic or TogetherAI APIs, exceptions are raised.

To Reproduce

import logging
from time import sleep

import numpy as np
from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader
)

from trulens_eval import TruLlama, Feedback, Tru
from trulens_eval.app import App
from trulens_eval.feedback import Groundedness
from trulens_eval.feedback.provider.litellm import LiteLLM as fLiteLLM
from trulens_eval.feedback.provider.openai import OpenAI as fOpenAI

# load documents
documents = SimpleDirectoryReader("./data/paul_graham/").load_data()

# build index
index = VectorStoreIndex.from_documents(documents=documents)

# tru = Tru('sqlite:///default.sqlite')
tru = Tru()

# Feedback provider: each of the following was tried in turn.
# The logs below correspond to the Claude 3 Opus run.
# llm = fOpenAI()
llm = fLiteLLM(model_engine="claude-3-opus-20240229")
# llm = fLiteLLM(model_engine="together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1")

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[],
)
context = App.select_context(query_engine)

# grounded = Groundedness(groundedness_provider=fOpenAI())
grounded = Groundedness(groundedness_provider=llm)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons)
    .on(context.collect())  # collect context chunks into a list
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

f_qa_relevance = Feedback(llm.relevance).on_input_output()

# Question/statement relevance between question and each context chunk.
f_qs_relevance = (
    Feedback(llm.qs_relevance)
    .on_input()
    .on(context)
    .aggregate(np.mean)
)

logging.info("Creating TruLens Recorder")
app_id = "app_id_01"

tru_rag = TruLlama(
    query_engine,
    app_id=app_id,
    feedbacks=[f_groundedness, f_qa_relevance, f_qs_relevance]
)

logging.info("Computing RAG triad metrics")
for questions in [["How do you do?"]]:  # outer list holds batches of questions
    for idx, question in enumerate(questions):
        with tru_rag as recording:
            logging.info(f"Question: {question}")
            response = query_engine.query(question)
            logging.info(f"response {response}")
            msg = "\n".join([
                f"Question {idx}: %s" % question,
                "Answer: %s" % response
            ])
            logging.info(msg)
            sleep(0.1)
            # The record of the app invocation can be retrieved from the `recording`:
            # logging.info(f"recording.records %s" % recording.records)

Expected behavior

The recorder should log the questions and answers into the SQLite database together with the calculated metrics. Instead, it logs only the questions and answers.

Relevant Logs/Tracebacks

2024-03-11 17:37:38,066:INFO:Computing RAG triad metrics
2024-03-11 17:37:52,821:INFO:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2024-03-11 17:37:53,036:INFO:Context impl SQLiteImpl.
2024-03-11 17:37:53,037:INFO:Will assume non-transactional DDL.
2024-03-11 17:37:53,169:INFO:✅ added record record_hash_beef086a4fb01e5aaa1a58c895e581eb
2024-03-11 17:37:53,175:INFO:Context impl SQLiteImpl.
2024-03-11 17:37:53,175:INFO:Will assume non-transactional DDL.
17:37:53 - LiteLLM:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
2024-03-11 17:37:53,370:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
17:37:53 - LiteLLM:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
2024-03-11 17:37:53,473:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
17:37:53 - LiteLLM:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
2024-03-11 17:37:53,496:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.
2024-03-11 17:37:53,727:ERROR:litellm request failed <class 'litellm.exceptions.BadRequestError'>=AnthropicException - {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-opus-20240229\" is not supported on this API. Please use the Messages API instead."}}. Retries remaining=3.
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.
2024-03-11 17:37:53,751:ERROR:litellm request failed <class 'litellm.exceptions.BadRequestError'>=AnthropicException - {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-opus-20240229\" is not supported on this API. Please use the Messages API instead."}}. Retries remaining=3.
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.
2024-03-11 17:37:53,896:ERROR:litellm request failed <class 'litellm.exceptions.BadRequestError'>=AnthropicException - {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-opus-20240229\" is not supported on this API. Please use the Messages API instead."}}. Retries remaining=3.
17:37:55 - LiteLLM:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
2024-03-11 17:37:55,733:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
17:37:55 - LiteLLM:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
2024-03-11 17:37:55,755:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
17:37:55 - LiteLLM:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
2024-03-11 17:37:55,901:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.
2024-03-11 17:37:56,111:ERROR:litellm request failed <class 'litellm.exceptions.BadRequestError'>=AnthropicException - {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-opus-20240229\" is not supported on this API. Please use the Messages API instead."}}. Retries remaining=2.
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.
2024-03-11 17:37:56,177:ERROR:litellm request failed <class 'litellm.exceptions.BadRequestError'>=AnthropicException - {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-opus-20240229\" is not supported on this API. Please use the Messages API instead."}}. Retries remaining=2.
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.
2024-03-11 17:37:56,900:ERROR:litellm request failed <class 'litellm.exceptions.BadRequestError'>=AnthropicException - {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-opus-20240229\" is not supported on this API. Please use the Messages API instead."}}. Retries remaining=2.
17:38:00 - LiteLLM:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
2024-03-11 17:38:00,117:INFO: POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -
2024-03-11 17:38:12,760:INFO:Context impl SQLiteImpl.
2024-03-11 17:38:12,761:INFO:Will assume non-transactional DDL.
2024-03-11 17:38:12,767:INFO:🛑 feedback result qs_relevance FAILED feedback_result_hash_86510e3dc727b3009952db519baa9741

Environment:

Hardware Overview:
Model Name: MacBook Pro
Model Identifier: Mac14,9
Model Number: MPHE3LL/A
Chip: Apple M2 Pro
Total Number of Cores: 10 (6 performance and 4 efficiency)
Memory: 16 GB
System Firmware Version: 10151.81.1
OS Loader Version: 10151.81.1

Python 3.11.6 (v3.11.6:8b6ee5ba3b, Oct 2 2023, 11:18:21) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin

faiss_cpu==1.8.0
FlagEmbedding==1.2.5
langchain_community==0.0.24
langchain_core==0.1.28
langchain-together==0.0.2.post1
langchain-anthropic==0.1.4
litellm==1.23.16
trulens_eval==0.25.0
llama_index==0.10.16
llama-index-core==0.10.16
llama-index-llms-litellm==0.1.3
llama-index-llms-anthropic==0.1.5
llama-index-embeddings-huggingface==0.1.4
llama-index-vector-stores-faiss==0.1.2
llama-index-llms-together==0.1.3
llama-index-embeddings-together==0.1.2
llama-index-embeddings-voyageai==0.1.3
llama-index-postprocessor-cohere-rerank==0.1.2
sentence_transformers==2.3.1
nest_asyncio==1.6.0
numpy==1.26.4
openai==1.13.3
torch==2.2.1
torchvision==0.17.1
torchaudio==2.2.1
pypdf==4.0.2
python-dotenv==1.0.1
tiktoken==0.5.2
watchdog==4.0.0
voyageai==0.1.7

dosubot[bot] commented 6 months ago

Hey @aabor, nice to see you back here! Hope you're doing great.

Based on the error message you're receiving, it seems like the model "claude-3-opus-20240229" is not supported by the API you're using. The error message suggests using the Messages API instead.

The litellm package is trying to use the model "claude-3-opus-20240229" with the API endpoint https://api.anthropic.com/v1/complete, but it appears that this model is not supported on this endpoint.

You might want to check if the model "claude-3-opus-20240229" is indeed supported by the Anthropic API and if it's compatible with the endpoint you're using. If the model is not supported, you might need to use a different model. If the model is supported but not on the endpoint you're using, you might need to switch to the suggested Messages API.
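For illustration, a minimal sketch of calling Claude 3 Opus through the Messages API with the official anthropic SDK (assuming ANTHROPIC_API_KEY is set in the environment) might look like this:

import anthropic

# The Messages API (POST /v1/messages) is the endpoint that supports Claude 3 models,
# unlike the legacy text-completions endpoint (/v1/complete) shown in the logs above.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
message = client.messages.create(
    model="claude-3-opus-20240229",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hey! How's it going?"}],
)
print(message.content[0].text)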

Regarding the TruLens recorder issue, it seems like the recorder is designed to work with OpenAI models. If you're trying to use it with models from Anthropic or TogetherAI APIs, you might encounter exceptions.

You might want to check if there's a way to make the TruLens recorder compatible with models from other providers. This could involve modifying the recorder's code or using some sort of adapter or wrapper to make the models from other providers compatible with the recorder.

If you're still encountering issues, could you provide more details about how you're using the trulens_eval and litellm packages, and how the model "claude-3-opus-20240229" is specified in your code? This could help in identifying the cause of the issue and suggesting a more specific solution.

piotrm0 commented 6 months ago

Hi; is the issue coming from litellm? Can you check whether litellm, without trulens, operates as expected on the models you want to run, and whether litellm's cost tracking works on them?
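For instance, a minimal standalone check along these lines (a sketch using LiteLLM's completion and completion_cost helpers) would cover both:

from litellm import completion, completion_cost

# Call the model directly through LiteLLM, bypassing TruLens entirely.
response = completion(
    model="claude-3-opus-20240229",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)

# Check that LiteLLM can also compute a cost for this response.
print("cost (USD):", completion_cost(completion_response=response))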

aabor commented 6 months ago

LiteLLM works on its own. For example, the code below gave the expected output:


from litellm import completion

messages = [{"role": "user", "content": "Write me a poem about the blue sky"}]

response = completion(model="together_ai/meta-llama/Llama-2-70b-chat-hf", messages=messages)

print(response)

# ModelResponse(id='863850c3ba2befb4-PDX', choices=[Choices(finish_reason='eos', index=0, message=Message(content="  Sure! Here's a short poem about the blue sky:\n\nThe blue sky stretches far and wide,\nA canvas painted by the sun's bright side.\nDotted with clouds, so fluffy and white,\nIt's a sight that's simply out of sight.\n\nThe blue sky brings us joy and cheer,\nA symbol of hope, and a promise clear.\nIt's a reminder of the beauty above,\nA reflection of the love that we have.\n\nSo let's bask in the blue sky's embrace,\nAnd let our spirits soar with grace.\nFor in its depths, we'll find our peace,\nA sense of calm, a world to cease.\n\nThe blue sky's magic is pure and true,\nA treasure that's waiting for me and you.\nSo let's gaze up, and let our hearts sing,\nIn the blue sky's splendor, our spirits will take wing.", role='assistant'))], created=1710294024, model='meta-llama/Llama-2-70b-chat-hf', object='chat.completion', system_fingerprint=None, usage=Usage(completion_tokens=220, prompt_tokens=17, total_tokens=237))
joshreini1 commented 6 months ago

@aabor - it looks like the model you’re trying to use with LiteLLM/TruLens is Claude-3-opus. Can you confirm that this model works with LiteLLM? Your test above shows only Llama-2.

aabor commented 6 months ago
from litellm import completion

messages = [{"role": "user", "content": "Hey! how's it going?"}]
response = completion(model="claude-3-opus-20240229", messages=messages)
print(response)

07:45:17 - LiteLLM:INFO:

POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' -H -d '{'model': 'claude-3-opus-20240229', 'prompt': "\n\nHuman: Hey! how's it going?\n\nAssistant: ", 'max_tokens_to_sample': 256}'

2024-03-13 07:45:17,915:INFO:

POST Request Sent from LiteLLM: curl -X POST \ https://api.anthropic.com/v1/complete \ -H 'accept: application/json' -H 'anthropic-version: 2023-06-01' -H 'content-type: application/json' - -d '{'model': 'claude-3-opus-20240229', 'prompt': "\n\nHuman: Hey! how's it going?\n\nAssistant: ", 'max_tokens_to_sample': 256}'

Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new
LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.

Traceback (most recent call last):
  File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/litellm/main.py", line 1020, in completion
    response = anthropic.completion(
               ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/litellm/llms/anthropic.py", line 170, in completion
    raise AnthropicError(
litellm.llms.anthropic.AnthropicError: {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-opus-20240229\" is not supported on this API. Please use the Messages API instead."}}

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/litellm/utils.py", line 2481, in wrapper
    raise e
  File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/litellm/utils.py", line 2384, in wrapper
    result = original_function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/litellm/main.py", line 1897, in completion
    raise exception_type(
          ^^^^^^^^^^^^^^^
  File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/litellm/utils.py", line 7520, in exception_type
    raise e
  File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/litellm/utils.py", line 6478, in exception_type
    raise BadRequestError(
litellm.exceptions.BadRequestError: AnthropicException - {"type":"error","error":{"type":"invalid_request_error","message":"\"claude-3-opus-20240229\" is not supported on this API. Please use the Messages API instead."}}

This model works when I access it through other packages, for example llama_index:

from llama_index.llms.anthropic import Anthropic as LlamaIndexAnthropic
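For example (a sketch, assuming ANTHROPIC_API_KEY is set in the environment):

from llama_index.llms.anthropic import Anthropic as LlamaIndexAnthropic

# Direct call through llama_index's Anthropic wrapper.
llm = LlamaIndexAnthropic(model="claude-3-opus-20240229")
print(llm.complete("Hey! how's it going?"))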
joshreini1 commented 6 months ago

Can you try upgrading your LiteLLM version? See the related LiteLLM issue: https://github.com/BerriAI/litellm/issues/2314
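For example:

pip install -U litellm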

aabor commented 6 months ago

I upgraded to the most recent LiteLLM version, 1.31.6. It works well with a direct completion() call and produces fewer errors. But when I use TruLens to evaluate performance, the following error still appears:

2024-03-13 08:14:01,351:INFO:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK"
2024-03-13 08:14:01,566:INFO:Context impl SQLiteImpl.
2024-03-13 08:14:01,566:INFO:Will assume non-transactional DDL.
2024-03-13 08:14:01,706:INFO:✅ added record record_hash_d250bdf66726f7230a6ed74cff868ffc
2024-03-13 08:14:01,711:INFO:Context impl SQLiteImpl.
2024-03-13 08:14:01,712:INFO:Will assume non-transactional DDL.
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.
2024-03-13 08:14:02,052:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=3.
2024-03-13 08:14:02,061:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=3.
2024-03-13 08:14:02,077:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=3.
Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use `litellm.set_verbose=True`.

joshreini1 commented 6 months ago

Can you add litellm.set_verbose=True to get more info on the error?
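For example, near the top of the script, before any feedback functions run:

import litellm

# Enable LiteLLM's verbose logging so failed requests print their full payloads.
litellm.set_verbose = True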

aabor commented 6 months ago

OK, I set litellm.set_verbose=True.

Everything else in the environment stayed the same. Note that other models, "together_ai/meta-llama/Llama-2-70b-chat-hf" and "together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1", also produced errors when I set them as the engine in the TruLens recorder. Please test these models with TruLens as well; it might be wise to add automated tests for this functionality to your CI/CD pipeline.

Here is the output:

2024-03-14 11:27:28,303:INFO:HTTP Request: POST https://api.anthropic.com/v1/messages "HTTP/1.1 200 OK" 2024-03-14 11:27:28,514:INFO:Context impl SQLiteImpl. 2024-03-14 11:27:28,514:INFO:Will assume non-transactional DDL. 2024-03-14 11:27:28,650:INFO:✅ added record record_hash_77e3c859b487952c27fc1654c61b3930 2024-03-14 11:27:28,654:INFO:Context impl SQLiteImpl. 2024-03-14 11:27:28,654:INFO:Will assume non-transactional DDL. Request to litellm: litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a INFORMATION OVERLAP classifier providing the overlap of information between a SOURCE and STATEMENT.\nFor every sentence in the statement, please answer with this template:\n\nTEMPLATE: \nStatement Sentence: , \nSupporting Evidence: <Choose the exact unchanged sentences in the source that can answer the statement, if nothing matches, say NOTHING FOUND>\nScore: <Output a number between 0-10 where 0 is no information overlap and 10 is all information is overlapping>\nGive me the INFORMATION OVERLAP of this SOURCE and STATEMENT.\n\nSOURCE:]) Request to litellm: litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.\nRespond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\nA few additional scoring guidelines:\n\n- Long RESPONSES should score equally well as short RESPONSES.\n\n- Answers that intentionally do not answer the question, such as 'I don't know' and model refusals, should also be counted as the most RELEVANT.\n\n- RESPONSE must be relevant to the entire PROMPT to get a score of 10.\n\n- RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.\n\n- RESPONSE that is RELEVANT to none of the PROMPT should get a score of 0.\n\n- RESPONSE that is RELEVANT to some of the PROMPT should get as score of 2, 3, or 4. Higher score indicates more RELEVANCE.\n\n- RESPONSE that is RELEVANT to most of the PROMPT should get a score between a 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n- RESPONSE that is RELEVANT to the entire PROMPT should get a score of 9 or 10.\n\n- RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 10.\n\n- RESPONSE that confidently FALSE should get a score of 0.\n\n- RESPONSE that is only seemingly RELEVANT should get a score of 0.\n\n- Never elaborate.\n\nPROMPT: "}]) self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use litellm.set_verbose=True'. Logging Details: logger_fn - None | callable(logger_fn) - False self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, uselitellm.set_verbose=True'. Logging Details: logger_fn - None | callable(logger_fn) - False Request to litellm: litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a RELEVANCE grader; providing the relevance of the given STATEMENT to the given QUESTION.\nRespond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. 
\n\nA few additional scoring guidelines:\n\n- Long STATEMENTS should score equally well as short STATEMENTS.\n\n- RELEVANCE score should increase as the STATEMENT provides more RELEVANT context to the QUESTION.\n\n- RELEVANCE score should increase as the STATEMENT provides RELEVANT context to more parts of the QUESTION.\n\n- STATEMENT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n- STATEMENT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n- STATEMENT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n- Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the most relevant.\n\n- Never elaborate.\n\nQUESTION: }]) Logging Details LiteLLM-Failure Call Logging Details LiteLLM-Failure Call self.failure_callback: [] self.failure_callback: [] self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use litellm.set_verbose=True'. Logging Details: logger_fn - None | callable(logger_fn) - False Logging Details LiteLLM-Failure Call 2024-03-14 11:27:28,987:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=3. 2024-03-14 11:27:28,989:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=3. 2024-03-14 11:27:29,000:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=3. self.failure_callback: [] Request to litellm: litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.\nRespond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\nA few additional scoring guidelines:\n\n- Long RESPONSES should score equally well as short RESPONSES.\n\n- Answers that intentionally do not answer the question, such as 'I don't know' and model refusals, should also be counted as the most RELEVANT.\n\n- RESPONSE must be relevant to the entire PROMPT to get a score of 10.\n\n- RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.\n\n- RESPONSE that is RELEVANT to none of the PROMPT should get a score of 0.\n\n- RESPONSE that is RELEVANT to some of the PROMPT should get as score of 2, 3, or 4. Higher score indicates more RELEVANCE.\n\n- RESPONSE that is RELEVANT to most of the PROMPT should get a score between a 5, 6, 7 or 8. 
Higher score indicates more RELEVANCE.\n\n- RESPONSE that is RELEVANT to the entire PROMPT should get a score of 9 or 10.\n\n- RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 10.\n\n- RESPONSE that confidently FALSE should get a score of 0.\n\n- RESPONSE that is only seemingly RELEVANT should get a score of 0.\n\n- Never elaborate.\n\nPROMPT: E: "}]) self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Request to litellm: litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a INFORMATION OVERLAP classifier providing the overlap of information between a SOURCE and STATEMENT.\nFor every sentence in the statement, please answer with this template:\n\nTEMPLATE: \nStatement Sentence: <Sentence>, \nSupporting Evidence: <Choose the exact unchanged sentences in the source that can answer the statement, if nothing matches, say NOTHING FOUND>\nScore: <Output a number between 0-10 where 0 is no information overlap and 10 is all information is overlapping>\nGive me the INFORMATION OVERLAP of this SOURCE and STATEMENT.\n\nSOURCE: [\n"}]) self.optional_params: {} kwargs[caching]: False; litellm.cache: None Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/newFinal returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, uselitellm.set_verbose=True'. LiteLLM.Info: If you need to debug this error, use litellm.set_verbose=True'. Logging Details: logger_fn - None | callable(logger_fn) - False Logging Details: logger_fn - None | callable(logger_fn) - False Request to litellm: Logging Details LiteLLM-Failure Calllitellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a RELEVANCE grader; providing the relevance of the given STATEMENT to the given QUESTION.\nRespond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\nA few additional scoring guidelines:\n\n- Long STATEMENTS should score equally well as short STATEMENTS.\n\n- RELEVANCE score should increase as the STATEMENT provides more RELEVANT context to the QUESTION.\n\n- RELEVANCE score should increase as the STATEMENT provides RELEVANT context to more parts of the QUESTION.\n\n- STATEMENT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n- STATEMENT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n- STATEMENT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n- Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the most relevant.\n\n- Never elaborate.\n\nQUESTION: NCE: "}]) Logging Details LiteLLM-Failure Call self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, uselitellm.set_verbose=True'. 
Logging Details: logger_fn - None | callable(logger_fn) - False Logging Details LiteLLM-Failure Call self.failure_callback: [] self.failure_callback: [] 2024-03-14 11:27:31,043:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=2. 2024-03-14 11:27:31,048:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=2. 2024-03-14 11:27:31,051:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=2. self.failure_callback: [] Request to litellm: litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a INFORMATION OVERLAP classifier providing the overlap of information between a SOURCE and STATEMENT.\nFor every sentence in the statement, please answer with this template:\n\nTEMPLATE: \nStatement Sentence: , \nSupporting Evidence: <Choose the exact unchanged sentences in the source that can answer the statement, if nothing matches, say NOTHING FOUND>\nScore: <Output a number between 0-10 where 0 is no information overlap and 10 is all information is overlapping>\nGive me the INFORMATION OVERLAP of this SOURCE and STATEMENT.\n\nSOURCE: ["}]) self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use litellm.set_verbose=True'. Logging Details: logger_fn - None | callable(logger_fn) - False Request to litellm: litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.\nRespond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\nA few additional scoring guidelines:\n\n- Long RESPONSES should score equally well as short RESPONSES.\n\n- Answers that intentionally do not answer the question, such as 'I don't know' and model refusals, should also be counted as the most RELEVANT.\n\n- RESPONSE must be relevant to the entire PROMPT to get a score of 10.\n\n- RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.\n\n- RESPONSE that is RELEVANT to none of the PROMPT should get a score of 0.\n\n- RESPONSE that is RELEVANT to some of the PROMPT should get as score of 2, 3, or 4. Higher score indicates more RELEVANCE.\n\n- RESPONSE that is RELEVANT to most of the PROMPT should get a score between a 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n- RESPONSE that is RELEVANT to the entire PROMPT should get a score of 9 or 10.\n\n- RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 10.\n\n- RESPONSE that confidently FALSE should get a score of 0.\n\n- RESPONSE that is only seemingly RELEVANT should get a score of 0.\n\n- Never elaborate.\n\nPROMPT:VANCE: "}]) Request to litellm:self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, uselitellm.set_verbose=True'. 
litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a RELEVANCE grader; providing the relevance of the given STATEMENT to the given QUESTION.\nRespond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\nA few additional scoring guidelines:\n\n- Long STATEMENTS should score equally well as short STATEMENTS.\n\n- RELEVANCE score should increase as the STATEMENT provides more RELEVANT context to the QUESTION.\n\n- RELEVANCE score should increase as the STATEMENT provides RELEVANT context to more parts of the QUESTION.\n\n- STATEMENT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n- STATEMENT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n- STATEMENT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n- Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the most relevant.\n\n- Never elaborate.\n\nQUESTION: \n\nRELEVANCE: "}]) Logging Details LiteLLM-Failure Call self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use litellm.set_verbose=True'. Logging Details: logger_fn - None | callable(logger_fn) - False Logging Details: logger_fn - None | callable(logger_fn) - False Logging Details LiteLLM-Failure Call Logging Details LiteLLM-Failure Call self.failure_callback: [] self.failure_callback: [] self.failure_callback: [] 2024-03-14 11:27:35,107:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=1. 2024-03-14 11:27:35,110:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=1. 2024-03-14 11:27:35,111:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=1. Request to litellm: litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a RELEVANCE grader; providing the relevance of the given RESPONSE to the given PROMPT.\nRespond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\nA few additional scoring guidelines:\n\n- Long RESPONSES should score equally well as short RESPONSES.\n\n- Answers that intentionally do not answer the question, such as 'I don't know' and model refusals, should also be counted as the most RELEVANT.\n\n- RESPONSE must be relevant to the entire PROMPT to get a score of 10.\n\n- RELEVANCE score should increase as the RESPONSE provides RELEVANT context to more parts of the PROMPT.\n\n- RESPONSE that is RELEVANT to none of the PROMPT should get a score of 0.\n\n- RESPONSE that is RELEVANT to some of the PROMPT should get as score of 2, 3, or 4. Higher score indicates more RELEVANCE.\n\n- RESPONSE that is RELEVANT to most of the PROMPT should get a score between a 5, 6, 7 or 8. 
Higher score indicates more RELEVANCE.\n\n- RESPONSE that is RELEVANT to the entire PROMPT should get a score of 9 or 10.\n\n- RESPONSE that is RELEVANT and answers the entire PROMPT completely should get a score of 10.\n\n- RESPONSE that confidently FALSE should get a score of 0.\n\n- RESPONSE that is only seemingly RELEVANT should get a score of 0.\n\n- Never elaborate.\n\nPROMPT: та Российской Федерации.\n\nRELEVANCE: "}]) self.optional_params: {} kwargs[caching]: False; litellm.cache: None Request to litellm:Final returned optional params: {} self.optional_params: {} Request to litellm: Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, uselitellm.set_verbose=True'.litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a RELEVANCE grader; providing the relevance of the given STATEMENT to the given QUESTION.\nRespond only as a number from 0 to 10 where 0 is the least relevant and 10 is the most relevant. \n\nA few additional scoring guidelines:\n\n- Long STATEMENTS should score equally well as short STATEMENTS.\n\n- RELEVANCE score should increase as the STATEMENT provides more RELEVANT context to the QUESTION.\n\n- RELEVANCE score should increase as the STATEMENT provides RELEVANT context to more parts of the QUESTION.\n\n- STATEMENT that is RELEVANT to some of the QUESTION should score of 2, 3 or 4. Higher score indicates more RELEVANCE.\n\n- STATEMENT that is RELEVANT to most of the QUESTION should get a score of 5, 6, 7 or 8. Higher score indicates more RELEVANCE.\n\n- STATEMENT that is RELEVANT to the entire QUESTION should get a score of 9 or 10. Higher score indicates more RELEVANCE.\n\n- STATEMENT must be relevant and helpful for answering the entire QUESTION to get a score of 10.\n\n- Answers that intentionally do not answer the question, such as 'I don't know', should also be counted as the most relevant.\n\n- Never elaborate.\n\nQUESTION: .\n\nRELEVANCE: "}]) Logging Details: logger_fn - None | callable(logger_fn) - False self.optional_params: {} kwargs[caching]: False; litellm.cache: None litellm.completion(model='claude-3-opus-20240229', messages=[{'role': 'system', 'content': "You are a INFORMATION OVERLAP classifier providing the overlap of information between a SOURCE and STATEMENT.\nFor every sentence in the statement, please answer with this template:\n\nTEMPLATE: \nStatement Sentence: , \nSupporting Evidence: <Choose the exact unchanged sentences in the source that can answer the statement, if nothing matches, say NOTHING FOUND>\nScore: <Output a number between 0-10 where 0 is no information overlap and 10 is all information is overlapping>\nGive me the INFORMATION OVERLAP of this SOURCE and STATEMENT.\n\nSOURCE: ['.\n"}]) Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, use litellm.set_verbose=True'. 
Logging Details: logger_fn - None | callable(logger_fn) - False self.optional_params: {} kwargs[caching]: False; litellm.cache: None Final returned optional params: {} self.optional_params: {} Give Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new LiteLLM.Info: If you need to debug this error, uselitellm.set_verbose=True'.Logging Details LiteLLM-Failure Call Logging Details: logger_fn - None | callable(logger_fn) - False Logging Details LiteLLM-Failure Call Logging Details LiteLLM-Failure Call self.failure_callback: [] self.failure_callback: [] self.failure_callback: [] 2024-03-14 11:27:43,174:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=0. 2024-03-14 11:27:43,178:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=0. 2024-03-14 11:27:43,178:ERROR:litellm request failed <class 'litellm.exceptions.APIConnectionError'>=list index out of range. Retries remaining=0. 2024-03-14 11:27:43,179:WARNING:Feedback Function exception caught: Traceback (most recent call last): File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/feedback.py", line 627, in run result_and_meta, part_cost = Endpoint.track_all_costs_tally( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 496, in track_all_costs_tally result, cbs = Endpoint.track_all_costs( ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 477, in track_all_costs return Endpoint._track_costs( ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 574, in _track_costs result: T = __func(*args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/base.py", line 347, in relevance return self.generate_score( ^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/base.py", line 156, in generate_score response = self.endpoint.run_in_pace( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 308, in run_in_pace raise RuntimeError( RuntimeError: Endpoint litellm request failed 4 time(s): list index out of range list index out of range list index out of range list index out of range The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/feedback.py", line 633, in run raise RuntimeError( RuntimeError: Evaluation of relevance failed on inputs: {'prompt': 'Сколько раз одно и то же лицо может замещать должность ' 'руководителя одной и той же государственной или. 
2024-03-14 11:27:43,181:WARNING:Feedback Function exception caught: Traceback (most recent call last): File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/feedback.py", line 627, in run result_and_meta, part_cost = Endpoint.track_all_costs_tally( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 496, in track_all_costs_tally result, cbs = Endpoint.track_all_costs( ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 477, in track_all_costs return Endpoint._track_costs( ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 574, in _track_costs result: T = func(*args, **kwargs) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/groundedness.py", line 116, in groundedness_measure_with_cot_reasons reason = self.groundedness_provider._groundedness_doc_in_out( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/base.py", line 125, in _groundedness_doc_in_out return self.endpoint.run_in_pace( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 308, in run_in_pace raise RuntimeError( RuntimeError: Endpoint litellm request failed 4 time(s): list index out of range list index out of range list index out of range list index out of range The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/feedback.py", line 633, in run raise RuntimeError( RuntimeError: Evaluation of groundedness_measure_with_cot_reasons failed on inputs: {'source': ['Одно и то же лицо не может замещать должность руководителя одной ' 'и той же\n' 'государств. 
2024-03-14 11:27:43,182:WARNING:Feedback Function exception caught: Traceback (most recent call last): File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/feedback.py", line 627, in run result_and_meta, part_cost = Endpoint.track_all_costs_tally( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 496, in track_all_costs_tally result, cbs = Endpoint.track_all_costs( ^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 477, in track_all_costs return Endpoint._track_costs( ^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 574, in _track_costs result: T = func(*args, kwargs) ^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/base.py", line 271, in qs_relevance return self.generate_score( ^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/base.py", line 156, in generate_score response = self.endpoint.run_in_pace( ^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/provider/endpoint/base.py", line 308, in run_in_pace raise RuntimeError( RuntimeError: Endpoint litellm request failed 4 time(s): list index out of range list index out of range list index out of range list index out of range The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/Users/aabor/projects/rag/venv/lib/python3.11/site-packages/trulens_eval/feedback/feedback.py", line 633, in run raise RuntimeError( RuntimeError: Evaluation of qs_relevance failed on inputs: {'question': 'Сколько раз одно и то же лицо может замещать должность ' 'руководителя одной и той же государственной. 2024-03-14 11:27:43,183:INFO:Context impl SQLiteImpl. 2024-03-14 11:27:43,184:INFO:Will assume non-transactional DDL. 2024-03-14 11:27:43,186:INFO:Context impl SQLiteImpl. 2024-03-14 11:27:43,187:INFO:Context impl SQLiteImpl. 2024-03-14 11:27:43,187:INFO:Will assume non-transactional DDL. 2024-03-14 11:27:43,187:INFO:Will assume non-transactional DDL. 2024-03-14 11:27:43,197:INFO:🛑 feedback result relevance FAILED feedback_result_hash_912d1c23ae37b4b0d3e1ef1fed150351 2024-03-14 11:27:43,197:INFO:🛑 feedback result qs_relevance FAILED feedback_result_hash_5284eda989e2faaeb13bf0f87cc8f235 2024-03-14 11:27:43,197:INFO:🛑 feedback result groundedness_measure_with_cot_reasons FAILED feedback_result_hash_7ae47ad25971

piotrm0 commented 6 months ago

Looks like these models don't like receiving messages with only the "system" role and will complain if no "user" role message is given.

piotrm0 commented 6 months ago

That is, the problem is with the prompts we send when evaluating some of these metrics. Will look into it more.
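For illustration, one possible workaround (a sketch with a hypothetical helper, not the actual fix) is to remap a system-only message list to the "user" role before calling litellm.completion:

def ensure_user_message(messages):
    # Hypothetical helper: if the payload has no "user" message, downgrade the
    # lone "system" message to "user" so providers like Anthropic/TogetherAI accept it.
    if any(m.get("role") == "user" for m in messages):
        return messages
    return [
        {**m, "role": "user"} if m.get("role") == "system" else m
        for m in messages
    ]

# e.g. completion(model=..., messages=ensure_user_message(prompt_messages))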

piotrm0 commented 6 months ago

Does not work:


from litellm import completion

input_dict = {'messages': [{'content': 'You are a RELEVANCE grader; providing the relevance '
                          'of the given CONTEXT to the given QUESTION.\n'
                          'Respond only as a number from 0 to 10 where 0 is '
                          'the least relevant and 10 is the most relevant. \n'
                          '\n'
                          'A few additional scoring guidelines:\n'
                          '\n'
                          '- Long CONTEXTS should score equally well as short '
                          'CONTEXTS.\n'
                          '\n'
                          '- RELEVANCE score should increase as the CONTEXTS '
                          'provides more RELEVANT context to the QUESTION.\n'
                          '\n'
                          '- RELEVANCE score should increase as the CONTEXTS '
                          'provides RELEVANT context to more parts of the '
                          'QUESTION.\n'
                          '\n'
                          '- CONTEXT that is RELEVANT to some of the QUESTION '
                          'should score of 2, 3 or 4. Higher score indicates '
                          'more RELEVANCE.\n'
                          '\n'
                          '- CONTEXT that is RELEVANT to most of the QUESTION '
                          'should get a score of 5, 6, 7 or 8. Higher score '
                          'indicates more RELEVANCE.\n'
                          '\n'
                          '- CONTEXT that is RELEVANT to the entire QUESTION '
                          'should get a score of 9 or 10. Higher score '
                          'indicates more RELEVANCE.\n'
                          '\n'
                          '- CONTEXT must be relevant and helpful for '
                          'answering the entire QUESTION to get a score of '
                          '10.\n'
                          '\n'
                          '- Never elaborate.\n'
                          '\n'
                          'QUESTION: Where is Poland?\n'
                          '\n'
                          'CONTEXT: Germany is a country in Europe.\n'
                          '\n'
                          'RELEVANCE: ',
               'role': 'system'}],
 'model': 'together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1'}

temp = completion(**input_dict)

Works (the only change is the role):


input_dict = {'messages': [{'content': 'You are a RELEVANCE grader; providing the relevance '
                          'of the given CONTEXT to the given QUESTION.\n'
                          'Respond only as a number from 0 to 10 where 0 is '
                          'the least relevant and 10 is the most relevant. \n'
                          '\n'
                          'A few additional scoring guidelines:\n'
                          '\n'
                          '- Long CONTEXTS should score equally well as short '
                          'CONTEXTS.\n'
                          '\n'
                          '- RELEVANCE score should increase as the CONTEXTS '
                          'provides more RELEVANT context to the QUESTION.\n'
                          '\n'
                          '- RELEVANCE score should increase as the CONTEXTS '
                          'provides RELEVANT context to more parts of the '
                          'QUESTION.\n'
                          '\n'
                          '- CONTEXT that is RELEVANT to some of the QUESTION '
                          'should score of 2, 3 or 4. Higher score indicates '
                          'more RELEVANCE.\n'
                          '\n'
                          '- CONTEXT that is RELEVANT to most of the QUESTION '
                          'should get a score of 5, 6, 7 or 8. Higher score '
                          'indicates more RELEVANCE.\n'
                          '\n'
                          '- CONTEXT that is RELEVANT to the entire QUESTION '
                          'should get a score of 9 or 10. Higher score '
                          'indicates more RELEVANCE.\n'
                          '\n'
                          '- CONTEXT must be relevant and helpful for '
                          'answering the entire QUESTION to get a score of '
                          '10.\n'
                          '\n'
                          '- Never elaborate.\n'
                          '\n'
                          'QUESTION: Where is Poland?\n'
                          '\n'
                          'CONTEXT: Germany is a country in Europe.\n'
                          '\n'
                          'RELEVANCE: ',
               'role': 'user'}],
 'model': 'together_ai/mistralai/Mixtral-8x7B-Instruct-v0.1'}

temp = completion(**input_dict)
joshreini1 commented 5 months ago

Hey @aabor - this should be fixed by #1018 and merged/released today. Please take a look!

This change also includes new examples to show use of TogetherAI (in the LiteLLM quickstart) and Claude-3 in its own example notebook.

joshreini1 commented 5 months ago

Hey @aabor - this is released in 0.27.0!

Please upgrade your trulens-eval version (`pip install -U trulens-eval`) and check out the new examples.
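For example (a sketch, assuming the provider interface from the original report is unchanged):

# pip install -U trulens-eval   # 0.27.0 or later
from trulens_eval import Feedback
from trulens_eval.feedback.provider.litellm import LiteLLM as fLiteLLM

# Claude 3 Opus as the feedback provider via LiteLLM, as in the original report.
llm = fLiteLLM(model_engine="claude-3-opus-20240229")
f_qa_relevance = Feedback(llm.relevance).on_input_output()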