truera / trulens

Evaluation and Tracking for LLM Experiments
https://www.trulens.org/
MIT License

[BUG] Metric calculation and corresponding interpretation #1290

Closed kmr666 closed 2 months ago

kmr666 commented 2 months ago

Bug Description
What happened?
1) I tried using different frameworks, Ragas and TruLens, to calculate context_relevance for my datasets, but the two frameworks gave different results. Is this because of different calculation methods?
2) My understanding is that TruLens uses an LLM to perform the evaluation. Does the evaluation in TruLens only require specifying an LLM, without needing an embedding model to compute the metrics?
3) What are the relationships and differences between the context_relevance and context_relevance_with_cot_reasons metrics in TruLens?

To Reproduce
Which steps should someone take to run into the same error? A small, reproducible code example is useful here.

Expected behavior A clear and concise description of what you expected to happen.

Relevant Logs/Tracebacks Please copy and paste any relevant log output. This will be automatically formatted into code, so no need for backticks. If the issue is related to the TruLens dashboard, please also include a screenshot.

Environment:

Additional context Add any other context about the problem here.

dosubot[bot] commented 2 months ago
  1. Ragas and TruLens give different results for context_relevance because they use different models and methodologies for evaluating context relevance. TruLens supports judge models such as GPT-3.5, GPT-4, and Claude-2, each with its own implementation of the context_relevance feedback function, and differences in model architecture, training data, and evaluation technique can lead to different results [1][2].

  2. TruLens does use LLMs (Large Language Models) to perform the evaluation. Evaluation in TruLens only requires specifying an LLM; it does not require an embedding model to compute these metrics [2].

  3. The context_relevance metric in TruLens is a straightforward measure of how relevant a given context is to a query. In contrast, context_relevance_with_cot_reasons adds chain-of-thought reasoning steps to justify the relevance score. This extra reasoning can yield more nuanced and potentially more accurate assessments, but it also introduces variability depending on how the reasoning is generated and interpreted by the model [1][3]. A minimal setup sketch for both variants follows below.
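To make points 2 and 3 concrete, here is a minimal sketch of how both feedback functions can be declared with nothing more than a judge LLM. The custom-app structure (an instrumented `retrieve` method), the Select path, and the model name are illustrative assumptions; exact imports and selectors depend on your trulens_eval version and application.

```python
import numpy as np

from trulens_eval import Feedback, Select
from trulens_eval.feedback.provider import OpenAI

# The judge LLM: only a provider is needed, no embedding model.
# model_engine="gpt-4" is an assumption; use whichever model you evaluate with.
provider = OpenAI(model_engine="gpt-4")

# Assumed app structure: a custom app whose instrumented `retrieve`
# method returns the retrieved context chunks.
context = Select.RecordCalls.retrieve.rets

# Score only: a relevance value in [0, 1] per (query, context chunk) pair.
f_context_relevance = (
    Feedback(provider.context_relevance, name="Context Relevance")
    .on_input()          # the user query
    .on(context)         # the retrieved context chunks
    .aggregate(np.mean)  # average across chunks
)

# Same score, but the judge LLM is also prompted for a chain-of-thought
# justification, which appears as "reasons" in the TruLens dashboard.
f_context_relevance_cot = (
    Feedback(provider.context_relevance_with_cot_reasons, name="Context Relevance (CoT)")
    .on_input()
    .on(context)
    .aggregate(np.mean)
)
```

Used this way, both feedback functions only ever call the provider LLM; no embedding model is involved.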

To continue talking to Dosu, mention @dosu.

sfc-gh-jreini commented 2 months ago

@dosu is right here. We will be adding new context relevance benchmarks to our benchmarks page soon.
