langchain-ai / langsmith-sdk

LangSmith Client SDK Implementations
https://smith.langchain.com/

Summary Evaluation is not visible on Experiments WebUI #830

Closed · yusuke-intern closed this issue 1 month ago

yusuke-intern commented 3 months ago

I implemented this following the video below, but the f1-score is not displayed in the web UI. https://www.youtube.com/watch?v=zMgrHzs_cAg

Looking at the evaluators, the calculations are being done, the functions execute properly, and the values are computed correctly. However, the result is not displayed as a chart.

hinthornw commented 3 months ago

Could you please share your code?

yusuke-intern commented 3 months ago

@hinthornw Sure, I used summary_evaluators.

from typing import List

from langsmith import Client
from langsmith.evaluation import evaluate
from langsmith.schemas import Example, Run

client = Client()

def precision(runs: List[Run], examples: List[Example]) -> dict:
    # Note: every mismatch is counted as a false positive here.
    true_positives = sum(1 for run, example in zip(runs, examples) if run.outputs["is_target"] == example.outputs["is_target"])
    false_positives = sum(1 for run, example in zip(runs, examples) if run.outputs["is_target"] != example.outputs["is_target"])
    return {"score": true_positives / (true_positives + false_positives), "key": "precision"}

def recall(runs: List[Run], examples: List[Example]) -> dict:
    # Note: every mismatch is counted as a false negative here.
    true_positives = sum(1 for run, example in zip(runs, examples) if run.outputs["is_target"] == example.outputs["is_target"])
    false_negatives = sum(1 for run, example in zip(runs, examples) if run.outputs["is_target"] != example.outputs["is_target"])
    return {"score": true_positives / (true_positives + false_negatives), "key": "recall"}

def f1_score_summary_evaluator(runs: List[Run], examples: List[Example]) -> dict:
    true_positives = 0
    false_positives = 0
    false_negatives = 0
    for run, example in zip(runs, examples):
        # Matches the output format of the dataset
        reference = example.outputs["is_target"]
        # Matches the output dict of the target function
        prediction = run.outputs.get("is_target")
        if prediction is None:
            false_negatives += 1
            continue

        if reference and prediction == reference:
            true_positives += 1
        elif prediction and not reference:
            false_positives += 1
        elif not prediction and reference:
            false_negatives += 1

    if true_positives == 0:
        return {"key": "f1_score", "score": 0.0}

    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1_score = 2 * (precision * recall) / (precision + recall)
    return {"key": "f1_score", "score": f1_score}

# label_text, correct_label, and dataset_name are defined elsewhere in my project.
results = evaluate(
    lambda inputs: label_text(inputs["body"]),
    data=client.list_examples(dataset_name=dataset_name, as_of="latest"),
    evaluators=[correct_label],
    summary_evaluators=[f1_score_summary_evaluator, precision, recall],
    experiment_prefix="pickup-gemini-subset",
)
yusuke-intern commented 3 months ago

@hinthornw Is the code used in the video the same as the current version? https://www.youtube.com/watch?v=zMgrHzs_cAg

whatever-afk commented 3 months ago

Same issue: the metrics are not displayed in the UI even though results._summary_results shows the EvaluationResult objects with the metrics.

mobiware commented 2 months ago

We're having the same issue. The official documentation says the summary evaluation results should be visible in the experiments UI, but they're not: https://docs.smith.langchain.com/how_to_guides/evaluation/evaluate_llm_application#use-a-summary-evaluator

mobiware commented 2 months ago

Also, the summary evaluation results aren't really exposed in an official manner in the LangSmith SDK. The only way to access them programmatically is results._summary_results, as @whatever-afk said, which is not a public API.
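
For reference, a minimal sketch of that workaround, reusing the evaluate() call from the snippet earlier in this thread (label_text, f1_score_summary_evaluator, and dataset_name are assumed from there):

from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# label_text, f1_score_summary_evaluator, and dataset_name come from the
# snippet posted earlier in this thread.
results = evaluate(
    lambda inputs: label_text(inputs["body"]),
    data=client.list_examples(dataset_name=dataset_name),
    summary_evaluators=[f1_score_summary_evaluator],
)

# Private attribute; not a stable public API and may change between versions.
print(results._summary_results)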

0d431 commented 2 months ago

Same here. Also, summary evaluators do not seem to support returning multiple metrics (an array of dicts under the "results" key, the way per-run evaluators do). Trying this gave me a "'dict' object has no attribute 'dict'" error.

But of course, being able to see even a single one in the UI would be great :)

mobiware commented 2 months ago

@0d431 We didn't have any issue returning multiple evaluation results, though. Here's what our summary evaluator function looks like:

from typing import Sequence

from langsmith.evaluation import EvaluationResult, EvaluationResults
from langsmith.schemas import Example, Run

def evaluate_precision_recall(
    runs: Sequence[Run], examples: Sequence[Example]
) -> EvaluationResults:
    true_positives_count = 0
    detected_count = 0
    reference_count = 0

    for example, run in zip(examples, runs):
        reference = set((example.outputs and example.outputs.get("data")) or [])
        detected = set((run.outputs and run.outputs.get("data")) or [])
        true_positives = detected.intersection(reference)

        true_positives_count += len(true_positives)
        detected_count += len(detected)
        reference_count += len(reference)

    precision = true_positives_count / detected_count
    recall = true_positives_count / reference_count
    f_score = 2 * precision * recall / (precision + recall)

    return EvaluationResults(
        results=[
            EvaluationResult(
                key="precision",
                score=precision,
            ),
            EvaluationResult(
                key="recall",
                score=recall,
            ),
            EvaluationResult(
                key="f_score",
                score=f_score,
            ),
        ]
    )
0d431 commented 2 months ago

Hi @mobiware - thanks for clarifying, much appreciated! I followed the docs at https://docs.smith.langchain.com/old/evaluation/faq/custom-evaluators, which simply return a dict of the form:


return {
    "results": [
        # Provide the key, score, and other relevant information for each metric
        {"key": "correctness", "score": scores_args["correctness"], "comment": scores_args["correctness_reasoning"]},
        {"key": "conciseness", "score": scores_args["conciseness"], "comment": scores_args["conciseness_reasoning"]},
    ]
}

That works for the plain per-run evaluators, but not for the summary evaluators.

whatever-afk commented 2 months ago

The issue is that we are still unable to see the results in the UI, even though the code provided by @mobiware actually outputs the results:

{
  "results": [
    {
      "key": "precision",
      "score": 0.8974358974358975,
      "evaluator_info": {}
    },
    {
      "key": "recall",
      "score": 0.9722222222222222,
      "evaluator_info": {}
    },
    {
      "key": "f_score",
      "score": 0.9333333333333333,
      "evaluator_info": {}
    }
  ]
}
shawnli-capix commented 2 months ago

+1. Facing the same issue as well. Summary metrics can only be viewed by accessing results._summary_results in code, but they don't show up in the web UI.

mobiware commented 2 months ago

In fact, as of today I can see the summary evaluation in the LangSmith web UI, including for past experiments!

@hinthornw Should we open a separate ticket for the other issue mentioned here, i.e. that after running evaluate() the summary results are only available on ExperimentResults._summary_results, which is a private attribute? It would be nice to have a public accessor for it.

shawnli-capix commented 2 months ago

@mobiware What did you change that makes the summary evaluation show up in the LangSmith Web UI?

mobiware commented 2 months ago

@shawnli-capix I didn't do anything. I just noticed today that it was there.

hinthornw commented 1 month ago

Hi all! There were a few versions of the langsmith SDK where the summary metrics were no longer being reported. If you upgrade to any of the past 5 or so releases, this should be fixed.
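
For example, you can check which SDK version is installed with something like:

from importlib.metadata import version

# Prints the installed langsmith package version; upgrade if it's an older release.
print(version("langsmith"))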

hinthornw commented 1 month ago

I believe this was resolved by the aforementioned changes. Let me know if you believe otherwise.