mlflow / mlflow

Open source platform for the machine learning lifecycle
https://mlflow.org
Apache License 2.0
18.78k stars 4.24k forks

[FR] add metrics for traces #12928

Open harshilprajapati96 opened 3 months ago

harshilprajapati96 commented 3 months ago

Willingness to contribute

Yes. I would be willing to contribute this feature with guidance from the MLflow community.

Proposal Summary

For LLM traces, particularly those produced with the MlflowLangchainTracer, it would be beneficial to support metrics as key/value pairs that can be attached with each invocation and displayed as columns.

Motivation

What is the use case for this feature?

Adding key/value pairs for metrics to traces allows users to efficiently query and investigate traces based on specific metric values or thresholds. For instance, if a user wants to examine traces where a certain metric exceeds a predefined value, having metrics attached to traces as key/value pairs enables precise and targeted querying.

Why is this use case valuable to support for MLflow users in general?

Supporting this use case is valuable for MLflow users because it enhances the ability to perform detailed analysis and diagnostics of their experiments. Users can quickly locate and analyze traces that meet certain criteria, which helps in understanding model performance, identifying issues, and making informed decisions based on metrics.

Why is this use case valuable to support for your project(s) or organization?

For our project, capturing metrics as key/value pairs with traces is crucial for efficient investigation and analysis. It simplifies the process of identifying and analyzing traces based on specific performance metrics or thresholds, leading to quicker insights and improved decision-making. This feature supports our objective of delivering a more effective and user-friendly AI platform by streamlining the trace analysis process.

Why is it currently difficult to achieve this use case?

Currently, the absence of a structured way to attach and query metrics within traces limits the ability to perform targeted investigations. Without this feature, users must manually sift through large volumes of trace data or use external tools to filter and analyze metrics, which can be time-consuming and prone to errors. Implementing key/value pairs for metrics within traces would address this gap and provide a more streamlined and efficient solution.

Details

One option would be to add a span.set_metrics API, used like this:

mlflow_tracer = MlflowLangchainTracer()
with mlflow.start_span(name="evaluate") as span:
    span.set_inputs(input)
    output = chain.invoke(input, config={"callbacks": [mlflow_tracer]})
    span.set_outputs(output)
    metrics = evaluate_chain(expected, output)
    span.set_metrics(metrics)  # proposed API, does not exist yet

The UI would then show all metrics as columns, as it already does for runs, and metrics would be queryable via the client.
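To make the intent of "queryable via client" concrete, here is a minimal conceptual sketch in plain Python (no MLflow dependency). The trace records and the filter_traces helper are hypothetical stand-ins for what a metric filter on traces could look like:

```python
# Hypothetical trace records: if each trace carried a metrics dict,
# a client could filter on thresholds directly.
traces = [
    {"request_id": "tr-1", "metrics": {"faithfulness": 0.92, "latency_s": 1.4}},
    {"request_id": "tr-2", "metrics": {"faithfulness": 0.41, "latency_s": 0.9}},
    {"request_id": "tr-3", "metrics": {"faithfulness": 0.88, "latency_s": 2.7}},
]

def filter_traces(traces, metric, threshold):
    """Return traces whose named metric exceeds the given threshold."""
    return [t for t in traces if t["metrics"].get(metric, float("-inf")) > threshold]

high = filter_traces(traces, "faithfulness", 0.85)
print([t["request_id"] for t in high])  # -> ['tr-1', 'tr-3']
```

The point is that a structured metrics field turns "sift through trace data by hand" into a single threshold query.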


WeichenXu123 commented 3 months ago

One question:

What if you add the metric values via span.set_outputs or span.set_attributes? Is there an essential difference that prevents a metric from being added as an output/attribute?

Q2: Tracing is for LLM inference, which is a performance-sensitive workload. In general, LLM inference tracing should avoid introducing too much overhead. If you want to evaluate metrics, which is a heavy workload, tracing does not seem like a good place to do it (adding similar functionality to mlflow.models.evaluate may be better).

BenWilson2 commented 3 months ago

cc @B-Step62

harshilprajapati96 commented 3 months ago

Right now I am just starting an overarching span and then adding the metrics as its attributes:

mlflow_tracer = MlflowLangchainTracer()
with mlflow.start_span(name="evaluate") as span:
    span.set_inputs(input)
    output = chain.invoke(input, config={"callbacks": [mlflow_tracer]})
    span.set_outputs(output)
    metrics = evaluate_chain(expected, output)
    span.set_attributes(metrics)

This works for now, but having queryable metrics would be nice.

How do I set tags on the span? I tried adding tags to the context passed to MlflowLangchainTracer, but that doesn't work.

I feel that having metrics on traces would make debugging easier, even if we could only record the trace id at invocation time and calculate and attach the metrics later. We use Langfuse scores for that right now: https://langfuse.com/docs/scores/overview
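The "record the trace id, score later" pattern can be sketched in plain Python (no MLflow or Langfuse dependency; the scores store and record_score helper are illustrative, not an existing API):

```python
# Post-hoc scoring sketch: capture only the trace/request id at inference
# time, compute metrics afterwards, and attach them in a separate store
# keyed by that id (similar in spirit to Langfuse scores).
scores = {}  # request_id -> {metric_name: value}

def record_score(request_id, name, value):
    """Attach a named metric value to a previously captured trace id."""
    scores.setdefault(request_id, {})[name] = value

# At inference time we only remember the id; scoring happens afterwards.
request_id = "tr-42"
record_score(request_id, "faithfulness", 0.87)
record_score(request_id, "answer_relevance", 0.93)

print(scores["tr-42"])  # -> {'faithfulness': 0.87, 'answer_relevance': 0.93}
```

This keeps heavy metric computation out of the inference path, which also addresses the overhead concern raised above.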

WeichenXu123 commented 3 months ago

For setting tags, you can use:

MlflowClient().set_trace_tag(span.request_id, "key", "value")

harshilprajapati96 commented 2 months ago

Any thoughts on this FR?

github-actions[bot] commented 2 months ago

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.