open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Discuss modeling data lineage/provenance in OTEL #3447

Open m-hogue opened 1 year ago

m-hogue commented 1 year ago

What are you trying to achieve?

We already use OTEL to collect and persist metrics, logs, and traces, through the OTEL operator and collector. It's an amazing way to standardize on observability APIs and to decouple metric collection from its persistence and retrieval.

I want to discuss another potential top-level Signal in OTEL for Data Lineage, which is sometimes referred to as data provenance. Data lineage is useful in cases where you want to explain the precise path that records of data traverse across your architecture and where data manipulations occur. Visualizing/analyzing this enables you to identify potential transit or processing inefficiencies in your architecture. It's also an extremely valuable signal in zero-trust environments where you want to closely observe where data transits, gets modified, or is otherwise touched within and external to your organization, so you can verify your zero-trust controls globally.

I see data lineage as a fundamentally different signal from traces. Traces capture fine-grained details about a specific round-trip function call or request; that is, they aim to capture everything that happens between a question and an answer. Data lineage captures a point-in-time acknowledgement that a piece of data was seen, modified, duplicated, etc. at some location in your end-to-end architecture.

There are a few examples of provenance/lineage in existing projects. These projects could provide a starting point for how lineage can be modeled and used:

I raised this idea with a few OTEL maintainers at KubeCon EU last week and spoke with @carlosalberto, who kindly requested that I tag him in this issue. In that discussion, it was proposed that perhaps data lineage could be modeled in the existing OTEL Event API. While true, unless a standard model is adopted for representing lineage in the Event API, it will be challenging to use a consistent model across teams/applications/organizations in your enterprise. Since the idea is to create an end-to-end data lineage signal across the global ecosystem, a natively defined lineage model in OTEL would be a nice solution to this challenge.

I'm happy to provide additional context or justification for any of this, and I'm able to contribute as well.

lesterhaynes commented 3 months ago

I'm interested in this, especially with where the industry is heading with GenAI. What feedback have you gotten so far on this from the community? Do folks consider it in-scope for the project? Is there a working group?

pyohannes commented 3 months ago

In this discussion, it was proposed that perhaps data lineage could be modeled in the existing OTEL Event API. While true, unless a standard model is adopted for representing lineage in the event API, it will be challenging to use a consistent model across teams/applications/organizations in your enterprise.

One possible extension to this solution (representing lineage events with the existing Event API) would be to standardize the event structure in semantic conventions. This could provide a standard model for representing lineage events via the Event API.
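As a rough illustration of what a standardized lineage event structure might look like, here is a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the event name `data.lineage`, the `lineage.*` attribute keys, the set of operations, and the `Event` class itself, which is a stand-in and not the OpenTelemetry Event API. No such semantic convention is defined in this thread.

```python
import time
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical names: none of these are existing semantic conventions.
LINEAGE_EVENT_NAME = "data.lineage"
REQUIRED_KEYS = ("lineage.record_id", "lineage.operation", "lineage.location")
ALLOWED_OPERATIONS = {"seen", "modified", "duplicated"}


@dataclass(frozen=True)
class Event:
    """Stand-in for an event: a name, attributes, and a timestamp."""
    name: str
    attributes: Dict[str, str]
    timestamp_ns: int = field(default_factory=time.time_ns)


def lineage_event(record_id: str, operation: str, location: str) -> Event:
    """Build a point-in-time lineage event with the agreed attribute keys."""
    if operation not in ALLOWED_OPERATIONS:
        raise ValueError(f"unknown lineage operation: {operation!r}")
    return Event(
        name=LINEAGE_EVENT_NAME,
        attributes={
            "lineage.record_id": record_id,
            "lineage.operation": operation,
            "lineage.location": location,
        },
    )


def is_valid_lineage_event(event: Event) -> bool:
    """Check that an event follows the (hypothetical) lineage convention."""
    return event.name == LINEAGE_EVENT_NAME and all(
        key in event.attributes for key in REQUIRED_KEYS
    )
```

The point of the sketch is only that a shared set of attribute keys lets different teams emit and validate lineage events the same way, without a new signal type.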

If enough people are interested, maybe a workgroup could be formed around this? Semantic conventions can be added without too much effort. If it turns out that this is not satisfactory, (more informed) discussions about introducing an additional signal are still possible.

tedsuo commented 3 months ago

If all that is needed to implement Data Lineage are semantic conventions for events, please go ahead and make an issue in the semantic convention repo.

If Data Lineage is a domain that requires context propagation in order to be implemented well – in other words, it needs to be its own top-level signal – then this project would be out of scope for OpenTelemetry, but an ideal candidate for a separate project that builds its own API on top of Baggage. The Baggage API was designed to allow 3rd party cross-cutting concerns to propagate information without needing to reinvent the wheel. If it's a big enough project, I'd suggest applying to the CNCF for sandbox status.
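To make the Baggage idea concrete, here is a small sketch of how a lineage identifier could travel between services as a W3C-style `baggage` header: comma-separated `key=value` members with percent-encoded values. It uses only the standard library so it stays self-contained; a real project would presumably use the OpenTelemetry Baggage API and propagators rather than hand-rolled parsing. The `lineage.*` keys are hypothetical.

```python
from typing import Dict
from urllib.parse import quote, unquote


def encode_baggage(entries: Dict[str, str]) -> str:
    """Serialize entries as comma-separated key=value pairs, percent-encoding values."""
    return ",".join(
        f"{key}={quote(value, safe='')}" for key, value in entries.items()
    )


def decode_baggage(header: str) -> Dict[str, str]:
    """Parse a baggage-style header, ignoring optional ';property' suffixes."""
    entries: Dict[str, str] = {}
    for member in header.split(","):
        key_value = member.split(";", 1)[0].strip()  # drop optional properties
        if "=" not in key_value:
            continue  # skip empty or malformed members
        key, _, value = key_value.partition("=")
        entries[key.strip()] = unquote(value.strip())
    return entries


# Hypothetical keys: the upstream service attaches lineage context...
outgoing = encode_baggage({"lineage.record_id": "rec-42", "lineage.origin": "ingest"})
# ...and the downstream service reads it back from the incoming header.
incoming = decode_baggage(outgoing)
```

A separate lineage project could layer its own API over this kind of propagated context, which is the design the comment above suggests.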

alolita commented 3 weeks ago

Hi @m-hogue, data provenance and lineage in the AI application context would be useful to define with the existing tracing and event specs. Please join our LLM workgroup, held every Wednesday, to help define semantic conventions. Ping me on Slack; I'd definitely be interested in what you have in mind so far.