Library-specific metric semantic conventions

carlosalberto commented 2 years ago

Currently, many of the Collector receivers report metrics from existing libraries (e.g. Postgresql, Redis). These can be roughly included in two categories:

Potentially common metrics (messages waiting for any message queue, number of requests in any http server)
Very specific metrics (partitions in N-different states, library specific error threshold)

Far too many metrics fall in the second category, and a recurring question is whether the Specification itself should contain those metrics, or whether they should be listed somewhere else, given they are library specific (and the OTel community may not have enough expertise to define them in the best way).

jack-berg commented 2 years ago

I think this is a great topic to discuss.

This problem isn't unique to the collector. Java has:

A general purpose jmx metric gatherer.
A general purpose micrometer shim for bridging metrics from the popular micrometer framework.
I've recently been working on bridging kafka client metrics into OpenTelemetry.

All of these are in a similar position as the collector's receivers in that they bridge in telemetry from other monitoring systems. Does that telemetry need to be codified in the semantic conventions? I say no.

The telemetry data collected by these components is certainly useful, but we don't control it at the point of instrumentation. We are bound by what data is exposed by the systems being bridged, and that data may change from version to version. The semantic conventions carry guarantees of stability that are likely stronger than the guarantees made by the systems we're bridging.

I suspect what folks are really after by trying to codify these in semantic conventions is some way to offer stronger stability guarantees to their users. But I think this is folly. Consider the java micrometer shim. We could inspect the metrics produced by the shim and come up with a list of metrics to add to the semantic conventions, but the names of the metrics we codify could change with the next release of micrometer. This situation isn't very different than trying to monitor kafka brokers with the collector. The kafka semantic conventions can't stop kafka from adding, removing, or changing the metrics in a future version.

I think if we want to make stronger guarantees in these telemetry bridging components, they should be rule based. I.e. we guarantee the translation of telemetry from a source to opentelemetry should be consistent such that if the source produces the same telemetry, it should be translated to the opentelemetry data model deterministically. This is actually the type of thing we do when we describe the rules for mapping prometheus to the metrics data model. That's effectively what is happening in each one of these bridging components: we want to monitor a system that has its own data model for emitting telemetry, and map that to the open telemetry data model. We can describe the rules for how that translation takes place, but we can't make stronger guarantees about the specifics of the resulting telemetry because we don't control the instrumentation at the source.

This suggests that we need a clearer definition of the scope of semantic conventions. I don't think the idea that "all telemetry produced by any opentelemetry component should be codified in semantic conventions" is tractable. The data provided by these bridging components is definitely useful and we should definitely collect it, but the quantity of telemetry that would have to be parsed and the lack of stability guarantees make it a poor fit for semantic conventions.

I think we should limit the scope of semantic conventions to govern instrumentation written at the source. If you can directly instrument, inject instrumentation, or wrap a library to add instrumentation, that should be codified in the semantic conventions. If you're simply bridging data from some other data model, you should not try to codify that in the semantic conventions. If users want stronger stability guarantees about bridged data, make guarantees about the translation semantics.

djaglowski commented 2 years ago

I agree with the sentiment that most technology specific metrics should not be governed by the semantic conventions, and that rules or guidelines could be established to govern how such metrics are named and managed.

Roughly, I think the following would be a good start:

Metric names should follow general naming guidelines established in the semantic conventions.
Each metric that is named by an OpenTelemetry domain (eg collector or library) should have a declared stability level. (experimental, stable, deprecated, removed)
The threshold for establishing a semantic convention for a specific metric is that multiple domains (eg. collector and java instrumentation) will emit meaningfully the same metric.

tigrannajaryan commented 2 years ago

I think one general guidance should be that we always add more generic semantic conventions before adding more specific ones.

So, for example the generic "database" conventions should be added before "Postgres" or "Mysql" are added, generic "messaging" conventions should added before "Kafka" or "Kinesis" are added, generic "os" conventions should be added before "windows" or "linux" are added.

In some cases the more generic conventions will be actually sufficient to express what the more specific convention aims to capture. Yes, unfortunately writing generic semantic conventions may be more difficult: you have to be a subject matter expert on a class of products, not just one product to know how to generalize correctly.

This comment doesn't answer the question of how specific or generic a semantic convention should be to be allowed a place in Otel spec, but at least it provides some guidance on prioritization of this work and can also be the reason why a specific convention PR is rejected (because the generic one doesn't exists yet).

tigrannajaryan commented 2 years ago

I suspect what folks are really after by trying to codify these in semantic conventions is some way to offer stronger stability guarantees to their users. But I think this is folly. Consider the java micrometer shim. We could inspect the metrics produced by the shim and come up with a list of metrics to add to the semantic conventions, but the names of the metrics we codify could change with the next release of micrometer. This situation isn't very different than trying to monitor kafka brokers with the collector. The kafka semantic conventions can't stop kafka from adding, removing, or changing the metrics in a future version.

One tangential way to help with this is Telemetry Schemas. By allowing to formalize this changes in Schemas we make the change more manageable. What we are likely missing here is the ability to derive and extend Schemas from Otel Schemas. This Kafka can independently maintain, evolve and publish its own schema that is derived from Otel schema (which includes generic conventions about messaging systems).

I don't think the idea that "all telemetry produced by any opentelemetry component should be codified in semantic conventions" is tractable.

I agree. It is an impossible mountain to climb.

I think we should limit the scope of semantic conventions to govern instrumentation written at the source. If you can directly instrument, inject instrumentation, or wrap a library to add instrumentation, that should be codified in the semantic conventions. If you're simply bridging data from some other data model, you should not try to codify that in the semantic conventions. If users want stronger stability guarantees about bridged data, make guarantees about the translation semantics.

And also guarantees about how that bridged (foreign?) data evolves over time, aka derived Schemas.

lmolkova commented 2 years ago

One tangential way to help with this is Telemetry Schemas. By allowing to formalize this changes in Schemas we make the change more manageable. What we are likely missing here is the ability to derive and extend Schemas from Otel Schemas.

I would second this. In Azure SDK, we're trying to align with OTel conventions, but no matter how hard we try, we'll have some extensions (attributes, additional spans, events, and metrics).

With the current state of affairs I can get all or nothing:

keep all our conventions separate: re-declare HTTP and other conventions so we can extend them, and define our own schema files that copy OTel ones.
Put Azure SDK extensions into this repo and leverage all the tools, schemas, and other goodness.

the perfect solution as I see it:

[nice-to-have] external conventions can extend otel conventions: attribute references work across repos, build tools generate markdown and semconv code
external schema files can extend OTel schema file (i.e. I can have http://aka.ms/schemas/1.0.0/ that only refers to http://opentelemetry.io/schemas/1.42.0/) and evolve independently.
we have a registry of external conventions and extensions similarly to exporters and other components

carlosalberto commented 2 years ago

In some cases the more generic conventions will be actually sufficient to express what the more specific convention aims to capture. Yes, unfortunately writing generic semantic conventions may be more difficult: you have to be a subject matter expert on a class of products, not just one product to know how to generalize correctly.

I overall agree with this @tigrannajaryan - but at the same time I'm afraid nothing will be done, giving the high bar, i.e. nobody will work, review, approve nor continue working given this barrier. So we should try to find some balance, and being flexible.

I had mentioned that we should try to define conventions for metrics that are dead obvious, while making the rest specific as a starting point, and I still think that's the way to go.

carlosalberto commented 2 years ago

@djaglowski I like your proposal as a starting point. One thing I'd like to see is that OTel components (receivers, instrumentation) include a very short but clear document with the used metrics, so at least that's very visible.

jsuereth commented 2 years ago

If i can recap the discussion that I've seen up to this point, I think I see the following open questions:

We need to, more clearly, outline the scope of what we want to provide in Semantic Conventions for otel.
- All they all-encompassing of EVERYTHING that would be produced by otel instrumentation, or just a guaranteed subset?
- e.g. can I add additional http metrics beyond what is in the standard conventions in my otel instrumentation?
- Should we remove / create a new space for "implementation specific" metrics, e.g. mongodb specific or application specific things?
- Should we take any stance on what to do with "legacy" metrics, or those that predate our standards?
How do we document metrics provided by OTEL?
- As @djaglowski points out, we should clearly denote what is stable vs. experimental, etc. and support levels
- We should aim for some consistency across projects for how to determine what metrics are exposed and their labels (e.g. the collector has metadata.xml files, does it make sense for us to standardize on this idea and provide documentation-generation / other tooling against it? Can we somehow extent telemetry schema with these files?
Should we formalize the process (and requirements) for lifting a specific instrumentation libraries' metrics into common semantic conventions? (e.g. http, rpc, etc.). @djaglowski proposes we need 3 or more examples, I'm inclined to agree.
Should OTEL provide more than one telemetry schema URL / source in its instrumention? When does this happen?
- E.g. can Java SDK host its own telemetry schema + metadata around garbage collection metrics?

Perhaps I'm expanding this too far, but I think the first, and most important question for this issue is:

Should we provide a new location for projects to define their Metric Definitions (stability, names, labels) that is not the overall semantic conventions / otel specification?

I believe the answer to this is yes, in which case we basically need:

A way to denote what metrics are being generated, and their current support level (experimental, stable, etc.)
New telemetry schema URL locations provided by OTEL
(longer term) a set of tooling around defining metrics that simplifies documentation, testing and telemetry-schema evolution.

Is this an appropriate sum up?

jsuereth commented 2 years ago

So in the latest SiG we discussed and agreed that this direction is where we want to go.

So, to follow up that means we should (likely) have the following sub-projects here:

A (shared across OTEL) mechanism to denote WHAT telemetry is being produced. Collector's metadata.yaml file is a great starting point.
A process / location / tooling for defining new telemetry schema that pairs for supportability of metrics.
Tooling that produces documentation for exposed telemetry (our build-tools for specification semantic conventions, can we re-use these in library local places, e.g.)?

If we agree, I'd like to fragment three sub-issues for each of these (to hopefully get mulitple owners for different pieces) where we can solve each issue.

@djaglowski I think your proposal is best situated at the notion of defining WHAT telemetry is produced and a consistent way of doing that across the board. Would you be willing to lead that component/proposal?

djaglowski commented 2 years ago

@djaglowski I think your proposal is best situated at the notion of defining WHAT telemetry is produced and a consistent way of doing that across the board. Would you be willing to lead that component/proposal?

@jsuereth, yes, I can take the lead on this component. I can get started on this next week.

tigrannajaryan commented 2 years ago

I put some more thought into this. I think we can do this (apologies for a longs list but I think this is all needed):

There will be Core Otel semantic conventions (semconv for short), defined in this repository. This is what already have, identified by https://opentelemetry.io/schemas/* schema family.
There will be Derived semconv, defined elsewhere (within Otel or outside Otel).
A Derived semconv will always reference its Base semconv. For example Collector SIG may have its own set of semantic conventions to record Collector-specific metrics. Similarly Java SGI may have their own conventions to record JVM metrics. The Base semconv for both of these cases will be the Core Otel semconv.
All definitions present in the Base semconv are implicitly inherited by the Derived semconv, i.e. telemetry that is emitted according to Collector semconv can use all Otel Core semconv attributes, metrics, etc. (The opposite obviously is not true).
All names in Derived semconv are namespaced. This includes metric and span names and all attribute names. All such names must be prefixed by the namespace value allocated to the Deribed semconv. For example namespace java.* may be allocated to Java SIG and the SIG can define java.jvm.foo as a metric name or java.gc.bar as a span attribute name.
Derived semconv MUST NOT define any conventions that use names with prefixes that are part of the Base semconv.
To facilitate namespace separation and avoid conflicts the Base semconv must specify which namespaces are prohibited to be used by Dervied semconv (i.e. reserved for Base usage only).
When a Derived semconv is created its namespace(s) must be registered with the Base semconv. "Registration" here means maintenance of the list of namespace allocations (e.g. simply a list of namespaces in our specification repo is fine for Otel Core semconv).
Once namespace is allocated it cannot be used by other Derived semconvs that wish to derive from the same Base.
Schema File format will be modified to allow specifying a based_on setting to point to the Schema URL of the Base semconv. "based_on" can be empty, indicating a root of a lineage of semconvs.
Otel Core is the root of OpenTelemetry lineage. All semconvs Derived from Otel Core will publish a Schema File that has based_on: [https://opentelemetry.io/schemas/](https://opentelemetry.io/schemas/*)<version_number> entry.
Derived semconv can evolve independently from its Base. New attributes may be introduced, attributes can be renamed, etc. When changes happen a corresponding new Schema File will be published for the Derived semconv. The new Schema File will continue to reference the Base Schema URL from which it branched.
All versions of Derived Schema File MUST reference based_on Schema URL that belongs to the same Schema Family (changing families is prohibited).
As Base semconv evolves and publishes new versions of Schema Files, the Derived semconv can make an independent decision to "rebase" to a newer version of its Base. When this happens the semantic conventions inherited from the Base are implicitly updated to the new version defined by the Base. The Derived semconv MUST publish a new Schema File with the value of based_on setting reflecting the new version of the Base it now uses.
It is possible to transform telemetry from any version of the Derived semconv to another version using its Schema File (subject to regular limitations of Schema Files, i.e. reversibility when downgrading). The transformations are inherited from the Base. Transformations from the Base between inheritance points of different Derived versions will be applicable (this requires a diagram to make clear, but I think it should work).
It is possible to transform telemetry from any version of Derived semconv to any version of its Base's family (requires a diagram, but again, I think this should be possible). This allows to receive for example telemetry from Otel Collector referencing derived Schema URL http://opentelemetry.io/collector/schemas/1.5.0 that is based on [http://opentelemetry.io/schemas/1.1.0](http://opentelemetry.io/collector/schemas/1.5.0) and convert to [http://opentelemetry.io/collector/schemas/1.2.0](http://opentelemetry.io/collector/schemas/1.5.0). Essentially an entire family of independently evolved libraries can emit telemetry and we can normalize it to a particular version of Otel semconv - this is very important for recipients of telemetry, who only need to know about the root family of any lineage and don't need to know anything specifically about Derived schemas, except being able to fetch their Schema Files.

This definitely needs to become an OTEP (also to confirm that some of my speculations above are really doable) but before doing that I wanted to post this to see what others think.

open-telemetry / opentelemetry-specification

Library-specific metric semantic conventions #2610