carlosalberto opened this issue 2 years ago
Currently, many of the Collector receivers report metrics from existing libraries (e.g. PostgreSQL, Redis). These can be roughly divided into two categories:
Far too many metrics fall into the second category, and a recurring question is whether the Specification itself should contain those metrics, or whether they should be listed somewhere else, given that they are library specific (and the OTel community may not have enough expertise to define them in the best way).
I think this is a great topic to discuss.
This problem isn't unique to the collector. Java has:
All of these are in a similar position to the collector's receivers, in that they bridge in telemetry from other monitoring systems. Does that telemetry need to be codified in the semantic conventions? I say no.
The telemetry data collected by these components is certainly useful, but we don't control it at the point of instrumentation. We are bound by what data is exposed by the systems being bridged, and that data may change from version to version. The semantic conventions carry guarantees of stability that are likely stronger than the guarantees made by the systems we're bridging.
I suspect what folks are really after by trying to codify these in semantic conventions is some way to offer stronger stability guarantees to their users. But I think this is folly. Consider the java micrometer shim. We could inspect the metrics produced by the shim and come up with a list of metrics to add to the semantic conventions, but the names of the metrics we codify could change with the next release of micrometer. This situation isn't very different than trying to monitor kafka brokers with the collector. The kafka semantic conventions can't stop kafka from adding, removing, or changing the metrics in a future version.
I think if we want to make stronger guarantees in these telemetry bridging components, they should be rule based. That is, we guarantee that the translation of telemetry from a source to OpenTelemetry is consistent: if the source produces the same telemetry, it is translated to the OpenTelemetry data model deterministically. This is actually the type of thing we do when we describe the rules for mapping Prometheus to the metrics data model. That's effectively what is happening in each one of these bridging components: we want to monitor a system that has its own data model for emitting telemetry, and map that to the OpenTelemetry data model. We can describe the rules for how that translation takes place, but we can't make stronger guarantees about the specifics of the resulting telemetry, because we don't control the instrumentation at the source.
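As an illustration of what a rule-based guarantee could look like (this sketch is mine, not from the thread; the two rules are hypothetical stand-ins, loosely in the spirit of the Prometheus mapping rules), the key property is that the output depends only on the input and the published rules:

```go
// Sketch of a deterministic, rule-based metric-name translation.
// The two rules here are hypothetical; the point is that the output
// depends only on the input and the published rules, so the guarantee
// survives the source adding, removing, or renaming metrics.
package main

import (
	"fmt"
	"strings"
)

// translateName maps a source metric name to an OpenTelemetry-style name.
// No per-product lookup tables: the same input always yields the same output.
func translateName(sourceName string) string {
	name := strings.ToLower(sourceName)       // rule 1: lower-case the name
	name = strings.ReplaceAll(name, "_", ".") // rule 2: '_' becomes '.' namespacing
	return name
}

func main() {
	fmt.Println(translateName("Kafka_Broker_BytesInPerSec"))
	// Output: kafka.broker.bytesinpersec
}
```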
This suggests that we need a clearer definition of the scope of semantic conventions. I don't think the idea that "all telemetry produced by any opentelemetry component should be codified in semantic conventions" is tractable. The data provided by these bridging components is definitely useful and we should definitely collect it, but the quantity of telemetry that would have to be parsed and the lack of stability guarantees make it a poor fit for semantic conventions.
I think we should limit the scope of semantic conventions to govern instrumentation written at the source. If you can directly instrument, inject instrumentation, or wrap a library to add instrumentation, that should be codified in the semantic conventions. If you're simply bridging data from some other data model, you should not try to codify that in the semantic conventions. If users want stronger stability guarantees about bridged data, make guarantees about the translation semantics.
I agree with the sentiment that most technology specific metrics should not be governed by the semantic conventions, and that rules or guidelines could be established to govern how such metrics are named and managed.
Roughly, I think the following would be a good start:
I think one general guideline should be that we always add more generic semantic conventions before adding more specific ones.
So, for example, the generic "database" conventions should be added before "Postgres" or "MySQL" are added, generic "messaging" conventions should be added before "Kafka" or "Kinesis" are added, and generic "os" conventions should be added before "windows" or "linux" are added.
In some cases the more generic conventions will actually be sufficient to express what the more specific convention aims to capture. Yes, unfortunately, writing generic semantic conventions may be more difficult: you have to be a subject matter expert on a class of products, not just one product, to know how to generalize correctly.
This comment doesn't answer the question of how specific or generic a semantic convention should be to be allowed a place in the Otel spec, but at least it provides some guidance on prioritization of this work, and can also be the reason why a specific convention PR is rejected (because the generic one doesn't exist yet).
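To make the generic-before-specific point concrete, here is a minimal sketch in Go (mine, not from the thread) using the real `go.opentelemetry.io/otel/attribute` package; the `db.system` / `db.statement` keys follow the generic database conventions as of this writing, and the products differ only in the attribute values:

```go
// Generic conventions covering multiple products: the attribute keys are
// the generic "db" ones; Postgres vs. MySQL is just a value of db.system.
package main

import (
	"fmt"

	"go.opentelemetry.io/otel/attribute"
)

func dbAttrs(system, statement string) []attribute.KeyValue {
	return []attribute.KeyValue{
		attribute.String("db.system", system),       // "postgresql", "mysql", ...
		attribute.String("db.statement", statement), // same key for every product
	}
}

func main() {
	fmt.Println(dbAttrs("postgresql", "SELECT 1"))
	fmt.Println(dbAttrs("mysql", "SELECT 1"))
}
```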
> I suspect what folks are really after by trying to codify these in semantic conventions is some way to offer stronger stability guarantees to their users. But I think this is folly. Consider the java micrometer shim. We could inspect the metrics produced by the shim and come up with a list of metrics to add to the semantic conventions, but the names of the metrics we codify could change with the next release of micrometer. This situation isn't very different than trying to monitor kafka brokers with the collector. The kafka semantic conventions can't stop kafka from adding, removing, or changing the metrics in a future version.
One tangential way to help with this is Telemetry Schemas. By formalizing these changes in Schemas, we make them more manageable. What we are likely missing here is the ability to derive and extend Schemas from Otel Schemas. This way, Kafka can independently maintain, evolve, and publish its own schema that is derived from the Otel schema (which includes generic conventions about messaging systems).
> I don't think the idea that "all telemetry produced by any opentelemetry component should be codified in semantic conventions" is tractable.
I agree. It is an impossible mountain to climb.
> I think we should limit the scope of semantic conventions to govern instrumentation written at the source. If you can directly instrument, inject instrumentation, or wrap a library to add instrumentation, that should be codified in the semantic conventions. If you're simply bridging data from some other data model, you should not try to codify that in the semantic conventions. If users want stronger stability guarantees about bridged data, make guarantees about the translation semantics.
And also guarantees about how that bridged (foreign?) data evolves over time, aka derived Schemas.
> One tangential way to help with this is Telemetry Schemas. By formalizing these changes in Schemas, we make them more manageable. What we are likely missing here is the ability to derive and extend Schemas from Otel Schemas.
I would second this. In Azure SDK, we're trying to align with OTel conventions, but no matter how hard we try, we'll have some extensions (attributes, additional spans, events, and metrics).
With the current state of affairs, I can get all or nothing:
The perfect solution, as I see it:
> In some cases the more generic conventions will actually be sufficient to express what the more specific convention aims to capture. Yes, unfortunately, writing generic semantic conventions may be more difficult: you have to be a subject matter expert on a class of products, not just one product, to know how to generalize correctly.
I overall agree with this @tigrannajaryan - but at the same time I'm afraid nothing will get done, given the high bar; i.e., nobody will write, review, or approve conventions, nor continue working on them, given this barrier. So we should try to find some balance and be flexible.
I had mentioned that we should try to define conventions for metrics that are dead obvious, while keeping the rest product specific as a starting point, and I still think that's the way to go.
@djaglowski I like your proposal as a starting point. One thing I'd like to see is that OTel components (receivers, instrumentation) include a very short but clear document listing the metrics they emit, so at least that's very visible.
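A sketch of what such a short per-component metrics document could look like, expressed here as a small Go manifest so it stays diffable in review and can be rendered to markdown; the shape and the example metric names are my assumptions, loosely modeled on the Collector's PostgreSQL receiver:

```go
// Hypothetical "metrics emitted by this component" manifest, kept next to
// the component so the list is visible and reviewable.
package main

import "fmt"

type metricDef struct {
	Name        string
	Unit        string
	Stability   string // e.g. "experimental" until conventions exist
	Description string
}

// emitted is the component's self-declared metric list (names illustrative).
var emitted = []metricDef{
	{"postgresql.backends", "{connections}", "experimental", "Number of active backends."},
	{"postgresql.commits", "{commits}", "experimental", "Number of transactions committed."},
}

func main() {
	for _, m := range emitted {
		fmt.Printf("- %s (%s, %s): %s\n", m.Name, m.Unit, m.Stability, m.Description)
	}
}
```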
If I can recap the discussion that I've seen up to this point, I think I see the following open questions:
Perhaps I'm expanding this too far, but I think the first and most important question for this issue is:
Should we provide a new location for projects to define their Metric Definitions (stability, names, labels) that is not the overall semantic conventions / otel specification?
I believe the answer to this is yes, in which case we basically need:
Is this an appropriate summary?
So in the latest SIG meeting we discussed and agreed that this is the direction we want to go.
So, to follow up, that means we should (likely) have the following sub-projects here:
If we agree, I'd like to file three sub-issues, one for each of these (to hopefully get multiple owners for the different pieces), where we can solve each issue.
@djaglowski I think your proposal is best situated around defining WHAT telemetry is produced, and a consistent way of doing that across the board. Would you be willing to lead that component/proposal?
> @djaglowski I think your proposal is best situated around defining WHAT telemetry is produced, and a consistent way of doing that across the board. Would you be willing to lead that component/proposal?
@jsuereth, yes, I can take the lead on this component. I can get started on this next week.
I put some more thought into this. I think we can do this (apologies for a long list, but I think this is all needed):

- Otel semconv continues to be published at `https://opentelemetry.io/schemas/*`, the root of its schema family. Portions of the name space can be allocated to other groups: e.g. `java.*` may be allocated to the Java SIG, and the SIG can define `java.jvm.foo` as a metric name or `java.gc.bar` as a span attribute name.
- A Derived semconv has a `based_on` setting to point to the Schema URL of the Base semconv. `based_on` can be empty, indicating a root of a lineage of semconvs; otherwise it is a `based_on: https://opentelemetry.io/schemas/<version_number>` entry.
- `based_on` may only reference a Schema URL that belongs to the same Schema Family (changing families is prohibited).
- When a Derived semconv adopts a newer version of its Base, it publishes a new version of itself, with the `based_on` setting reflecting the new version of the Base it now uses.
- This allows schema conversion across a lineage: e.g. take telemetry that comes with Schema URL `http://opentelemetry.io/collector/schemas/1.5.0`, which is based on `http://opentelemetry.io/schemas/1.1.0`, and convert it to `http://opentelemetry.io/collector/schemas/1.2.0`. Essentially, an entire family of independently evolved libraries can emit telemetry and we can normalize it to a particular version of the Otel semconv. This is very important for recipients of telemetry, who only need to know about the root family of any lineage and don't need to know anything specifically about Derived Schemas, except being able to fetch their Schema Files.

This definitely needs to become an OTEP (also to confirm that some of my speculations above are really doable), but before doing that I wanted to post this to see what others think.