open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

What is the implication of breaking changes to stable semantic conventions #772

Open pyohannes opened 4 months ago

pyohannes commented 4 months ago

As semantic conventions for HTTP are declared stable (and hopefully messaging and databases will follow soon), certain changes outlined in the version and stability document are prohibited.

For metrics, those prohibited changes also include adding attributes to an existing (stable) metric (see #722 for further discussion). As this drastically limits the extensibility of existing stable metrics, the question came up in discussions around messaging and database metrics whether there should be a defined process for such breaking changes (for example, adding attributes to an existing stable metric).
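To make concrete why adding an attribute counts as breaking, here is a minimal sketch that models a time series as the combination of a metric name and its full attribute set. The metric name `example.request.count` and the attributes `operation` and `destination` are hypothetical and not taken from any semantic convention.

```python
from collections import Counter

def series_key(metric, attributes):
    """Identify a time series by metric name plus its full attribute set."""
    return (metric, frozenset(attributes.items()))

def record(points):
    """Sum measurement values per time series."""
    totals = Counter()
    for metric, attributes, value in points:
        totals[series_key(metric, attributes)] += value
    return totals

# Hypothetical measurements before the change: a single attribute.
before = record([
    ("example.request.count", {"operation": "publish"}, 1),
    ("example.request.count", {"operation": "publish"}, 1),
    ("example.request.count", {"operation": "publish"}, 1),
])

# The same measurements after a hypothetical "destination" attribute is added.
after = record([
    ("example.request.count", {"operation": "publish", "destination": "a"}, 1),
    ("example.request.count", {"operation": "publish", "destination": "b"}, 1),
    ("example.request.count", {"operation": "publish", "destination": "b"}, 1),
])

old_key = series_key("example.request.count", {"operation": "publish"})
print(len(before), before[old_key])  # one series carrying all three measurements
print(len(after), after[old_key])    # two series; the old series key has no data
```

Anything that addressed the original series by its exact attribute set finds no data after the change, even though the metric name is the same.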

The following options might be possible (this is not an exhaustive list):

  1. Forbid any breaking changes.
     a. Instead of adding attributes to an existing metric, a new metric with the additional attributes would need to be added.
  2. For applying breaking changes, require a new major version for semantic conventions and consequently for instrumentation libraries implementing them.
  3. For applying breaking changes, require only a new major version for instrumentation libraries implementing them.

This question is especially important for database and messaging system-specific metrics. There is a goal of providing stable semantic conventions and instrumentation libraries for such systems. However, as those systems and their instrumentation evolve, it is doubtful whether relying on a never-changing set of metric attributes is feasible.

pyohannes commented 4 months ago

Regarding this question about extending database and messaging system-specific metrics (see also #760), the following two use cases were discussed in the semconv SIG on Feb 26th to illustrate the impact of the different options on telemetry consumers.

  1. For alerts defined on default time series, both options (2) and (3) could be breaking. While such breaking changes could certainly be indicated by a new major version of the agent or instrumentation library used, this might in fact frustrate users, as the impact of instrumentation library updates will likely not be clear to them. In the worst case the breakage will not be obvious, and existing alerts will simply stop firing as they used to.

    One possible solution would be to discourage users from defining alerts on default time series, and to strongly recommend defining alerts on aggregations over a fixed subset of attributes.

  2. Option (1) will make it much harder to build generic out-of-the-box APM experiences and dashboards based on generic messaging or database metrics and attributes. This could for example involve RED dashboards for messaging consumer scenarios (with throughput, error rate, and latency), or the error rate and latency of downstream database calls. Both those cases are standard features of existing APM solutions.

    As a possible workaround, one could take into account multiple versions of sufficiently similar metrics and aggregate those to achieve the same goal. However, this is quite tedious and in certain ways diminishes the value of semantic conventions as a unified standard.
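The workaround for option (1) could look roughly like the sketch below, in which a consumer projects an older metric and its successor onto their shared attributes before summing. The metric names, the `EQUIVALENT` mapping, and the attribute names are all hypothetical. The sketch also assumes each measurement is reported under only one of the variants; if an instrumentation emitted both, summing would double count.

```python
from collections import defaultdict

# Hypothetical mapping from each metric variant to one logical metric.
EQUIVALENT = {
    "example.consumer.errors": "consumer.errors",
    "example.consumer.errors.detailed": "consumer.errors",
}
SHARED_ATTRIBUTES = ("operation",)  # attributes present on every variant

def unify(points):
    """Sum points from equivalent metrics, keeping only the shared attributes."""
    totals = defaultdict(int)
    for metric, attributes, value in points:
        logical = EQUIVALENT.get(metric)
        if logical is None:
            continue  # not one of the variants we know how to merge
        shared = tuple(attributes.get(name) for name in SHARED_ATTRIBUTES)
        totals[(logical, shared)] += value
    return dict(totals)

points = [
    # Service still on the older instrumentation.
    ("example.consumer.errors", {"operation": "receive"}, 4),
    # Service on newer instrumentation reporting the metric with the extra attribute.
    ("example.consumer.errors.detailed", {"operation": "receive", "destination": "a"}, 3),
    ("example.consumer.errors.detailed", {"operation": "receive", "destination": "b"}, 2),
    # Unrelated metric that must not be merged.
    ("unrelated.metric", {"operation": "receive"}, 100),
]

merged = unify(points)
print(merged)
```

Every consumer would have to maintain a mapping like `EQUIVALENT` by hand, which is the tedium referred to above.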

Every outlined option breaks or impedes one of those two use cases, so we would need to decide which impediments we want to impose on telemetry consumers.
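The first use case can be illustrated with a small sketch comparing an alert on default time series with an alert on an aggregation over a fixed attribute subset. The threshold, the attribute names, and the error counts are made up for illustration, and the alert evaluation is deliberately simplified to a single threshold check.

```python
from collections import defaultdict

def aggregate(series, keep):
    """Sum series values after dropping every attribute not listed in `keep`."""
    totals = defaultdict(float)
    for attributes, value in series:
        key = tuple(sorted((k, v) for k, v in attributes.items() if k in keep))
        totals[key] += value
    return dict(totals)

def firing(values, threshold):
    """Return True if any individual value exceeds the threshold."""
    return any(value > threshold for value in values)

THRESHOLD = 10.0  # hypothetical error-count threshold

# Hypothetical error counts before and after a "destination" attribute is added.
series_v1 = [({"operation": "receive"}, 12.0)]
series_v2 = [
    ({"operation": "receive", "destination": "a"}, 6.0),
    ({"operation": "receive", "destination": "b"}, 6.0),
]

# Alert on the default time series: evaluates each exported series as is.
default_v1 = firing((value for _, value in series_v1), THRESHOLD)
default_v2 = firing((value for _, value in series_v2), THRESHOLD)

# Alert on an aggregation over a fixed attribute subset.
fixed_v1 = firing(aggregate(series_v1, {"operation"}).values(), THRESHOLD)
fixed_v2 = firing(aggregate(series_v2, {"operation"}).values(), THRESHOLD)

print(default_v1, default_v2)  # True False: the alert silently stops firing
print(fixed_v1, fixed_v2)      # True True: the aggregated alert is unaffected
```

The total number of errors is identical in both versions. Only the alert that depends on the exact shape of the exported series changes its behavior.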

arminru commented 4 months ago

One flavor that could be added to the options (2) and (3) above:

lmolkova commented 3 months ago

I believe we've also entertained an option 1.5 when it comes to adding an attribute to a metric, which is quite similar to @arminru's suggestion above:

lmolkova commented 2 months ago

Related discussion https://github.com/open-telemetry/semantic-conventions/pull/675#discussion_r1538357461:

SemVer 2.0 is specific to public APIs, and semantic conventions do not strictly fit into this category. It's common practice to make some behavior-breaking changes without a major version update, e.g. certain bug fixes or improvements.

Some changes bring significant improvement to user experience, e.g. adding a route/operation name to HTTP client metric and changing the span name accordingly. By making it opt-in we're effectively keeping the absolute majority of users from leveraging it (because it's hard to discover a flag). Current and future users would have a subpar experience only because of our back-compat concerns.

While the decision in each case can vary, we should have a process to allow some low-risk/high-reward breaking changes without a major version update. This process should involve weighing pros and cons.