open-telemetry / semantic-conventions

Defines standards for generating consistent, accessible telemetry across a variety of domains
Apache License 2.0

Why is adding new attributes to a metric a breaking change? #722

Open alanwest opened 5 months ago

alanwest commented 5 months ago

Is there a clear rationale documented somewhere for why adding new attributes to a metric is a breaking change?

This PR comment is one of the earliest assertions that I'm aware of that adding attributes is breaking, but it has remained unclear to me whether folks have a shared understanding of why.

ADDITIVE changes to metrics can break user alerts by creating new timeseries and adjusting thresholds. I think we need to be a lot more careful around changes, particularly around metrics here.

I do not personally have experience with a backend where this is true, but if it is true for some backend, I think it would be useful to articulate this somewhere.

trask commented 5 months ago

I agree it would be great to document the rationale since this has been surprising to many people.

In the meantime, I think the best we have is @jsuereth's document https://docs.google.com/document/d/1Nvcf1wio7nDUVcrXxVUN_f8MNmcs0OzVAZLvlth1lYY/edit#heading=h.58b8u7y5tsmu, see the "Instability: Alerting w/ Metric Attribute changes" section

pyohannes commented 5 months ago

It would be great to clarify that. If we applied that rule strictly, wouldn't we need to say that instrumentations MUST NOT add any attributes to metrics beyond those covered by the specification?

For example, if ASP.NET Core 8 adds an additional attribute to http.server.request.duration, wouldn't that be breaking in the same sense, as it adds a new time series?

  • Additional attributes:

    • The aspnetcore.request.is_unhandled boolean attribute is reported when the request was not handled by the application pipeline. It's skipped otherwise.
pyohannes commented 5 months ago

That's the exact wording from the spec:

In addition to the 3 types of changes described above there are certain types that are always allowed. Such changes do not need to be described (and are not described) by schema files. Here is the list of such changes: [...]

  • Adding new attributes to existing metrics that do not "break apart" existing timeseries, such that alert thresholds would break / need to change.
lmolkova commented 5 months ago

@pyohannes adding an attribute to an existing metric that the application has already reported is breaking, but adding new attributes to a metric the application has never reported is not.

E.g. when my application goes from no-routing (http.route is never reported) to routing, it's not a breaking change.

Or when my application that never reported any errors (lucky!) starts to report some errors in error.type, this doesn't change the metric definition, and it's my application's problem that my dashboards are wrong (e.g. counting throughput without checking for success).

I.e. the breaking change (that we want to prevent) happens when an existing time series splits, not when data that has never been seen before arrives with new dimensions.
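A rough sketch of the distinction, with hypothetical Prometheus-style series (names and values are illustrative, not from the conventions):

# Breaking: the instrumentation adds attr3, so an existing series splits.
# before:  myMetric{attr1="a"}              <- alerts/dashboards watch this series
# after:   myMetric{attr1="a", attr3="x"}   <- the old series stops receiving data
#
# Not breaking: new data arrives that was simply never seen before.
# before:  myMetric{http_route="/users"}
# after:   myMetric{http_route="/users"}    <- unchanged
#          myMetric{error_type="timeout"}   <- new series for a new situation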

pyohannes commented 5 months ago

E.g. when my application goes from no-routing (http.route is never reported) to routing, it's not a breaking change.

Or when my application that never reported any errors (lucky!) starts to report some errors in error.type, this doesn't change the metric definition, and it's my application's problem that my dashboards are wrong (e.g. counting throughput without checking for success).

What if I switch from .NET 6 to .NET 8, and aspnetcore.request.is_unhandled shows up? Would this classify as breaking change?

joaopgrassi commented 5 months ago

What if I switch from .NET 6 to .NET 8, and aspnetcore.request.is_unhandled shows up? Would this classify as breaking change?

Reading the description of that attribute:

The aspnetcore.request.is_unhandled boolean attribute is reported when the request was not handled by the application pipeline. It's skipped otherwise.

I'm inclined to say it's not a breaking change, but I may be wrong as I haven't looked at how it will be booked. If I get it right, http.server.request.duration is always booked, but when the request is not handled the attribute is added, otherwise it is not. So imagine I have an alert/dashboard based on this metric and I don't aggregate it by any attribute, just by metric name. Since the metric is always booked, I will still get the same value when looking at it. There's an additional timeseries when the attribute is added, but if I'm not looking at that attribute, then it should still be OK.
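Concretely, an alert that aggregates across all attributes keeps returning the same value when the new attribute shows up. A minimal PromQL sketch, assuming the histogram is exported to Prometheus as http_server_request_duration with the usual _count series (the exact exported names depend on the exporter and may include a unit suffix):

# Total request rate, summed across every label combination.
# A new label like aspnetcore_request_is_unhandled only adds series
# that this sum folds back together, so the result is unchanged.
sum(rate(http_server_request_duration_count[5m]))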

Where I think adding a new attribute becomes a problem is what Josh described in his document (the one @trask linked above):

[diagram from the linked document: a metric booked with attr1 and attr2 producing one time series on the left, and the same metric splitting into a second time series once a new attribute is added on the right]

Imagine you have an alert based on a metric that is booked by an instrumentation with two attributes, attr1 and attr2 - that is the left side of the diagram. At this point in time we have 1 time series:

myMetric.add(1, attr1, attr2)

Then the instrumentation library adds a new attribute at the same call site where it used to book only the two attributes. Now we have 2 time series: the first with 2 attributes and the new one with 3.

The problem, and why it is breaking, is that the library will not book data to the first time series (with only 2 attributes) anymore. It will only book to the new time series, with 3 attributes. This is when my alert breaks, because it won't receive any more data.

- myMetric.add(1, attr1, attr2)
+ myMetric.add(1, attr1, attr2, attr3)

One way to "fix" this problem and make adding new attributes backwards compatible is for instrumentations to keep booking the metric with the previous attributes and additionally book it with the new one:

  myMetric.add(1, attr1, attr2)
+ myMetric.add(1, attr1, attr2, attr3)
pyohannes commented 5 months ago

Thanks @joaopgrassi, I understand better now.

One way to "fix" this problem and make adding new attributes backwards compatible is for instrumentations to keep booking the metric with the previous attributes and additionally book it with the new one.

This would need a completely different metric name, otherwise you'll get wrong (double-counted) values when aggregating time series.
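To sketch why, using the hypothetical myMetric from above: if every measurement is booked twice under the same metric name, any attribute-agnostic aggregation counts it twice.

# Each request now increments two series of the same metric:
#   myMetric{attr1="a", attr2="b"}
#   myMetric{attr1="a", attr2="b", attr3="c"}
# so aggregating over all series double-counts it:
sum(rate(myMetric[5m]))   # reports twice the real request rate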

I do not personally have experience with a backend where this is true, but if it is true for some backend, I think it would be useful to articulate this somewhere.

Understanding the problem a bit better now, I realize this can possibly break for any backend that supports proper PromQL.

Given you have a metric myMetric with attributes attr0, attr1, and attr2, you can aggregate by attr1 and attr2 with this query:

sum without (attr0) (myMetric)

Now, given that you introduce an additional attribute attr3 (as in the example @joaopgrassi gave), the time series returned by the above query will split, as aggregation now happens along attr1, attr2, and attr3.
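To make the split concrete, here is a hypothetical result of the query above (values are illustrative):

sum without (attr0) (myMetric)

# before attr3 is introduced: one series per (attr1, attr2) combination
{attr1="a", attr2="b"}               42

# after attr3 is introduced: that series splits in two
{attr1="a", attr2="b"}               30    <- measurements without attr3
{attr1="a", attr2="b", attr3="x"}    12    <- measurements carrying attr3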

Consequently, this is an issue at least for backends supporting PromQL, which means it isn't just a niche problem.

jsuereth commented 5 months ago

Should we migrate this rationale description into some kind of guidance on semconv? I'd love to not repeat this discussion every time someone unfamiliar with Prometheus/PromQL (and its ilk) joins semconv....

Let me know where and we can migrate the google doc @trask linked to there.

lmolkova commented 5 months ago

E.g. when my application goes from no-routing (http.route is never reported) to routing, it's not a breaking change. Or when my application that never reported any errors (lucky!) starts to report some errors in error.type, this doesn't change the metric definition, and it's my application's problem that my dashboards are wrong (e.g. counting throughput without checking for success).

What if I switch from .NET 6 to .NET 8, and aspnetcore.request.is_unhandled shows up? Would this classify as breaking change?

It could be, but I think the otel-dotnet folks made some effort to keep the .NET 6 and .NET 8 metrics consistent before the HTTP instrumentation went stable. Given that 6 -> 8 is a major version update of the platform, breaking changes are to be expected as long as they are documented.

I think the breaking changes we should focus on in semconv are instrumentation library updates, when the user didn't do anything in their code that would provoke them.

E.g. if I switch from one messaging system to another, I will need to update my dashboards and alerts. Or if I start using a new API that emits additional attributes on the same metric (e.g. I used the messaging receive API and now I use the processing API). I don't see how we can prevent breakage like this.

joaopgrassi commented 5 months ago

Should we migrate this rationale description into some kind of guidance on semconv? I'd love to not repeat this discussion every time someone unfamiliar with Prometheus/PromQL (and its ilk) joins semconv....

💯%

A good place for it is https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/metrics.md

alanwest commented 5 months ago

@lmolkova touched on an example (https://github.com/open-telemetry/semantic-conventions/issues/722#issuecomment-1936341059) that I want to dig into a bit deeper because I'm still confused :confused: even after reading @jsuereth's doc and the discussion here...

Or when my application that never reported any errors (lucky!) starts to report some errors in error.type, this doesn't change the metric definition, and it's my application's problem that my dashboards are wrong (e.g. counting throughput without checking for success).

Making this a bit more concrete with some pseudocode, I'm confident that the community has shipped stable HTTP instrumentation across a number of languages that looks something like the following:

func RecordResponseTime(duration, httpRoute, statusCode, errorType) {
   var histogram = meter.CreateHistogram("http.server.request.duration");

   if (statusCode < 500) {
      histogram.Record(duration, {
         "http.route" = httpRoute,
         "http.response.status_code" = statusCode
      })
   } else {
      histogram.Record(duration, {
         "http.route" = httpRoute,
         "http.response.status_code" = statusCode,
         "error.type" = errorType
      });
   }
}

Say my application has been running flawlessly for 10 days and then on the 11th day some percentage of requests experience errors. Both this example and the scenario of upgrading instrumentation which adds an attribute to an existing metric seem to share the same risk. Don't they both have the potential of breaking my dashboards and alerts when using Prometheus/PromQL because there are now at least two timeseries?

Back in the day, before OpenTelemetry, if I was instrumenting my code with a Prometheus client library, would I be expected to have written my instrumentation ensuring the metric always recorded the same set of attributes? Something like:

func RecordResponseTime(duration, httpRoute, statusCode, errorType) {
   var histogram = meter.CreateHistogram("http.server.request.duration");

   if (errorType == null) {
      errorType = "unset";
   }

   histogram.Record(duration, {
      "http.route" = httpRoute,
      "http.response.status_code" = statusCode,
      "error.type" = errorType
   });
}

It's true, I'm not an expert with Prometheus/PromQL, but I'm surprised this is a problem that needs to be addressed by instrumentation authors rather than alert/dashboard authors. Is it difficult or unusual practice to write alerts that are not brittle using PromQL? For example, if I wanted to alert when overall request latency exceeds a threshold, it seems like it would be a bug not to craft the alert to aggregate all timeseries into one.
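Concretely, such an alert could be written against the fully aggregated histogram. A PromQL sketch, assuming the exported name http_server_request_duration and a 500 ms threshold picked purely for illustration:

# p90 latency across all routes, status codes, and any future attributes;
# only `le` is kept, so new labels are folded away before the quantile.
histogram_quantile(
  0.9,
  sum by (le) (rate(http_server_request_duration_bucket[5m]))
) > 0.5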

I have to imagine I'm still missing something...

pyohannes commented 5 months ago

It's true, I'm not an expert with Prometheus/PromQL, but I'm surprised this is a problem that needs to be addressed by instrumentation authors rather than alert/dashboard authors.

This is more about a contract between instrumentation authors and alert/dashboard authors, the question being whether instrumentations provide a fixed or an extensible set of metric attributes.

Is it difficult or unusual practice to write alerts that are not brittle using PromQL?

I would say no. The idea of defining an alert that doesn't aggregate on a fixed set of attributes didn't occur to me before this discussion, but I'm not sure this can be generalized. It is certainly possible to define alerts that are "brittle" in that sense.
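For illustration, with the hypothetical myMetric from earlier, the second query below is "brittle" in that sense while the first is not:

# Pins the label set explicitly; unaffected when attr3 appears.
sum by (attr1, attr2) (rate(myMetric[5m])) > 100

# Keeps every label except attr0; when attr3 appears, each alerting
# series splits, per-series rates drop, and the threshold may stop firing.
sum without (attr0) (rate(myMetric[5m])) > 100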

lmolkova commented 5 months ago

@alanwest

The assumption is that users should write alerts/build dashboards based on the metric definition rather than the actual data. I.e. I can write

histogram_quantile(
  0.9,
  sum by(le, http_route, error_type) (
    rate(http_server_request_duration_bucket[5m])
  )
)

just based on the md/yaml files before any actual data flows into my backend.

jsuereth commented 5 months ago

Both this example and the scenario of upgrading instrumentation which adds an attribute to an existing metric seem to share the same risk. Don't they both have the potential of breaking my dashboards and alerts when using Prometheus/PromQL because there are now at least two timeseries?

So actually, missing labels in Prometheus are modeled as empty labels. In your example, setting "error.type" to "unset" is achieved by just not passing error.type: every metric point effectively "fills out" missing labels with empty strings in the data model.

As Liudmila points out, always grouping (sum by (labels) (metric)) is a good way to avoid breakage on new labels. However, in this case you won't actually see a breakage, because the "counts" from errors never make it into the same timeseries as the "counts" from non-errors. By design, these are two separate timeseries. You can join them with a by grouping to get a total count, but having them as separate timeseries allows us to do things like check the error rate %, etc.
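For instance, keeping errors in their own timeseries makes an error-rate query straightforward. A PromQL sketch, assuming the exported name http_server_request_duration and that only error series carry a non-empty error_type label:

# Error rate %: series that carry error_type, over all requests.
100 *
  sum(rate(http_server_request_duration_count{error_type!=""}[5m]))
/
  sum(rate(http_server_request_duration_count[5m]))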

pyohannes commented 4 months ago

Should we migrate this rationale description into some kind of guidance on semconv? I'd love to not repeat this discussion every time someone unfamiliar with Prometheus/PromQL (and its ilk) joins semconv....

💯%

A good place for it is https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/metrics.md

We discussed this again today in the messaging workgroup, and we agreed it would be great to add some reasoning around this to the semantic conventions.