open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0

Multi-gauge for metrics at API and Transport level #3949

Closed sunng87 closed 5 months ago

sunng87 commented 6 months ago

What is Multi-gauge

A multi-gauge is a gauge that carries one or more value fields, each with its own name. The values in a multi-gauge record share the same timestamp and attributes.

In the current OTLP metrics spec, a gauge data point carries exactly one value. The whole observability ecosystem has been using this protocol for years. However, in some cases several gauge values are sampled at the same time and share most of their tags. For instance, when collecting CPU usage there are values for different modes (user, sys, io-wait, etc.). These values are sampled at the same time and share the timestamp and most other attributes.
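
To make the difference concrete, here is a minimal sketch in Python. It is not an existing OpenTelemetry API: `GaugePoint`, `MultiGaugePoint`, `flatten`, the metric name `cpu.usage`, and the `mode` attribute key are all hypothetical names chosen for illustration. It shows one multi-gauge record and the single-valued points it would have to be expanded into under today's model, where the timestamp and shared attributes are repeated for every value.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class GaugePoint:
    """One single-valued gauge data point (the shape available today)."""
    name: str
    time_unix_nano: int
    attributes: Dict[str, str]
    value: float


@dataclass(frozen=True)
class MultiGaugePoint:
    """Hypothetical multi-gauge record: named values sharing one timestamp and one attribute set."""
    name: str
    time_unix_nano: int
    attributes: Dict[str, str]
    fields: Dict[str, float]


def flatten(point: MultiGaugePoint, field_key: str = "mode") -> List[GaugePoint]:
    """Expand a multi-gauge record into single-valued points.

    Each field name becomes an extra attribute, so the timestamp and the
    shared attributes are duplicated once per field.
    """
    return [
        GaugePoint(
            name=point.name,
            time_unix_nano=point.time_unix_nano,
            attributes={**point.attributes, field_key: field_name},
            value=value,
        )
        for field_name, value in point.fields.items()
    ]


# Illustrative CPU sample: three modes observed at the same instant.
cpu = MultiGaugePoint(
    name="cpu.usage",
    time_unix_nano=1_700_000_000_000_000_000,
    attributes={"host.name": "web-1", "cpu": "0"},
    fields={"user": 0.42, "sys": 0.13, "io_wait": 0.05},
)
points = flatten(cpu)
```

In this sketch one record with three fields turns into three data points, each repeating the same timestamp and the same two shared attributes.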

Pros

Cons

Additional context.

trask commented 6 months ago

cc @jmacd I think you've mentioned/thought about this before?

tigrannajaryan commented 6 months ago

@jmacd @lquerel can you please comment on the "wire size" aspect and how the columnar encoding that we already have solves this and whether you see the need to have a different multigauge API?

lquerel commented 6 months ago

Sorry for the long response, but there's a lot to say on this topic.

Multi-gauge is one aspect of what I generally call multivariate metrics (a combination of several metrics, whether they are gauges, counters, histograms, etc.). The lack of native support for multivariate metrics is one of the reasons that motivated me to initiate the OpenTelemetry Protocol with Apache Arrow project in 2021. Initially, the protocol aimed to natively support multivariate metrics, but as mentioned in this issue, ideally, such native support would also require a complete overhaul of the ecosystem, i.e., native support for multivariate metrics in the SDK clients, the protocol, the collector pipeline, and the backends. To achieve a result within a reasonable timeframe, it was decided to focus solely on the protocol part at first. The way the OTel protocol with Apache Arrow aims to address this issue can be summarized by the following steps:

To conclude, I believe there must be native and end-to-end support for multivariate metrics. It's not easy because there's an existing ecosystem that needs to be advanced in this direction, but there are efforts underway to improve the situation.

EDIT: added links and comment on wire-size.

sunng87 commented 6 months ago

I just heard of the otel-arrow project this morning and had a quick look at its readme. The third goal:

Extend OpenTelemetry data model with native support for multi-variate metrics.

is exactly what I want to achieve with this issue. And the columnar approach with the Arrow format can surely reduce wire size and improve ingestion speed for the ecosystem.

However, reading about the weaver approach, I feel it's a complete overhaul of the current OTel metrics data model. I'm afraid we have a long way to go if the multi-variate transition has to start from a totally new wire protocol and upper-level API. The whole ecosystem could take years to adopt it. (But I still like the idea of strongly typed metrics; the whole downstream ecosystem, including dashboarding and alerting, can benefit from it.)

What if we start with a new type like MultiGauge in the current data model, API, SDK, and OTLP? It could be more approachable, and it would smooth our eventual switch to an Arrow-based transport.

lquerel commented 6 months ago

@sunng87

When I was talking in step 2 about:

Another possible approach is to adapt the existing generic client SDKs to natively report multivariate metrics...

I was referring to an approach that involves extending generic client SDKs and expanding the OTLP protocol with the concept of multivariate metric (multigauge is too restrictive a concept, in my opinion). I didn't follow this approach for various reasons (i.e. performance, SDK usability), but if you or someone else is willing to update the existing client SDKs and OTLP, then I would be pleased to assist with the integration into the OTel with Apache Arrow protocol (referred to as OTAP later).

Note that it's still a long process to achieve because the adaptation of all the client SDKs, the receivers, processors, and exporters is not a small endeavor, and I'm not totally convinced that the approach I'm following will necessarily take much longer. However, as I mentioned, I'm willing to help on the OTAP adaptation and on the specification of the multivariate metric model if this parallel path is retained.

sunng87 commented 6 months ago

Thank you for the explanation @lquerel. I'm open to both approaches and willing to help push this transition forward. I can, for example, look into adding an OTAP-native backend for greptimedb for high-performance ingestion.

I would like to hear from others in the community about our next steps for this.

jmacd commented 5 months ago

@sunng87 Welcome to the project -- I'm looking forward to future work on Arrow and OpenTelemetry integration and excited to see what you are working on!

I feel that this issue has served its purpose, so I will close it and request specific new issues be filed for some of the tangents discussed here.

The OpenTelemetry Protocol with Apache Arrow project has a way to represent multi-gauge observations, however it is built on a set of assumptions that are not very explicit in our specifications. In the OpenTelemetry metrics data model every data point has a timestamp. In the API requirements there is a specific line that was meant to help us, and we take advantage of it:

The API MUST treat observations from a single Callback as logically taking place at a single instant, such that when recorded, observations from a single callback MUST be reported with identical timestamps.

If the SDK is following this guideline, as I understand it, then every data point written by a callback will be identifiable as being part of a multi-observation, and we should not require any new APIs for asynchronous instruments to emit multiple gauges. On the other hand, we do not have a synchronous API for multiple events in request context, which is something that OpenCensus supported. I've considered the idea of a batch synchronous metrics API but it's only narrowly useful and I believe that there are issues of greater importance for OTel metrics.
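
As a rough illustration of the point about identical timestamps, the sketch below regroups single-valued gauge points into multi-value rows by matching on timestamp and attribute set. It is only an assumption about how a consumer could do this, not a description of how OTel with Apache Arrow actually does it; `Point`, `group_observations`, and the metric names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each point is (metric_name, time_unix_nano, attributes, value).
Point = Tuple[str, int, Dict[str, str], float]
# A row is identified by its timestamp plus its sorted attribute pairs.
RowKey = Tuple[int, Tuple[Tuple[str, str], ...]]


def group_observations(points: List[Point]) -> Dict[RowKey, Dict[str, float]]:
    """Group points that share a timestamp and attribute set into one row.

    Points reported by the same callback carry identical timestamps, so
    they end up in the same row, keyed by metric name.
    """
    rows: Dict[RowKey, Dict[str, float]] = defaultdict(dict)
    for name, time_unix_nano, attributes, value in points:
        key = (time_unix_nano, tuple(sorted(attributes.items())))
        rows[key][name] = value
    return dict(rows)


T0 = 1_700_000_000_000_000_000  # illustrative callback timestamp
host = {"host.name": "web-1"}

observed: List[Point] = [
    ("cpu.user", T0, host, 0.42),
    ("cpu.sys", T0, host, 0.13),
    ("cpu.io_wait", T0, host, 0.05),
    # A later collection cycle produces a different timestamp.
    ("cpu.user", T0 + 10_000_000_000, host, 0.40),
]
rows = group_observations(observed)
```

Here the three points sharing `T0` collapse into one row with three named values, while the later point forms a row of its own. The grouping relies entirely on the SDK honoring the identical-timestamp requirement quoted above.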

Since you mentioned Gauge-Histograms, separately, I would like to refer to some older and closed issues on the topic. I think we'll probably be able to find common ground here. https://github.com/open-telemetry/opentelemetry-proto/issues/274 and https://github.com/open-telemetry/opentelemetry-proto/issues/308.