open-telemetry / opentelemetry-proto

OpenTelemetry protocol (OTLP) specification and Protobuf definitions
https://opentelemetry.io/docs/specs/otlp/
Apache License 2.0

Decouple Number Type from Point Kind in OTLP protocol #264

Closed victlu closed 3 years ago

victlu commented 3 years ago

The OTLP message Metric has the following definition for instrument types...

spec

  oneof data {
    IntGauge int_gauge = 4;
    DoubleGauge double_gauge = 5;
    IntSum int_sum = 6;
    DoubleSum double_sum = 7;
    IntHistogram int_histogram = 8;
    DoubleHistogram double_histogram = 9;
    DoubleSummary double_summary = 11;
  }

I propose we only define Instrument Types (i.e. Gauge, Sum, etc...) at the message Metric level.

The data type (i.e. int, double, etc...) should be lower level, maybe at the message DataPoint level.

i.e.

  message DataPoint {
    // ...

    // value itself.
    // sfixed64 value = 4;     <= Replace

    oneof value {
       sfixed64 int_value = 4;
       double double_value = 5;
    }
  }
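
The decoupling proposed above could be modeled on the consumer side roughly like this. This is a hypothetical Python sketch (class and field names are illustrative, not from the proto): the instrument kind lives on the metric, while each point's value is either an int or a double, mirroring the proposed `oneof value`.

```python
from dataclasses import dataclass
from typing import List, Union

# Hypothetical in-memory model of the proposal: the instrument kind
# (gauge, sum, ...) lives on the Metric; the numeric type lives on the point.
@dataclass
class DataPoint:
    time_unix_nano: int
    value: Union[int, float]  # mirrors `oneof value { sfixed64 ...; double ...; }`

@dataclass
class Metric:
    name: str
    kind: str                 # "gauge", "sum", "histogram", ...
    points: List[DataPoint]

m = Metric(name="http.requests", kind="sum",
           points=[DataPoint(1, 10), DataPoint(2, 10.5)])

# A consumer branches on the runtime type of each value, the way a
# protobuf consumer would branch on which oneof field is set.
kinds = [type(p.value).__name__ for p in m.points]
print(kinds)  # -> ['int', 'float']
```
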
rakyll commented 3 years ago

This would require at least one data point to arrive in order for exporters to create metric descriptors in the backend. What are the use cases where the current data model is not working for you?

jmacd commented 3 years ago

I've proposed similar, #172, but at this point I'm happy with what we have.

bogdandrutu commented 3 years ago

@victlu the problem with this approach is that systems which consider the data point kind part of a metric's identity will no longer be able to express that. For example, Stackdriver in GCP does this: it considers an int sum different from a double sum, and it needs to know whether the metric will have one or the other.

victlu commented 3 years ago

@bogdandrutu, In your case, I would think it serves them better to name their metrics explicitly (since they want two time series) rather than distinguishing them by data type.

I think this question is related to #1366. How do we consider the "name" to be unique in identifying a time series.

victlu commented 3 years ago

This would require at least one data point to arrive in order for exporters to create metric descriptors in the backend. What are the use cases where the current data model is not working for you?

@rakyll Why can't metric descriptors be created (without data type)? Or are you saying a descriptor MUST include the data type? In which case, that's the argument I'm making here...

IMHO, metric backend systems ultimately show/graph/alert/store numeric time series. For performance and efficiency, they want to use the most compact representation of the "number", so promotion of data types is common, especially for storage considerations. The important thing is that it's just a number, regardless of whether it's transmitted as a byte, short, ushort, int, uint, long, ulong, float, double, etc.
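The promotion described above is routine, though not always exact: a double carries only 53 mantissa bits, so very large 64-bit counts don't survive an int-to-double round trip. A quick illustration (plain Python, no assumed APIs):

```python
# Typical counter values survive an int -> double round trip exactly,
# but doubles carry only 53 mantissa bits, so promoting very large
# 64-bit counts is lossy.
small = 123_456_789
assert float(small) == small   # exact for ordinary magnitudes

big = 2**53 + 1                # first integer a double cannot represent
print(float(big) == big)       # -> False
print(int(float(big)))         # rounds down to 2**53 = 9007199254740992
```
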

bogdandrutu commented 3 years ago

Or are you saying a descriptor MUST include the data type? In which case, that's the argument I'm making here...

If the type is included in the metric descriptor/name, what is the benefit of doing what you suggested? Also, your suggestion possibly adds an extra check: you would need to verify that the type in the descriptor matches the type in the points.

On top of everything, that adds extra allocations on serialization/deserialization from proto bytes.

rakyll commented 3 years ago

@victlu Some backends like Stackdriver see the type as an integral part of the metric. See the value_type at https://cloud.google.com/monitoring/api/ref_v3/rpc/google.api#google.api.MetricDescriptor. This model gives us an easy/cost-effective way of identifying the data type at metric creation. These backends will reject the write if you send a floating-point number to an integer metric, and it would require validation of the data points in the exporter, as @bogdandrutu suggests.

victlu commented 3 years ago

I am trying to understand, but I have a few questions...

From the last few responses, it seems we are specializing the OTLP protocol for a specific vendor's backend implementation. I assume we would do this for performance? If so, we should call out which backend system we are optimizing for.

What happens if a vendor does not need to separate ints and doubles? It would then need to merge them back together, which does not seem performant either.

I would think the spec should stay "higher-level", conveying the meaning and semantics of metrics. The data type of data points seems more like a detail.

Looking at a few existing client APIs (e.g. Prometheus, Micrometer), I don't see Counters/Gauges/Histograms/etc. decorated with a data type (i.e. IntCounter/DoubleCounter/etc.), or they just take a double to cover both ints and doubles. What should we do here?

How will this work if I'm using a type-less language (e.g. JavaScript)? I assume I would need to separate ints/doubles/etc. into different metrics?

I'm also not suggesting every data point needs to be type-less. I just don't see the data type belonging at the topmost level alongside semantics like Sum/Gauge/Histogram/etc. Maybe we have oneof IntTimeSeries/DoubleTimeSeries under the oneof Sum/Gauge/Histogram/etc.

victlu commented 3 years ago

I propose we change to the following...

message Metric {
  // ...

  oneof data {
    Gauge gauge = 4;
    Sum sum = 5;
    Histogram histogram = 6;
    Summary summary = 7;
  }
}

message Gauge {
  repeated IntDataPoint int_data_points = 1;
  repeated DoubleDataPoint double_data_points = 2;
}

message Sum {
  // aggregation_temporality describes if the aggregator reports delta changes
  // since last report time, or cumulative changes since a fixed start time.
  AggregationTemporality aggregation_temporality = 2;

  // If "true" means that the sum is monotonic.
  bool is_monotonic = 3;

  repeated IntDataPoint int_data_points = 4;
  repeated DoubleDataPoint double_data_points = 5;
}

message Histogram {
  // aggregation_temporality describes if the aggregator reports delta changes
  // since last report time, or cumulative changes since a fixed start time.
  AggregationTemporality aggregation_temporality = 2;

  repeated IntDataPoint int_data_points = 3;
  repeated DoubleDataPoint double_data_points = 4;
}

message Summary {
  repeated DoubleDataPoint double_data_points = 1;
}
bogdandrutu commented 3 years ago

For Summary see https://github.com/open-telemetry/opentelemetry-proto/pull/269 and for Histogram see https://github.com/open-telemetry/opentelemetry-proto/pull/270

bogdandrutu commented 3 years ago

@victlu I kind of like your proposal, but I would like to understand how you see it as fixing an issue. I think this is a nice cosmetic improvement, but I am not clear whether it solves a real problem or is just a cosmetic improvement :)

Update: @victlu how do we deal with a case where both repeated fields contain values?

bogdandrutu commented 3 years ago

@jmacd what do you think about this improvement in readability https://github.com/open-telemetry/opentelemetry-proto/issues/264#issuecomment-789241623 ?

victlu commented 3 years ago

The change I'm trying to effect is to decouple the semantics of an instrument (i.e. Counter vs Gauge vs Histogram) from the details of the data type (byte/int/long/double/float/etc.). I think it would have several beneficial consequences.

victlu commented 3 years ago

Separately, I think we should group by Labels first before we have DataPoints to minimize on buffer size. I'll file a separate issue for this.

jmacd commented 3 years ago

@jmacd what do you think about this improvement in readability?

I like it, but it raises the question you asked.

@victlu how do we deal with a case where both repeated fields contain values?

That's a tough one. We discussed in yesterday's data model meeting that SDKs should not generate conflicting types (loosely defined), and we discussed that we would like the collector to pass-through conflicting types (again loosely defined) as if they were separate metrics. This appearance of both double and integer repeated fields makes it possible to have a single Metric that looks "not separate", in the sense that it contains multiple number types. I think if we allow what @victlu proposes, we should firmly state that OTel welcomes you to mix integer and double metrics--that this is not a semantic conflict--just that we would like you not to do so in the same SDK.

Naturally we could use a oneof to prevent mixing integer and double values. We did a lot of benchmarking last summer to talk ourselves out of oneof messages as much as possible.

Even when we allow mixing repeated integer- with repeated double-values, it will be difficult to work with this mixture of values without combining and interleaving them (sort by time) into an intermediate representation, which probably brings up the problem discussed next.
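
The interleaving described above could be sketched as follows. This is a hypothetical example (the `(time_unix_nano, value)` tuples stand in for the proposed `int_data_points` / `double_data_points` fields): merge a metric's int and double point lists into one time-ordered stream before further processing.

```python
import heapq

# Hypothetical point lists as (time_unix_nano, value) pairs, standing in
# for the proposed `int_data_points` / `double_data_points` fields.
int_points = [(100, 1), (300, 2)]
double_points = [(200, 1.5), (400, 2.5)]

# Interleave by timestamp; each input list is already time-ordered, so a
# heap-based merge avoids re-sorting the whole concatenation.
merged = list(heapq.merge(int_points, double_points, key=lambda p: p[0]))
print(merged)  # -> [(100, 1), (200, 1.5), (300, 2), (400, 2.5)]
```
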

Separately, I think we should group by Labels first before we have DataPoints to minimize on buffer size. I'll file a separate issue for this.

Yes, let's keep that separate. I recall a historical conversation, but would prefer not to dig up old threads. @bogdandrutu do you have thoughts? I believe we were avoiding additional message layers where possible. The best argument in favor of the current design that I remember is that an SDK should never output more than one point per metric and label set per collection period, so the extra layer for label sets will always have exactly one entry in the encoding @victlu proposes, under a simple collection model. I haven't thought through the implications of this in the collector.

I think the greater win would be to move labels and label sets into separate regions in the Export request, with a more-compact encoding.

jsuereth commented 3 years ago

@victlu When you first mentioned this I was skeptical, but I do like the simplicity proposed here. My only concern is the amount of churn this could cause right now. Ideally this should be done via a "deprecation" and "transition" period, as we DO have implementations that folks are trying to use with metrics.

Also, to @jmacd and @bogdandrutu: IIUC the Go implementation of "oneof" is slow, right? That's not a fundamental limitation of protocol buffers AFAIK; it's basically just the same as "optional" but with some runtime checks.

jmacd commented 3 years ago

Like @jsuereth I'm worried about churn. I'd like to connect the question posed here, about reducing the number of types inside the Metric oneof with the question @bogdandrutu poses about histograms in #259. I propose we adopt the same approach for both of these, whichever it is.

On the one hand, we can have just the four semantically distinct types (Counter, Gauge, Histogram, Summary) and push all their variation into the types themselves. This is Victor's proposal for Gauge/Sum, and this is my proposal for Histogram in #272.

On the other hand, we can go with a distinct type for each variation in the Metric oneof. Right now, we have a distinct type for IntGauge and DoubleGauge, for example, and the question is whether we should keep that approach or not. If we keep that approach, then @bogdandrutu's point in #259 should be addressed and each distinct form of Histogram buckets gets its own type.

jsuereth commented 3 years ago

Agree w/ @jmacd on consistency between histogram + metric types. If we go with #272 then we should also take this proposal/direction (albeit I'd like to see a smooth transition accompany the proposal).

@victlu If you want to take a crack at what migration from existing OTLP to this proposal looks like, that'd be good. This will disrupt a lot of the ecosystem in OTel and I know there's already some users of metrics where such a large change could break them if it's not done smoothly. I realize metrics isn't marked "stable" but users are users. If you want help thinking through a smooth-transition plan, I'm happy to take time to brainstorm with you.

victlu commented 3 years ago

I'll take a crack at it. @jsuereth I'll reach out on slack as I am sure I'm not up-to-speed on all the intricacies of this task.

hdost commented 3 years ago

I'd agree with an approach similar to #272: add the new form with simple names. I assume there would need to be changes in the OTLP exporter. Should there be a related ticket for the "supported" languages? Which of those are there? Maybe a config option could be added to the exporter to switch between the forms in the short term, dual write, etc. How long a transition period would be needed? If there's supposed to be a stable model (https://github.com/open-telemetry/opentelemetry-specification/milestone/16), 1 month seems a little tight given the number of open areas.

Edit: I have created issue #275 to discuss this further, since it seems to span a few issues.

victlu commented 3 years ago

Proposal for smooth transition for this (and other) protocol changes.

Migrating OTLP protocol

Strategy

To deliver OTLP protocol changes, the following proposed steps should be followed to achieve a smooth transition/migration for all affected components.

  1. Apply changes to OTLP protocol

    • Include new OTLP messages (must be "additive"; expect no incompatibilities with existing components)
    • Comment/mark old OTLP messages "deprecated" with an expected transition date
    • Release changes
      • Coordinate multiple changes into appropriate batches as necessary
  2. Update and release all dependent components

    • Components to remove support for old messages/protocol
    • Components to support new messages/protocol
  3. Wait until after the transition date

    • Review that all affected components are updated
    • Remove old OTLP messages marked "deprecated"
    • Release final changes
  4. Repeat as necessary

For issue #264

Mark old messages as "deprecated" and add new messages...

oneof data {
    // DEPRECATED BY 4/15/2021
    IntGauge int_gauge = 4;
    DoubleGauge double_gauge = 5;
    IntSum int_sum = 6;
    DoubleSum double_sum = 7;
    IntHistogram int_histogram = 8;
    DoubleHistogram double_histogram = 9;
    DoubleSummary double_summary = 11;

    // New messages
    Gauge gauge = 12;
    Sum sum = 13;
    Summary summary = 14;
    Histogram histogram = 15;
}

Add new messages...

message Gauge {
  repeated IntDataPoint int_data_points = 1;
  repeated DoubleDataPoint double_data_points = 2;
}

message Sum {
  // aggregation_temporality describes if the aggregator reports delta changes
  // since last report time, or cumulative changes since a fixed start time.
  AggregationTemporality aggregation_temporality = 2;

  // If "true" means that the sum is monotonic.
  bool is_monotonic = 3;

  repeated IntDataPoint int_data_points = 4;
  repeated DoubleDataPoint double_data_points = 5;
}

message Histogram {
  // aggregation_temporality describes if the aggregator reports delta changes
  // since last report time, or cumulative changes since a fixed start time.
  AggregationTemporality aggregation_temporality = 2;

  repeated IntDataPoint int_data_points = 3;
  repeated DoubleDataPoint double_data_points = 4;
}

message Summary {
  repeated DoubleDataPoint double_data_points = 1;
}
jsuereth commented 3 years ago

@victlu I think the transition is a bit more dramatic here. Remember that OTLP is an in-flight protocol and deprecations roll out slowly between producers and consumers, so you need to call that out in the transition; i.e. what you define may be a little too aggressive/churny for end users.

Here's an abridged update to what you had. I think maybe this deserves a full OTEP, but for the purposes of your change we just need to agree to the general outline.

Migrating OTLP protocol

Strategy

To deliver OTLP protocol changes, the following proposed steps should be followed to achieve a smooth transition/migration for all affected components.

  1. Apply changes to OTLP protocol
    • Include new OTLP messages (must be "additive" and expect no incompatibilities with existing components)
      • For the Protocol Buffer protocol this means message names can change, but not numbers
      • For JSON protocol, definition is TBD. (Note: JSON is not marked stable yet)
    • Denote all deprecations semantics
      • Mark fields/messages as deprecated
      • for PB use [deprecated=true], optionally rename field to deprecated_
      • for JSON, TBD.
      • Annotate interpretation semantics for deprecated fields:
      • If necessary describe how to interpret from deprecated field to new field
      • Define who is allowed to use deprecated semantics. E.g. "As of release X, all new clients may not use this field"
    • Release changes
      • Coordinate multiple changes into appropriate batches as necessary
  2. Update all consumer components of OTLP
    • Update "core" code to use "new" fields/messages.
    • Ensure components appropriately transition from deprecated messages/fields to "new" or "replacement" fields/messages.
    • Consumers MUST NOT remove deprecated field handling until EOL of the protocol they were introduced in.
  3. Update all producers of OTLP
    • Update all producers of OTLP (SDKs/Exporters) to only produce new/replacement fields/messages
    • All producers MUST NOT use deprecated messages to declare OTLP version compliance
  4. Wait until the EOL date for the deprecation
    • Remove handling from consumer components of OTLP and release them.
    • Remove old OTLP messages marked "deprecated"
    • Release final changes
  5. Repeat as necessary
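
A consumer-side shim for step 2 might look like the following sketch. The field names (`gauge`, `int_gauge`, `double_gauge`) come from the snippets earlier in the thread; the dict shape and the helper itself are hypothetical, standing in for the decoded protobuf. The rule: prefer the new unified message, fall back to the deprecated ones, and never require both.

```python
# Hypothetical dict-shaped Metric, standing in for the decoded protobuf.
# During the transition a consumer accepts either the new `gauge` field
# or the deprecated `int_gauge`/`double_gauge` fields, preferring the new one.
def resolve_gauge(metric: dict):
    if "gauge" in metric:                      # new unified message
        return metric["gauge"]
    for old in ("int_gauge", "double_gauge"):  # deprecated messages
        if old in metric:
            return metric[old]
    return None                                # not a gauge metric

new_style = {"name": "temp", "gauge": {"data_points": [21.5]}}
old_style = {"name": "temp", "double_gauge": {"data_points": [21.5]}}
print(resolve_gauge(new_style) == resolve_gauge(old_style))  # -> True
```

Per step 2, this fallback handling stays in place until the EOL of the protocol version the deprecated fields were introduced in.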
jmacd commented 3 years ago

Please review #278.

victlu commented 3 years ago

With #278, my concern has been resolved! Thank you @jmacd .

jmacd commented 3 years ago

Let's leave this open until the PR merges 😀