open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.73k stars 888 forks source link

Mistake while applying the logic in #2617 from #2589: process.network.io and system.network.* should not have been split #2726

Closed bogdandrutu closed 2 years ago

bogdandrutu commented 2 years ago

The issue https://github.com/open-telemetry/opentelemetry-specification/issues/2589 explains why the direction for disk is not the right thing, which makes total sense.

But we got that blindly and apply for every "attribute" called "direction" without discussing if that makes sense or not, lucky I thing that is not yet released, and hopefully not implemented by users. As an example the logic in open-telemetry/opentelemetry-specification#2589 does not apply for network metrics, why did we deprecate that? It makes total sense to know the total network bytes.

PS: If you have an argument why I did not say anything, is because in June/July I was PTO, so ...

bogdandrutu commented 2 years ago

/cc @open-telemetry/technical-committee I think this should block the spec release until we clarify this.

tigrannajaryan commented 2 years ago

@bogdandrutu just to clarify, are you saying https://github.com/open-telemetry/opentelemetry-specification/pull/2617 should be reverted? That PR split process.disk.io and process.network.io by eliminating the direction attribute. Do you think both of this metric splitting are wrong?

jmacd commented 2 years ago

In the discussion surrounding this issue, we debated the usefulness and meaningfulness of either having or not having a direction attribute vs having or not having independent metrics. The reason I supported this "change of direction" 😁 is that we have a strong precedent in existing instrumentation favoring separate metrics.

Migration into OTel from all of the old metrics technologies, where precedent favors separate metrics, was looking difficult. We concluded that the meaningfulness was outweighed by a strong precedent and low utility. If the question boils down to "how often do you find yourself plotting the sum of directions?", the answer was "not often". If the question boils down to "how much value is there in binding the two directions together in one metric?" the answer was "not much".

Since the PR in question merged, I found another potential reason to oppose the "change of direction". If we're going to review the arguments, here's another one. In OTEP 176 Columnar Encoding for OTLP specifically this section, @lquerel observes that in a multi-variate metrics system, it makes a lot of sense to maintain the connection that is being lost in #2617. In the example used in the OTEP, it's the state attribute that is considered significant.

For each of these states, the metrics share the same attributes, timestamp, ... Taken individually, these metrics don't make much sense. Knowing the free memory without knowing the used memory or the total memory is not very informative.

Would we make the same statement about network bytes? IMO the case is not as strong, but I'm speaking as a person who has never experienced networking hardware become a bottleneck. It's true that you could identify machines with overloaded NICs by examining the sum of the two metrics.

To address this in the case of system.memory.usage, the OTEP proposes to treat the state attribute as special in the same way that direction is potentially special. We do not have a name for this conceptual property of the state and direction attributes, the fact that erasing these attributes will generally leave you with a constant-value metric. We would have to add system.io.disk.idle (or direction=idle) to indicate how much unused network capacity, to make this a constant sum like system.memory.usage.

The point being made by the OTEP is that when converting from uni-variate to multi-variate metrics, these attributes are distributed across columns of a single row and instead of contributing to multiple rows.

This leaves me undecided because we could encode the same relationship between univariate metrics in several ways. Even if we accept the point made in the OTEP, a naming rule or other metadata about the instruments could inform the system about these relationships. For example, we don't necessarily need to encode this information in the attributes; if we have metrics named system.io.disk.read and system.io.disk.write, we could write a schema file or create a naming convention to establish the connection (e.g., "metrics named system.io.disk. are a limited-sum group", "metrics named system.memory.usage. are a constant-sum group").

bogdandrutu commented 2 years ago

@bogdandrutu just to clarify, are you saying https://github.com/open-telemetry/opentelemetry-specification/pull/2617 should be reverted? That PR split process.disk.io and process.network.io by eliminating the direction attribute. Do you think both of this metric splitting are wrong?

@tigrannajaryan I think based on the discussion in the issue, the disk metric should be split, but the network metric should not be split and reverted.

@jmacd I think there are lots of reasons to split or not. For me we have a rule and the rule says that if the total is meaningful we should do this way, and for the network it is.

Also grouping them as a metric helps with things like "load balancing/sharding", so you get all your points in the same shard, and you require a more complicated sharding scheme in this case for example. And other places where this grouping is lost for no good reasons.

An extra argument is that, points grouped in one metric can be much easier split into different metrics than vice-versa, and we should not emit them separately, because will become almost impossible to re-group them, especially in a distributed system.

tigrannajaryan commented 2 years ago

@tigrannajaryan I think based on the discussion in the issue, the disk metric should be split, but the network metric should not be split and reverted.

@reyang you were able to find for disk I/O metrics that there is not only "read" and "write" data points but also "other" data point. Do you know if something similar exists for network I/O metrics?

bertysentry commented 2 years ago

We need to have the same logic for disk and network. Sometimes network and storage metrics blend, like for iSCSI protocol, FC adapters, etc.

Also, we can't go back and forth with these specifications. Personal anecdote: my company started developing a hardware monitoring specialized Otel collector, first using the direction attribute as specified in the general convention, then split as specified in open-telemetry/opentelemetry-specification#2617. Going back again to direction is going to be annoying. I guess it will be also annoying to other developers trying to keep up.

There are pros and cons for both options, but we have to stick to one.

tigrannajaryan commented 2 years ago

@open-telemetry/specs-approvers This needs to be discussed in the Spec SIG. The PR in question that may need to be reverted is https://github.com/open-telemetry/opentelemetry-specification/pull/2617

tigrannajaryan commented 2 years ago

@codeboten FYI.

tigrannajaryan commented 2 years ago

Discussed in Spec SIG. @reyang will try to find an example of NIC metric where there is also another direction, other than "read" or "write".

reyang commented 2 years ago

I've put my findings here https://github.com/open-telemetry/opentelemetry-specification/issues/2589#issuecomment-1216820303.

bogdandrutu commented 2 years ago

@reyang @tigrannajaryan one thing that definitely this mistake open a huge cane of warms, see https://github.com/open-telemetry/opentelemetry-specification/pull/2706 :) Should that be an attribute or not? When do we apply the "exception". I am very confused, and we are going to do very random decisions based on TC member personal's preference that approves the PR.

Because of that I continue to believe that it is a mistake and should revert, maybe including disk, until we have clear rules that I can follow when I review a PR like that.

It does not seem that we are doing progress, and we are only blocking the release, maybe it is not that bad to rollback and re-discuss that change and make according specs changes to the rules about when to use an attribute or a different metric.

reyang commented 2 years ago

@reyang @tigrannajaryan one thing that definitely this mistake open a huge cane of warms, see open-telemetry/opentelemetry-specification#2706 :) Should that be an attribute or not? When do we apply the "exception". I am very confused, and we are going to do very random decisions based on TC member personal's preference that approves that PR or does not look at that PR.

Because of that I continue to believe that it is a mistake and should revert, maybe including disk, until we have clear rules that I can follow when I review a PR like that.

I believe semantic convention is indeed subjective (e.g. should it be http.url or http.uri or just uri). Maybe we can find some clear rules in certain area that even machines could follow, in most cases I don't think it's possible to use rule-based approach here, unless we want to be super slow and conservative.

It does not seem that we are doing progress, and we are only blocking the release, maybe it is not that bad to rollback and re-discuss that change and make according specs changes to the rules about when to use an attribute or a different metric.

I consider unblocking the spec release as a high-order bit, so if we're blocked here, I would give +1 to unblock it.

tigrannajaryan commented 2 years ago

I agree that reverting is the best right now. We should revert and make a release.

carlosalberto commented 2 years ago

I am very confused, and we are going to do very random decisions based on TC member personal's preference that approves that PR or does not look at that PR.

I agree that given the current state the best is to revert, and figure out a clear recommendation on this front.

That being said, you seem to imply approvers do not review the actual content, including TC members, so please be more vocal about it, so we avoid misunderstandings.

lquerel commented 2 years ago

I think this discussion raises a fundamental issue that is not yet effectively addressed by the existing OTEL protocol, namely a native support for multivariate time-series. This is one of the improvements proposed by OTEP 156 Columnar Encoding for OTLP (not 176).

The attributes "state" and "direction" are artificial ways to encode relationships between metrics that are part of the same multivariate time-series. Unfortunately this encoding, in terms of volume of data generated, is not efficient. The attributes are duplicated for all the metrics participating in the same relationship. Moving the "direction" attribute in a declaration within a schema will not remove this duplication of attributes.

In my opinion, we are surrounded by multivariate time-series (see examples in the OTEP). It is quite common to have a metric whose behavior is influenced by a collection of other metrics (and vice versa) or that its interpretation cannot be done without the knowledge of a collection of other metrics (and vice versa). We collect these metrics in order to interpret and analyze them, so it seems to me that it is useful not to break these relationships between the metrics because otherwise it means that the transport protocol makes these tasks more complex.

My answer to this problem is OTEP 156 which proposes to represent all OTLP entities in columnar form and to extend the metric entities to natively support multivariate time series. This is possible without breaking the existing ecosystem and a phasing is proposed to 1) optimize the bandwidth, 2) optimize the processing at the collector and backends level. If you think this proposal will be approved, then I think an important thing to consider in this discussion is to choose the option that will allow to automate the reconstruction of multivariate time series in the future. I don't have a strong opinion on the "direction" attribute or the use of a statement within a schema. I think that in both cases, this information can be used to perform this reconstruction.

bogdandrutu commented 2 years ago

That being said, you seem to imply approvers do not review the actual content, including TC members, so please be more vocal about it, so we avoid misunderstandings

My English was bad, I meant to say based on a situation when some TC members that are looking at that specific PR have an opinion, and others who have a different opinion don't look at that specific PR (a.k.a a random situation).

bogdandrutu commented 2 years ago

@lquerel I don't agree with the statement that format influenced any of the split decision. The decision to split was mostly because of "total being or not being useful" or because of "ability to search/look for that", so I am confused how that OTEP will if ever influence a decision like this. I do agree that the current OTLP may have some downside in terms of wire size, but I am not sure that matters in this discussion, since OTel semantics are independent of the wire format.

PS: @jmacd you also mentioned that OTEP, still not able to see the relevance.

lquerel commented 2 years ago

@bogdandrutu I do agree with you that OTEL semantics in general have nothing to do with this OTEP. However, IMO the “state” and “direction” attributes are a bit special as they encode some kind of relationship between several metrics. It is only for this specific context that I believe there is a connection with the concept of multivariate time-series if we agree on the following informal definition. A multivariate time-series is a collection of interdependent metrics measured at the same time and sharing the same context (i.e. attributes in the OTEL context). So, if the underlying format natively supports multivariate time-series then I don’t think these two specific attributes would exist. For example in OTEP 156 the metric system.cpu.time is transformed as follows:

Existing OTLP representation (partial representation focusing on the metric definition):

Metric: {
  Name: system.cpu.time
  Data: Sum {
    DataPoints: [
      {
        Attributes: [{key: state, value: idle}, {key: cpu, value: 0}]
        StartTimeUnixNano: 123
        TimeUnixNano: 124
        Value: 0.5
      },
      {
        Attributes: [{key: state, value: user}, {key: cpu, value: 0}]
        StartTimeUnixNano: 12
        TimeUnixNano: 124
        Value: 0.4
      },
      {
        Attributes: [{key: state, value: system}, {key: cpu, value: 0}]
        StartTimeUnixNano: 123
        TimeUnixNano: 124
        Value: 0.3
      },
      ... same repetitive structure with state = {iowait, interrupt, ...}
    ]
    AggregationTemporality: ...
    IsMonotonic: ...
  }

Proposed representation:

StartTimeUnixNano: 123
TimeUnixNano: 124
Attribute: [{key: cpu, value: 0}]      <-- no state here
Metrics: {
  system.cpu.time: {  <-- metric type, desc., unit, aggr. temp., monotonicity are defined in the Arrow schema
      idle: 0.5
      user: 0.4
      system: 0.3
      ...
  },
  ... nothing prevent us here to have other additional related metrics when that make sense 
}

I may be completely off base, as I haven't followed OTEL semantics topic too much but I wanted to bring a slightly different perspective.

bogdandrutu commented 2 years ago

@lquerel great explanation, I think we are on the same page, but also I think that they proposed change (that I am calling as mistake) is moving in the opposite direction:

Your proposed representation

StartTimeUnixNano: 123
TimeUnixNano: 124
Attribute: [{key: cpu, value: 0}]      <-- no state here
Metrics: {
  system.cpu.time: {  <-- metric type, desc., unit, aggr. temp., monotonicity are defined in the Arrow schema
      idle: 0.5
      user: 0.4
      system: 0.3
      ...
  },
  ... nothing prevent us here to have other additional related metrics when that make sense 
}

The "mistaken" change proposed representation

StartTimeUnixNano: 123
TimeUnixNano: 124
Attribute: [{key: cpu, value: 0}]      <-- no state here
Metrics: {
  system.cpu.time.idle: {
      0.5
  },
  system.cpu.time.user: {
      0.4
  },
  ...
  ... nothing prevent us here to have other additional related metrics when that make sense 
}
jmacd commented 2 years ago

moving in the opposite direction

+1

This is why I changed my opinion, splitting metrics when they ought to be part of a single row in a multi-variate representation moves us in the wrong direction.

codeboten commented 2 years ago

Is there any further discussion required for this issue? If so, can we close it w/ the resolution of keeping the direction and protocol attributes as they are in the spec today?

This of course could be changed in the future, if someone comes along with a compelling enough case for splitting it out again, but I think for the short term, keeping the specification as is is preferable to going back and forth yet again.

It's also worth mentioning that if a user would like to split metrics based on those attributes, they could define their own schemas to accomplish this.

tigrannajaryan commented 2 years ago

Is there any further discussion required for this issue? If so, can we close it w/ the resolution of keeping the direction and protocol attributes as they are in the spec today?

This of course could be changed in the future, if someone comes along with a compelling enough case for splitting it out again, but I think for the short term, keeping the specification as is is preferable to going back and forth yet again.

It's also worth mentioning that if a user would like to split metrics based on those attributes, they could define their own schemas to accomplish this.

@codeboten we also reverted the splitting of the network metric, but I think that change was more justified. We reverted simply to begin from a clean slate.

We can close this issue and re-open the network metric issue (I can't find it).

codeboten commented 2 years ago

@tigrannajaryan i couldn't find an issue for it, here's the PR. I can create a new issue if that's the correct way to move forward.

tigrannajaryan commented 2 years ago

@tigrannajaryan i couldn't find an issue for it, here's the PR. I can create a new issue if that's the correct way to move forward.

Yes, please open an issue so that we can discuss and decide what we want to do about it.

codeboten commented 2 years ago

@tigrannajaryan https://github.com/open-telemetry/semantic-conventions/issues/649 is now open.

bertysentry commented 2 years ago

I'll take care of the hardware metrics

tigrannajaryan commented 2 years ago

I am closing this issue since the problem it uncovered is now fixed (in the sense that the offending PRs are reverted). Further discussion on the topic will happen on the new issues that are already created.