open-telemetry / opentelemetry-specification

Specifications for OpenTelemetry
https://opentelemetry.io
Apache License 2.0
3.71k stars 887 forks source link

Is direction dimension a mistake in semantic conventions? #2589

Closed tigrannajaryan closed 2 years ago

tigrannajaryan commented 2 years ago

We have metrics that have a "direction" attribute, for example disk metrics.

This seems to be in contradiction to the general guidelines about attributes

"As a rule of thumb, aggregations over all the attributes of a given metric SHOULD be meaningful," as Prometheus recommends.

I am not sure what aggregation is meaningful for "direction" attribute.

Should the attribute be removed and the metrics split into 2, one for read and one for write?

It does look pretty natural to express this dimension as an attribute, it is low cardinality, it groups 2 very related measurements. Is this an indication that the guidelines need to be adjusted or this is an exception from the rule? Or perhaps there is a meaningful aggregation of "direction" that I just don't see?

reyang commented 2 years ago

+1, also in most cases the removal of this dimension would result in meaningless value (e.g. what does a "total read/write bytes" mean for a disk?)

In addition, many disk/NICs have extra operations beyond read and write (I just took a screenshot from my laptop):

image

bogdandrutu commented 2 years ago

@tigrannajaryan does the aggregation not represent "total" bytes send/received over the network? Is that not meaningful?

tigrannajaryan commented 2 years ago

@tigrannajaryan does the aggregation not represent "total" bytes send/received over the network? Is that not meaningful?

Yes, probably meaningful for network (e.g. total bytes/throughput can be interesting to know as a percentage of bidirectional bandwidth available for half-duplex physical connections).

Is it also meaningful for disks? I am not sure what's the meaning of total combined bytes written and read for disk. Sure, both are bytes flowing from/to the same device, maybe that's useful somehow?

I am not against keeping the direction if we think it is meaningful (but see also @reyang comment which shows we do not fully understand the allowed values for the dimension).

codeboten commented 2 years ago

Speaking specifically to @reyang's example, it would appear that aggregating read/writes/other would be meaningful as per the following description of "I/O other bytes" :

"I/O Other Bytes" is "The number of bytes transferred in input/output operations generated by a process that are neither reads nor writes, including file, network, and device I/Os

jmacd commented 2 years ago

I believe direction was a mistake in our semantic conventions. I support undo-ing it. Prometheus documentation addresses this specifically:

Read/write and send/receive are best as separate metrics, rather than as a label. This is usually because you care about only one of them at a time, and it is easier to use them that way.

I consider this a good exception to the rule otherwise given by Prometheus quoted above:

As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics.

The issue about directional transfers tests this rule: we are able to find ways that the sum() of network traffic is meaningful and even useful. The usefulness/meaningfulness needs to be weighed against potential harm due to "uselessness". If we believe "usually you care about only one of them at a time", and we believe the common mode of viewing metrics is to aggregate them, there's real harm in having a direction label as it will obscure two useful things in a commonly-useless way.

Is there an underlying logic we can use for similar situations? Here's one answer: in what sense are network bytes read and written the same? In what sense are they different? It makes sense to sum() or rate(sum()) these counters when they are the same, and it does not make to sum() or rate(sum()) these counters when they are not the same. Of course, a byte is a byte, so the sum is meaningful (units are "By"). However, if a metric is being used as a proxy variable for load -- and the load induced by a byte read is different than the load induced by a byte written, then it is harmful/useless to combine them (units are "{bytes_read}" and "{bytes_written}").

carlosalberto commented 2 years ago

I believe direction was a mistake in our semantic conventions. I support undo-ing it.

At least we should start by addressing this with upstream metrics addition (and yes, later on update the existing ones ;) )

bertysentry commented 2 years ago

I have mixed feelings about this. If you take the network traffic example, you can calculate the bandwidth utilization with something like: max(rate(system.network.io)) / hw.network.speed, where the max() function applies to each direction of the system.network.io metric, for a full-duplex link.

I don't see how we could calculate the same bandwidth utilization metric from system.network.io.receive and system.network.io.transmit.

Also, as noted by @reyang, having separate metrics is reasonable when you have a strictly defined set of dimensions. It does fit well with network (receive vs transmit), but doesn't even for a simple metric like system.disk.io.

Last point: What are the drawbacks of using a direction attribute versus a separate metric? As explained above and by others, there are a few minor benefits of using the direction attribute, but I can't see anything in the "cons" column. Performance? Readability? Consistency?

bertysentry commented 2 years ago

This change will require changes in 7 receivers for consistency:

reyang commented 2 years ago

During the Aug. 16th, 2022 Specification SIG meeting, there were discussions about network metrics - based on the comment here folks believe that Disk I/O should NOT have a dimension/attribute called direction, the question is whether Network I/O should do the same or not, and issue #2726 suggested Network I/O should keep direction.

Here are what I've found so far about Windows operating systems (NT kernel based):

  1. "Network" is a vague term, NT models network performance counters as several things:
  2. Process only has I/O Read/Write/Other Bytes/Operations, it doesn't have direct connection to the underlying Network Interface nor Network Adapter. And it does have "Other Bytes" (I can imagine a process trying to listen on a TCP port, which is considered as control bytes but not Rx/Tx).
sebastien-rosset commented 1 year ago

Shouldn't direction be renamed to something like hw.io_direction?

bertysentry commented 1 year ago

@sebastien-rosset I don't see a reason to use a specific attribute for hardware metrics. direction is specified at the general level for IO metrics so it shall be used in specific semantic conventions that cover IOs as well.

sebastien-rosset commented 1 year ago

The spec states the instrument should have attributes for direction; oddly, plural is used, and it's not stating the attribute should be named direction.

io - an instrument that measures bidirectional data flow should be called entity.io and have attributes for direction. For example, system.network.io.

That would also break the semantic rules:

  1. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/semantic_conventions/README.md#general-guidelines and
  2. https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/common/attribute-naming.md#attribute-naming

Metric names and attributes exist within a single universe and a single hierarchy. Metric names and attributes MUST be considered within the universe of all existing metric names. When defining new metric names and attributes, consider the prior art of existing standard metrics and metrics from frameworks/libraries.

Perhaps it should be hw.io_direction. Another namespace could have a different concept of "direction" such as up, down, or East, West, left, right, etc. For example a wind sensor could have a direction attribute.

In addition, the spec states:

Names SHOULD NOT coincide with namespaces. For example if service.instance.id is an attribute name then it is no longer valid to have an attribute named service.instance because service.instance is already a namespace. Because of this rule be careful when choosing names: every existing name prohibits existence of an equally named namespace in the future, and vice versa: any existing namespace prohibits existence of an equally named attribute key in the future.

Defining lots of attributes with no name hierarchy has the side effect of preventing any future use of that word as a namespace. Currently, I can see these attributes are declared with a flat naming structure: direction, vendor, model, serial_number, vendor, firmware_version, bios_version, driver_version, firmware_version, id, name, parent, chemistry, capacity, limit_type, state, sensor_location, type, task, raid_level, logical_addresses, physical_addresses, smart_attribute.

capacity is defined as the capacity in Watts-hours or Amper-hours, which is highly specific to batteries. However, the word capacity is contextual, it could mean something entirely different in another namespace. It seems that by allowing names to be flat, more flat-names will be created in short order, which inevitably will lead to collisions or other naming problems. At some point, shouldn't there be a registry of namespaces, similar to what was done at the IETF or IEEE?

Under the semantic guidelines, HTTP, RPC, Database, System, Process, Runtime Environment, FaaS are consistently using hierarchical names (with state noted as an exception). Only "Hardware" uses flat names liberally.

Out of curiosity, I looked at all the possible values for state:

state: idle | user | system | ... state: used | cached | free | ... state: idle, used (for database connection pool) state: idle, user, system, interrupt, etc state: used, free, cached, etc. state: used, free, reserved state: close, close_wait, closing, delete, established, fin_wait_1, fin_wait_2, last_ack, listen, syn_recv, syn_sent, time_wait. state: system, user, wait state: ok, degraded, failed state: charging, discharging state: ok, degraded, failed, charging, discharging state: ok, degraded, failed, predicted_failure state: ok, degraded, failed, open state: remaining state: ok, degraded, failed, needs_cleaning

sebastien-rosset commented 1 year ago

I created #3129 since it's a different issue