rogercoll commented 1 month ago

Area(s)

area:system

Is your change request related to a problem? Please describe.

Should system semantic convention attributes and/or metrics that are specific to an OS include the OS as a namespace? At the moment, there are examples of both options:

With {os} namespace

linux.memory.slab.state attribute
system.{os} metrics; an example would be system.linux.memory.available

Without the {os} namespace

process.executable.build_id.gnu attribute; gnu specific
process.owner; Windows specific
process.open_file_descriptor.count; Linux specific metric

Describe the solution you'd like

A decision on whether all attributes/metrics specific to an OS should contain the OS name in their namespace.

This concern was raised during a @open-telemetry/semconv-system-approvers SIG meeting while considering the implications of using the linux prefix for the process.cgroup attribute.

Describe alternatives you've considered

No response

Additional context

No response

ChrsMark commented 1 month ago

Looking into this again, I find it weird that we encode the linux information within the metric name. Whether a metric is supported by an OS is an implementation detail that might change in the future.

We hit the same issue with some host resource attributes, such as host.cpu.model.id, which the collector skips: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/5601dfb28d905d0802f2e450cad6a2cc80029b6b/processor/resourcedetectionprocessor/internal/system/system.go#L202-L207. Should we encode the OS information in these too?

@open-telemetry/semconv-system-approvers check me here, but wouldn't it be just fine if we documented which operating systems a metric is supported on and kept the name unified?
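
To make the "document the supported OS, keep the name unified" alternative concrete, here is a minimal sketch. The `Convention` type, the `supported_os` field, and the `available_on` helper are all hypothetical and exist only for illustration; they are not part of semconv or the collector. The OS assignments are the ones given in the issue description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Convention:
    """Hypothetical record: a unified name plus documented OS support."""
    name: str
    supported_os: frozenset


# OS support as stated in the issue description, not verified elsewhere.
CONVENTIONS = [
    Convention("process.owner", frozenset({"windows"})),
    Convention("process.open_file_descriptor.count", frozenset({"linux"})),
]


def available_on(os_name, conventions=CONVENTIONS):
    """Return the names documented as supported on the given OS."""
    return [c.name for c in conventions if os_name in c.supported_os]


print(available_on("linux"))
print(available_on("windows"))
```

With this shape, adding support for another OS later only changes the documented metadata; the name itself stays the same.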

braydonk commented 1 month ago

Semantic Conventions Meeting Sept 23, 2024 Summary

This was discussed in the semantic conventions meeting on Monday Sept 23 2024. The following is the summary.

Dual precedent

We started by discussing the current split in precedent between names that include the OS and names that don't. There are multiple OS-exclusive attributes and metrics both with and without the OS in the name.

linux.memory.slab.state - linux only
system.linux.memory.available - linux only
process.owner - windows only
process.cgroup (proposed in semconv, exists in the collector) - linux only
process.handles (proposed in semconv, exists in the collector) - windows only

The goal is to come to a resolution about whether we should go one way or the other.

Should there be an OS root namespace?

First, there was some discussion about whether the different OSes should have their own root namespaces, as done in linux.memory.slab.state. This decision may line up with how this played out for language runtimes. For example, what used to be runtime.jvm.* eventually became jvm.*, and similarly for other languages in semconv.

Should the OS be in metric and attribute names?

Then we discussed the metrics/attributes that do have the OS in their names. I mentioned that for certain things like system.linux.memory.available and the proposed process.windows.handles, the OS was included in the metric name to make clear, directly in the name, that these metrics map to OS-specific concepts, lest they be mistaken for more generally applicable concepts when they can't be.

Recommendation from Semconv Maintainers

(CC @lmolkova and @jsuereth to ensure I'm not misrepresenting)

Whether the OS should be in a name or not should come down to a judgement call based on the expected use case of the metric/attribute.

The goal of the System Semconv Working Group should be to apply our expertise in the observability of hosts/systems to provide a recommendation of the observability use cases. Things like "how healthy is my system", "how much memory is my system using", "are we close to crashing"; the highest level use cases that we expect users to run into. From those we should come up with the broader API that applies on any system: the most basic metrics that will allow users to observe the most common scenarios. When we have these kinds of metrics that are crucial and relatively general to any platform, we likely should not include the OS in their names, and they should comprise our generally recommended set of instrumentation.

However, we also want the conventions we provide to cover deeper concepts. The way I like to think of it is the kinds of information that a Linux/Windows sysadmin would know about the system they specialize in and gather themselves. We want to ensure our conventions can support them if they need to go deeper. In these cases, APIs or concepts that are very specific to an operating system, such as the Linux kernel's Memory Available information or Win32 Process Handles, can include the operating system name, and will not be included in the most general recommended instrumentation. (That being said, it's possible we provide recommended instrumentation packages for specific operating systems, i.e. "here's the instrumentation you should use on a Windows system", and supply special Windows dashboards or alerts etc.)

CC @open-telemetry/semconv-system-approvers

(Me speaking now) I think we should consider the instrumentation we define as two different buckets:

Bucket 1: General concepts that cover the most important use cases
Bucket 2: Specialized information that maps directly to operating system concepts

For much of this working group's time we've struggled with how Bucket 2 sometimes clashes with the way semconv metrics should work. Based on prior experience doing general observability of hosts as well as maintaining instrumentation in the Collector, we've found that there do exist users who want Bucket 2 information because they know exactly what they are looking for. I don't think we should leave them in the dust; we should make sure we provide conventions for them so that an OTel instrumentation provider knows how to provide that specific information. But we should try to delineate them from the more general recommended instrumentation, which should strive to be as user friendly as possible and really uphold the semantic conventions vision. (This isn't to say Bucket 2 should flagrantly break rules, but to accept the exceptional situation they are in where we want to define direct mappings to OS concepts.)

We have already started discussions about moving towards stability and making decisions about what exactly our recommended set of instrumentation comprises. I think coming up with the most important use cases we want to cover will help us determine which metrics fall into which bucket as described above. Once we have a grasp on that separation, it will guide us on whether the OS should be in a given metric name or not.

There are a number of metrics in question related to this issue (existing ones like host.cpu.model.id and system.linux.memory.available, and upcoming ones like process.handles and process.cgroup). Our decision on whether the OS name should be in any of those should stem from auditing our existing and upcoming instrumentation and deciding whether each item falls into the General or Specific use cases.

mx-psi commented 1 month ago

This was discussed at the 2024-10-10 System semantic conventions WG meeting. The next steps would be to:

add guidance to the contributing guidelines on how to decide which bucket a convention falls into
create a separate issue for auditing existing conventions and applying this rule

We also discussed existing conventions and which bucket they fall into:

linux.memory.slab.state and system.linux.memory.available look like they fall into bucket 2
process.owner seems to fall into bucket 1; we may want to rename it to a more platform-agnostic name
Both process.cgroup (proposed in semconv, exists in the collector) and process.handles (proposed in semconv, exists in the collector) look like they would fall into bucket 2
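
The bucket assignments above can be turned into a small audit sketch. This is only an illustration: the `Bucket` enum, `OS_TOKENS`, and the `audit` helper are hypothetical names, and the rule that a bucket 2 name without an OS token is a "candidate" for one is an assumption, since the recommendation only says such names can include the OS.

```python
from enum import Enum


class Bucket(Enum):
    GENERAL = 1      # Bucket 1: general concepts, most important use cases
    OS_SPECIFIC = 2  # Bucket 2: maps directly to an OS concept


# Hypothetical set of name segments treated as OS markers.
OS_TOKENS = {"linux", "windows"}

# Tentative classification as recorded in the 2024-10-10 meeting notes.
CLASSIFICATION = {
    "linux.memory.slab.state": Bucket.OS_SPECIFIC,
    "system.linux.memory.available": Bucket.OS_SPECIFIC,
    "process.owner": Bucket.GENERAL,
    "process.cgroup": Bucket.OS_SPECIFIC,
    "process.handles": Bucket.OS_SPECIFIC,
}


def has_os_token(name):
    """True if any dot-separated segment of the name is an OS marker."""
    return any(part in OS_TOKENS for part in name.split("."))


def audit(classification):
    """Flag names whose OS marker does not match their bucket."""
    findings = {}
    for name, bucket in classification.items():
        if bucket is Bucket.OS_SPECIFIC and not has_os_token(name):
            findings[name] = "candidate for including the OS name"
        elif bucket is Bucket.GENERAL and has_os_token(name):
            findings[name] = "candidate for dropping the OS name"
    return findings


for name, note in audit(CLASSIFICATION).items():
    print(f"{name}: {note}")
```

Under these assumptions only process.cgroup and process.handles are flagged, which matches the open question about whether they should carry a linux or windows segment.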

open-telemetry / semantic-conventions

Clarify OS specific system attributes/metrics namespace #1403