Open rogercoll opened 1 month ago
Looking into this again, I find it weird that we encode the Linux
information within the metric name.
Whether a metric is supported by an OS is an implementation detail that might change in the future.
We hit the same issue with some host resource attributes, such as host.cpu.model.id,
which the collector skips: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/5601dfb28d905d0802f2e450cad6a2cc80029b6b/processor/resourcedetectionprocessor/internal/system/system.go#L202-L207. Should we encode the OS information in these too?
@open-telemetry/semconv-system-approvers check me here, but wouldn't it be just fine if we documented which operating systems support a metric and kept the name unified?
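To make the "document the supported OSes, keep the name unified" idea concrete, here is a rough sketch of what that could look like as metadata. Everything in it is hypothetical: the `MetricDoc` type, the `REGISTRY` contents, the unified name `system.memory.available`, and the OS sets are illustrative assumptions, not anything defined in semconv or the Collector.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricDoc:
    """Hypothetical documentation record for one metric (not a semconv type)."""
    name: str
    supported_os: frozenset


# Illustrative entries only: the names and OS sets are assumptions for this sketch.
REGISTRY = {
    "system.memory.available": MetricDoc(
        "system.memory.available", frozenset({"linux"})
    ),
    "process.handles": MetricDoc("process.handles", frozenset({"windows"})),
}


def is_supported(metric_name: str, os_name: str) -> bool:
    """Return True if the metric is documented as supported on the given OS."""
    doc = REGISTRY.get(metric_name)
    return doc is not None and os_name.lower() in doc.supported_os
```

With this shape, adding support for another OS later means changing the documented set, not renaming the metric.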
This was discussed in the semantic conventions meeting on Monday Sept 23 2024. The following is the summary.
We started by discussing the current split in precedent between the OS being present in the name or not. Some OS-exclusive attributes and metrics have the OS in the name and others don't.
The goal is to come to a resolution about whether we should go one way or the other.
First, there was some discussion about whether the different OS's should have their own namespaces, as done in linux.memory.slab.state. This decision may line up with how this played out for language runtimes. For example, what used to be runtime.jvm.* eventually became jvm.*, and similarly for other languages in semconv.
Then we discussed the metrics/attributes that actually do have the OS in their names. I mentioned that for certain things like system.linux.memory.available and the proposed process.windows.handles, the OS was included in the metric names to make clear, directly in the name, that these metrics map to OS-specific concepts, lest they be mistaken for more generally applicable concepts when they can't be.
(CC @lmolkova and @jsuereth to ensure I'm not misrepresenting)
Whether the OS should be in a name or not should come down to a judgement call based on the expected use case of the metric/attribute.
The goal of the System Semconv Working Group should be to apply our expertise in the observability of hosts/systems to recommend the observability use cases we want to cover. Things like "how healthy is my system", "how much memory is my system using", "are we close to crashing"; the highest-level use cases that we expect users to run into. From those we should come up with the broader API that applies on any system: the most basic metrics that will allow users to observe the most common scenarios. When we have these kinds of metrics that are crucial and relatively general to any platform, we likely should not include the OS in their names, and they should comprise our generally recommended set of instrumentation.
However, we also want to make it so the conventions we provide also cover deeper concepts. The way I like to think of it is the kinds of information that a Linux/Windows sysadmin would know about the system they specialize in and gather themselves. We want to ensure our conventions can support them if they need to go deeper. In these cases, APIs or concepts that are very specific to an operating system, such as the Linux kernel's Memory Available information or Win32 Process Handles, can include the operating system names, and will not be included in the most general recommended instrumentation. (That being said, it's possible we provide recommended instrumentation packages for specific operating systems, i.e. "here's the instrumentation you should use on a Windows system" and supply special Windows dashboards or alerts etc).
CC @open-telemetry/semconv-system-approvers
(Me speaking now) I think we should consider the instrumentation we define as two different buckets:
For much of this working group's time we've struggled with how Bucket 2 sometimes clashes with the way semconv metrics should work. Based on prior experience doing general observability of hosts as well as maintaining instrumentation in the Collector, we've found that there do exist users who want Bucket 2 information because they know exactly what they are looking for. I don't think we should leave them in the dust; we should make sure we provide conventions for them so that an OTel instrumentation provider knows how to provide that specific information. But we should try to delineate them from the more general recommended instrumentation, which should strive to be as user-friendly as possible and really uphold the semantic conventions vision. (This isn't to say Bucket 2 should flagrantly break rules, but to accept the exceptional situation they are in where we want to define direct mappings to OS concepts.)
We have already started discussions about moving towards stability and making decisions about what exactly our recommended set of instrumentation comprises. I think coming up with the most important use cases we want to cover will help us determine which metrics fall into which bucket as described above. Once we have a grasp on that separation, it will guide us on whether the OS should be in a given metric name or not.
There are a number of metrics in question related to this issue (existing ones like host.cpu.model.id and system.linux.memory.available, and upcoming ones like process.handles and process.cgroup). Our decision on whether the OS name should be in any of those should stem from an audit of our existing and upcoming instrumentation, deciding whether each one falls into the General or Specific use cases.
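As a starting point for that audit, something like the following could list which names currently carry an OS segment and which don't. This is only a sketch: the `OS_SEGMENTS` set and the helper names are my own assumptions, and it only reports the current naming. Whether a given metric belongs in the General or Specific bucket is still a judgement call.

```python
# Assumed set of OS identifiers that may appear as a namespace segment.
OS_SEGMENTS = {"linux", "windows", "darwin"}


def os_in_name(name: str):
    """Return the OS segment found in a dotted name, or None if there is none."""
    for segment in name.split("."):
        if segment in OS_SEGMENTS:
            return segment
    return None


def audit(names):
    """Group names by whether an OS segment appears in them."""
    report = {"os_in_name": [], "no_os_in_name": []}
    for name in names:
        key = "os_in_name" if os_in_name(name) else "no_os_in_name"
        report[key].append(name)
    return report


# Names taken from this thread.
names = [
    "host.cpu.model.id",
    "system.linux.memory.available",
    "process.handles",
    "process.cgroup",
    "process.windows.handles",
    "linux.memory.slab.state",
]
report = audit(names)
```

The `no_os_in_name` group is where the manual review matters most, since it mixes genuinely general names with OS-specific ones that simply lack the prefix.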
Discussed on the 2024-10-10 System semantic conventions WG meeting, the next steps would be to
We also discussed existing conventions and which bucket they fall into:
Area(s)
area:system
Is your change request related to a problem? Please describe.
Should system semantic conventions attributes and/or metrics specific to an OS include it as a namespace? At the moment, there are examples of both options:
With {os} namespace
system.linux.memory.available
Without the {os} namespace
process.owner; Windows specific
process.open_file_descriptor.count; Linux specific metric
Describe the solution you'd like
A decision on whether all attributes/metrics specific to an OS should contain the OS name in their namespace.
This concern was raised during a @open-telemetry/semconv-system-approvers SIG while considering the implications of using the linux prefix for the process.cgroup attribute.
Describe alternatives you've considered
No response
Additional context
No response