Open lmolkova opened 5 months ago
cc @open-telemetry/semconv-system-approvers @open-telemetry/semconv-container-approvers
I can try to address some of the problems.
> It's not clear if we'd expect to have OS-specific metrics in each of the namespaces
I don't think we've written the expectation down anywhere. There was a discussion about this in a PR from almost a year ago that was adding a platform-specific process metric, where it seemed like process.windows was the direction to take for that one metric: https://github.com/open-telemetry/semantic-conventions/pull/142#discussion_r1245574693
I'd be surprised to see more platform-specific metrics in the process namespace, but I still think the name for the metric should be process.windows.handles; there is simply no elegant way to name and describe the metric in a cross-platform way that isn't needlessly confusing; handles are a completely Windows-specific concept.
> We end up defining similar attributes in each namespace
That's true, but in the case of cpu.state the biggest challenge was that the attributes defined in each namespace were similar but had different expected values. When used on process.cpu.time, cpu.state only has 3 expected values, but it has additional, different values when used in other contexts. So the various *.cpu.state attributes were similar, but they benefit from having their minor contextual differences separated into differently named attributes, rather than trying to force it all into one shared cpu.state attribute. (I've been out for a while, but this was where the discussion was left last time I was part of it. It's possible we've worked past this already, so apologies if that's the case.)
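To make the cpu.state point concrete, here's a hedged sketch. The value sets below are illustrative approximations of the pre-merge semconv definitions, not the authoritative lists, and the helper function is invented for illustration:

```python
# Sketch: the same-named cpu.state attribute carried different expected
# values depending on the metric it was used on. The sets below are
# assumptions approximating older semconv definitions, for illustration.
CPU_STATE_VALUES = {
    "process.cpu.time": {"system", "user", "wait"},  # only 3 expected values
    "system.cpu.time": {"user", "system", "nice", "idle",
                        "iowait", "interrupt", "steal"},  # more, different ones
}

def validate_cpu_state(metric_name: str, state: str) -> bool:
    """Return True if `state` is an expected cpu.state value for the metric."""
    return state in CPU_STATE_VALUES.get(metric_name, set())
```

A single shared cpu.state attribute would have to carry the union of these sets and document per-context restrictions, which is the complexity the separate attributes avoided.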
> Do we need separate process and system metrics?
Right now there are three process metrics that don't have system equivalents:
- process.context_switches
- process.thread.count
- process.open_file_descriptor.count
Those are metrics that I would probably only expect to see on a process resource.
However, if I'm understanding correctly, the preferred state would be that instead of system and process metric namespaces, these metrics essentially would not be namespaced? So instead of system.cpu.time and process.cpu.time there would be just cpu.time? I could probably see that working, though we'd probably run into the same problem we had with merging the cpu.state attribute: how do we handle detailing the minute differences for when cpu.time is reported for a host resource vs for a process resource? I don't know which way is better; I've personally always been on the side of keeping process and system stuff completely separate and just dealing with the repetition, but I sense that's not a very popular opinion.
> Isn't system.cpu.time a sum of all process.cpu.time on the same machine?
That's true afaik, but I think the reason this separation exists in the first place is that there's the host resource, where you get system-wide metrics reported, and the process resource, where you get metrics for each individual process. The usage of the two is different, and the ability to report per-process metrics is still a very common and important use case.
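The relationship being discussed can be sketched as a quick back-of-the-envelope check. All pids and sample values below are made up; in practice the two metrics are measured independently and won't match exactly (kernel threads, sampling skew, and short-lived processes all interfere):

```python
# Minimal sketch: summing hypothetical per-process CPU time samples
# approximates the system-wide total on the same machine.
process_cpu_time = {  # made-up samples in seconds, keyed by pid
    101: 12.5,
    202: 3.0,
    303: 0.7,
}

# the "system.cpu.time is a sum of all process.cpu.time" reading
system_cpu_time = sum(process_cpu_time.values())
```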
Thank you for the context!
First, I'm not trying to push into any specific direction. I'd be happy with any outcome that would minimize confusion and duplication.
If we look into process/system metrics:
- some metrics are per-process (they are in the process namespace)
- arguably, they should not come from within the process - we have runtime-specific metrics for it (jvm.*, go, etc.)

Would we be able to unify them?
- system.cpu.time would be across all processes on the machine and would have process.pid as an attribute. It should be fine to start using resource attributes as attributes on metrics - today we just imply them, but still, without a pid attribute (or service.instance.id), process metrics are not useful.
- Would we still need to provide a cheaper machine-wide CPU time metric as a convenience, in case someone doesn't want per-process metrics? Maybe. We can do it by making process.pid disable-able and reusing the same system.cpu.time metric.
- There would be metrics that won't have pid as an attribute, e.g. the number of active|started|stopped processes - they'd happily stay under the system namespace without pid.
- Some metrics could have a required process.pid attribute if they don't make sense machine-wide.
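A rough sketch of the proposed unification, assuming a hypothetical data model where each point carries an attribute map. Attribute names follow the proposal above; the drop_pid helper and all values are invented for illustration:

```python
from collections import defaultdict

# Hypothetical data points for a single unified system.cpu.time metric:
# (attributes, value in seconds). process.pid is an optional attribute.
points = [
    ({"cpu.mode": "user", "process.pid": 101}, 10.0),
    ({"cpu.mode": "user", "process.pid": 202}, 4.0),
    ({"cpu.mode": "system", "process.pid": 101}, 2.0),
]

def drop_pid(points):
    """Re-aggregate points as if process.pid had been disabled."""
    merged = defaultdict(float)
    for attrs, value in points:
        key = tuple(sorted((k, v) for k, v in attrs.items()
                           if k != "process.pid"))
        merged[key] += value
    return dict(merged)

# With pid disabled, per-process series collapse into machine-wide ones.
collapsed = drop_pid(points)
```

The appeal of the proposal is visible here: the cheaper machine-wide metric is just the same metric with one attribute disabled, rather than a second metric name.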
What problems would it solve:
- no duplication between system.linux.* and process.linux.* - we can just use linux and windows for OS-specific metrics

I'm sure there could be some problems with this approach too. Happy to hear any feedback.
> I don't know which way is better; I've personally always been on the side of keeping process and system stuff completely separate and just dealing with the repetition, but I sense that's not a very popular opinion.
> That's true afaik, but I think the reason this separation exists in the first place is that there's the host resource, where you get system-wide metrics reported, and the process resource, where you get metrics for each individual process. The usage of the two is different, and the ability to report per-process metrics is still a very common and important use case.
If I remember correctly, that's the main reasoning the System Metrics WG has arrived at so far.
> system.cpu.time would be across all processes on the machine and would have process.pid as an attribute.
An equivalent Node vs Pod example would imply reporting something like k8s.cpu.time with a k8s.pod.uid attribute 🤔? That said, I believe that having system and process namespaces is based on the fact that they are different entities, and users are just fine with that.
> Maybe. We can do it by making process.pid disable-able and reusing the same system.cpu.time metric.
What would happen if users decide to switch from one option to the other? It's still not clear to me how the options would look, but I guess that could end up being more complicated for users compared to the current distinction?
Also, what would be the cardinality impact and query-load impact of this?
> Both are reported from inside the system and are based on OS measurements. This suggests that the component that records them should probably be the same.
I disagree on this point. They are both reported from inside the system, but some are about the entire system itself and some are about each individual process. They are describing distinct entities.
> some metrics are per-process (they are in the process namespace); arguably, they should not come from within the process - we have runtime-specific metrics for it (jvm.*, go, etc.)
I might be misunderstanding this one, but there are a few process-specific metrics that do not apply to runtimes. I also think it's untenable to create semantic conventions for every possible runtime; there should be a generic system-level option. There's lots of precedent for monitoring processes directly from the system, as it can be a good fallback.
> system.cpu.time would be across all processes on the machine and would have process.pid as an attribute.
Is this to say that these metrics would all be reported under the host resource, each with a process.pid attribute to separate the time series? Unfortunately, I don't think this would turn out well. There are quite a few resource attributes for a process; having to spread those across every single per-process metric would be extremely inefficient compared to having one process resource and recording all the metrics under it.
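The inefficiency concern can be sketched with made-up numbers, assuming OTLP-style encoding where resource attributes are written once per resource while data point attributes repeat on every point. All counts below are illustrative assumptions, not measurements:

```python
# Back-of-the-envelope sketch of encoding cost for one export batch.
def attr_encodings(n_processes, n_metrics, resource_attrs, point_attrs):
    """Count attribute key/value pairs encoded in one batch."""
    # one resource per process: its attributes are written once
    resource_cost = n_processes * resource_attrs
    # data point attributes repeat on every metric point of every process
    point_cost = n_processes * n_metrics * point_attrs
    return resource_cost + point_cost

# process-as-resource: ~8 resource attrs once, 1 point attr per metric
as_resource = attr_encodings(100, 10, resource_attrs=8, point_attrs=1)
# flattening all process attrs (plus pid) onto every data point instead
flattened = attr_encodings(100, 10, resource_attrs=0, point_attrs=9)
```

Under these assumed numbers the flattened layout encodes several times more attribute pairs, which is the "extremely inefficient" point above.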
> as a user I don't need to wonder which metrics I should collect: process, system, both?
I would say this is actually a very important decision that we should expect users to make. system metrics are generally 1 set of metrics for 1 resource (the host), which means they have no growing cardinality, whereas process metrics are 1 set of metrics for N resources, where N is the number of processes on the system. That cardinality is very large and unpredictable. It might be confusing for users; these metrics map very directly to the actual information coming from the system, and that information is on its own hard to understand. But given the cardinality implications, it's important that users can easily understand their options. Adding to that, I think the idea of disabling all metrics under a process namespace to not collect per-process metrics is much clearer to a user than disabling a particular set of process attributes that control cardinality.
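The cardinality asymmetry can be sketched with invented round numbers (metric and state counts below are not the actual semconv inventory):

```python
# Rough time-series counts for a namespace: metrics x attribute values
# x number of entities the metrics are reported for.
def series_count(metric_count, states_per_metric, entity_count):
    """Approximate number of active time series."""
    return metric_count * states_per_metric * entity_count

hosts = 1          # system metrics: one entity, fixed cardinality
processes = 400    # process metrics: N entities, unpredictable and growing

system_series = series_count(10, 5, hosts)
process_series = series_count(10, 3, processes)
```

Even with fewer attribute values per metric, the per-process namespace dominates total series count because it scales with N, which is the decision users are really making when they enable it.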
> the boundary between runtime/process/system will be more clear (runtime - inside the process, system - from outside the process)
I think there is a third boundary there. And I think the current semantic conventions map to these boundaries pretty directly.
I think any direction forward should absolutely keep process as a resource in its own right. It makes sense as its own resource, separate from host. However, I could see a path forward where certain metrics that are in both are merged. For example, taking these metrics that are in both namespaces:
- memory.usage
- disk.io
- network.io
- cpu.time
- cpu.utilization
- paging.faults

These could be moved into a shared namespace (or into individual namespaces) and then have different meanings when reported under a host resource vs under a process resource. I'm not sure how easy it would be to use the semantic convention tooling to generate documentation clear enough about what those metrics mean when reported under different resources, but assuming that sort of thing is possible, I could see a future where that works out.
Re: the resource metrics are reported under

It's not documented in the semantic conventions - at least I don't see any mention of it on the process metrics, and it's not clear whether "Resource attributes related to a host, SHOULD be reported under the host.* namespace" applies to system metrics.
We should be able to document the attributes applicable to each metric regardless of the unification. If they are not specified, someone could report a process metric without adding any resource attributes, or while adding some other, non-process ones.
By documenting specific attributes we'd also make the cardinality of the metric clear.
So, if we explicitly document the attributes we expect on these metrics, we could also explain that it does not matter how those attributes are populated (as resource attributes or as regular metric attributes).
With this, the attachment to a resource no longer applies.
E.g. process.cpu.time should have at least three attributes:
- cpu.state|mode
- process.pid
- host.id

system.cpu.time is then the same metric, but without the pid:
- cpu.state|mode
- host.id
Re: boundaries and who measures things

I don't understand the boundary between runtime and process from a semantic convention standpoint.
E.g. if I'm designing .NET runtime metrics, should I report process.cpu.time or dotnet.cpu.time? The answer we have from Java is the latter (since cross-runtime unification is almost impossible). Or maybe both, so that someone could build cross-language dashboards and alerts?
Could/should I report them from the same instrumentation inside the process? Then the resource they are attached to is a random thing users decided to configure, which may or may not include host, process, etc.
If I report process metrics from inside the process, do I report just this process or all processes in the system? What if I use the collector?
User experience

The current path to success seems to look like:

To decide what you need, you have to:

I agree that some of this is inevitable, but as a user I would not like the lack of clarity and the absence of a simple default experience I can start with.
> It's not documented in the semantic conventions - at least I don't see any mention of it on the process metrics, and it's not clear if this applies to system metrics.
It definitely should be. The intention is for all metrics in the process metrics document to be reported under the process resource. I can make that change, assuming there is a way to do that with the semconv tooling.
> I don't understand the boundary between runtime and process from a semantic convention standpoint.
In my eyes they are completely different, but given what we have actually written today, I can see it's not very clear.
The resource attributes and metrics in the process namespace are intended to map directly to the concept of a process in an operating system. These metrics aren't intended to be reported by a process itself. Instrumentation that uses these metrics should realistically be system-level instrumentation that uses the OS's facilities for reading information about all processes on the system. As such, the metrics in the process namespace are designed exclusively to be reported under a process resource, which contains other useful information about that process on the OS.
This much isn't clear from the current docs generated from the semconv yaml; I don't know if it used to be with the handwritten docs. Is there a way to make this clearer using tooling in a way we aren't currently, or should I write something manually somewhere to make it clearer?
User experience

I think with the above clarifications, which are currently missing from the semconv docs, the experience is much more straightforward.
I don't see how container and runtime metrics are intertwined with these decisions; they seem separate. If the user is using particular runtimes or containers, then they should use special instrumentation for those. But the instrumentation for system and process metrics is generally OS-level, like the hostmetricsreceiver in the OTel Collector.
On the semconv yaml definitions and tooling:
You can just list the attributes that should be reported on a metric. There is no way to say that a metric should be reported under a specific resource, and it would not be precise enough anyway.
I.e. if someone specified process.executable.name and process.owner, that would not identify a process uniquely. To build a dashboard we'd need at least process.pid there (plus the executable name and maybe other things). But having all process attributes is not necessary either.
There is no separation between resource attributes vs regular attributes in the semantic conventions. Also, if someone wants to report the metric and add attributes explicitly on each measurement instead of using resources, that would be totally fine.
I think having those specified would be a great improvement.
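A small sketch of the "resource vs regular attributes" equivalence described above, assuming a simplified model where a time series is identified by the union of both attribute sets (the helper name and values are made up):

```python
# Whether attributes arrive as resource attributes or as per-point metric
# attributes, the identity of a series is the merged set of both, so
# documenting the expected attributes works the same way in either case.
def series_identity(resource_attrs, point_attrs):
    """Merge resource and data point attributes into one identifying key."""
    merged = {**resource_attrs, **point_attrs}
    return tuple(sorted(merged.items()))

# pid and host carried on the resource...
via_resource = series_identity(
    {"process.pid": 101, "host.id": "h1"}, {"cpu.mode": "user"})
# ...or added explicitly on each measurement: same series identity.
via_point = series_identity(
    {}, {"process.pid": 101, "host.id": "h1", "cpu.mode": "user"})
```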
> The resource attributes and metrics in the process namespace are intended to map directly to the concept of a process in an operating system. These metrics aren't intended to be reported by a process itself.
I think this should also be mentioned in semconv - that OTel language SIGs are not expected to have process/system instrumentation libraries.
But we have plenty of them already:
As instrumentation libraries they leave it up to the user to configure resources.
Tagging @open-telemetry/javascript-maintainers @open-telemetry/dotnet-contrib-maintainers @open-telemetry/go-maintainers in case they have any thoughts or feedback wrt process vs runtime metrics and the future of process instrumentation.
User experience

What I'm offering seems similar.

By default I'd prefer to have:

If I want more:
- process.* metrics (or enable the pid attribute on all relevant system metrics)
- process.* metrics (or enable the pid attribute on all relevant system metrics). This is a good case to keep process and system metrics separate, because I might not want all processes, and then I can't aggregate. Still, it could be possible to report an "other" pid as a bulk sum of all untracked processes.

So you start from a safe (hopefully documented) default and you add details.
The process vs runtime question still concerns me - we're duplicating cpu/memory metrics by design and forcing users to build different alerts/dashboards for each runtime, whether they care about the differences or not.
I'd prefer the default to be:
Container vs system metrics

They have a certain level of duplication (cpu, memory); the key difference is where you observe these metrics from. As a user I might be able to record both, but effectively I'd need to pick one or the other to build my alerts/dashboards/etc. on.
> But we have plenty of them already
Thanks for this; I was definitely incorrect when I said "These metrics aren't intended to be reported by a process itself". There's clear precedent for it that I wasn't aware of, so my previous statement can be disregarded.
I guess this probably works out most of the time, because the metrics are reported under whatever resource is instrumented, so they are probably typically reported under some manner of application resource that makes it obvious what those metrics are for, even though they aren't under a process resource per se. So I backpedal my previous statement; it makes sense that these metrics could be reported by a process instrumenting itself to read its own stats from the OS.
There is still a difference between these process.* metrics and the associated runtime metrics: where the info is read from. In each of the SDK cases above, the information for the process metrics comes from the OS directly. In the case of jvm metrics, they are read through JMX (at least according to those docs; hopefully that's not off base). So even though there are some duplicate metrics, because the different runtimes usually provide their own ways to get the same information you could get from the OS, they are still distinct from one another due to their source.
So I think they are different, but there probably is still a way for there to just be a memory metric namespace, and the JVM or process instrumentation could use the same memory metric definitions from the shared namespace. I think the challenge is when there are metrics with similar names that mean different things in different contexts. Taking an example like memory.usage:
- reported by OS-level process instrumentation: the memory usage of the process as the OS accounts for it
- reported by a runtime: heap usage stats

In this scenario, the meaning of memory.usage is different depending on the reporting source. I think this is the type of thing that would show up repeatedly if we tried to unify these metrics. If we are okay with finding a nice way to document these based on the reporting source then it could work, but we already have a separation based on reporting source:
- system -> the metric is data about the root system
- process -> the metric is data about the process
- container -> the metric is data about a container
- runtimes -> the metric is data reported by a runtime

Given that these namespaces probably still need to exist due to having certain metrics that won't be shared, it is probably easier in the long run to keep duplicate-named metrics in each namespace, because in some scenarios they mean something quite different depending on the context the particular metric point is reported for.
> There is no way to say that a metric should be reported under a specific resource
That's kind of disappointing, actually. I think I understand why, but it is too bad for the process metrics, which are sort of designed to be reported under a process resource, like how they are currently reported by the collector's hostmetricsreceiver. It's definitely a shortcoming in the current semconv definition of those metrics, though: they don't mention any of those process attributes because they were designed for reporting under a process resource. We (the system semconv group) will have to find some way to add these attributes to the metrics, but I guess make their requirement conditional on the presence of a particular parent resource?
I notice that in the instrumentation examples you provided, they don't add any identifying attribute like process.pid even though the parent resource isn't a process, but they are still effectively identified, provided the manual instrumentation has some resource the user configured themselves. So given that, maybe the attributes are all just added as optional. :thinking:
> The process vs runtime question still concerns me - we're duplicating cpu/memory metrics by design and forcing users to build different alerts/dashboards for each runtime, whether they care about the differences or not.
The example I gave above on the difference between memory usage reported by the Go runtime vs by the OS for the process is one counterexample supporting keeping these things separate. Duplication in names doesn't always imply duplicating the exact same value. Sometimes it does; on Linux, a container runtime reading metrics from cgroup stats usually gets roughly the same numbers as the stats you might get from procfs, for example.
Unfortunately I don't have enough expertise in all the runtimes and their metrics to say whether there are more counterexamples. If this memory-usage counterexample is the only one, or if there are very few, then maybe the unification would be fine and we could deal with the prickly differences one by one.
For what it's worth, we discussed this in the System Semantic Conventions meeting today. We generally agreed that it is still worth keeping the metrics in the system, process, and container namespaces, plus each respective runtime's, due to:
I'd welcome additional feedback from the other @open-telemetry/semconv-system-approvers folks.
In this case, I think the namespace is key to easily identifying similar metrics that have been computed differently because of their source. Even if some signals have the same suffix (e.g. *.cpu.time), they might have different meanings depending on the source. The namespace should identify and explain those differences. For example, a container can be seen as one or multiple processes, but the key difference from "system" processes is the underlying technology that manages them. Containers rely on cgroups, which offer a range of capabilities beyond those provided by traditional kernel process management. For example, a cgroup container's CPU time allotment is determined by dividing the cgroup's cpu_shares by the total number of shares defined on the system. As cpu_shares is specific to cgroups, cgroup capabilities should be taken into account when creating alerts/dashboards for container.cpu.time, unlike process.cpu.time, which is not aware of the same CPU resource-limiting techniques. Also, as cgroups is a newer technology, container.cpu.time could be reported in nanoseconds instead of the current seconds precision (a different metric).
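The cpu_shares arithmetic mentioned above, as a minimal sketch. The share values are made-up examples; 1024 is the conventional cgroups v1 default:

```python
# Under CPU contention, a cgroup's entitled fraction of CPU is its
# cpu_shares divided by the total shares defined on the system.
def cpu_fraction(own_shares, all_shares):
    """Fraction of CPU a cgroup is entitled to when all cgroups are busy."""
    return own_shares / sum(all_shares)

# three cgroups with the default 1024 shares, plus one throttled to 512
fraction = cpu_fraction(512, [1024, 1024, 1024, 512])
```

This relative-entitlement behavior is why container.cpu.time alerts need cgroup awareness that process.cpu.time alerts don't.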
@open-telemetry/semconv-system-approvers is there any conclusion on this that would result in changing the existing model? Otherwise we can close this if there is no majority in favor of these changes.
We have multiple layers of metrics:
- process, which reports OS-level metrics per process as observed by the OS itself
- system metrics that report OS metrics from the OS perspective
- container metrics that are reported by the container runtime about a container

Plus we have attributes in all of these namespaces that have something in common:
Problems:
- While reviewing system.linux.memory.available (https://github.com/open-telemetry/semantic-conventions/pull/1078), it's not clear if we'd expect to have OS-specific metrics in each of the namespaces (container.linux.memory.*, system.linux.*, process.linux.memory.*): https://github.com/open-telemetry/semantic-conventions/pull/1078#discussion_r1638375208
- Isn't system.cpu.time a sum of all process.cpu.time on the same machine?