Revisit process metrics

braydonk commented 10 months ago

Pursuant to #330 which should be merged soon, I am opening this issue to revisit the process metrics schema while we are making breaking changes anyway.

As of writing, the only changes to the process schema will be to namespace every attribute. There are a couple of changes that should be discussed among the Hostmetrics WG to improve the schema.

Following are the current points up for discussion:

Should the attributes be considered exhaustive?
process.threads -> process.thread.count
process.context_switches -> process.context_switch.count
process.network.io.direction, transmit -> send
process.cpu.utilization, do we have a better way to indicate that this metric is intended to be computed from another?

Please comment if you have any additional points to discuss. I will comment my own opinion on each below so I don't bias the root issue post with my own opinions. If any of these need a larger discussion I will spin them into child issues.

braydonk commented 10 months ago

Should the attributes be considered exhaustive?

I think each attribute in the current process metrics schema is exhaustive:

both direction attributes have their respective in and out equivalents
process.paging.faults.type, is there such thing other than a major or minor page fault?
process.context_switches.type, voluntary and involuntary on their own are grammatically exhaustive, and I can't imagine any other value for this.
process.cpu.state is the only one where it could be up in the air for some super niche edge case; I can't think of any CPU state other than the existing system, user, and wait personally but would love to learn if there is something

braydonk commented 10 months ago

`process.threads` -> `process.thread.count`

This was discussed in https://github.com/open-telemetry/semantic-conventions/pull/330/files#r1330094680

While I think it makes sense to merge with ECS where possible, I'm not sold on the rename to process.threads.count for 2 reasons:

Every metric in the process.threads namespace in the linked ECS docs all seem to give information about a single process thread "resource" so to speak. This metric is about the thread count of a process "resource", so I fear this metric being in a process.thread namespace is confusing.
The guidance in #260 could be relevant for this, however this is a non-monotonic counter. Does the guidance in #260 apply in the negative sense as well, where metrics should only be named count iff they are a monotonic counter?

Truthfully I'm not caught up with the plans for where we're merging with ECS and how we reconcile the differences, so I'd be happy to hear from the ECS stakeholders about how we should handle this.

braydonk commented 10 months ago

`process.context_switches` -> `process.context_switch.count`

This is related to #260 guidance; this is a monotonic counter, so process.context_switch.count seems like a logical progression of this metric name to match with that.

braydonk commented 10 months ago

`process.network.io.direction`, `transmit` -> `send`

This is mostly my own idea, I'm not married to it. However, I think send is a better attribute value than transmit for two reasons:

It's a closer logical opposite to receive
It matches with existing nomenclature from socket APIs

braydonk commented 10 months ago

`process.cpu.utilization`, do we have a better way to indicate that this metric is intended to be computed from another?

This metric is computed from process.cpu.time and (insert name of CPU count metric here). Is there some better way to indicate within the semconv generation that this metric is indeed intended to be computed from other known values? At the moment it's only outlined in the description, and not from any actual semantic convention terminology.

ChrsMark commented 7 months ago

Every metric in the process.threads namespace in the linked ECS docs all seem to give information about a single process thread "resource" so to speak. This metric is about the thread count of a process "resource", so I fear this metric being in a process.thread namespace is confusing.

I see the point @braydonk ! I agree that process.threads.count (violates the namespace pluralization rule) or process.threads is the accurate one in that case.

@AlexanderWert @trisch-me feel free to add your input here.

trisch-me commented 7 months ago

I also do agree that existing ECS schema for process.thread is about thread itself and process.threads is about process itself having multiple threads and therefore they are having different semantic meaning. process.thread already exist in ECS and therefore we should be careful when introducing new naming conflicting with it.

Truthfully I'm not caught up with the plans for where we're merging with ECS and how we reconcile the differences, so I'd be happy to hear from the ECS stakeholders about how we should handle this.

We are currently actively adding new fields from ECS into Otel. Please also check this PR about consideration when adding new fields which might exist already in ECS

mx-psi commented 6 months ago

Discussed on January 18th System Semantic Conventions WG meeting, @braydonk to update after discussion in #330

dmitryax commented 2 months ago

@braydonk are we good to proceed on this issue or it's still blocked?

braydonk commented 2 months ago

Yeah this should be good to proceed. Here's the current state of things:

Decisions Made

process.threads -> process.thread.count

Yes
It is an updowncounter so this new name follows the naming rule, done in #330.

process.context_switches -> process.context_switch.count

No
It is a counter so it remains as is to follow the naming rule.

process.cpu.utilization, do we have a better way to indicate that this metric is intended to be computed from another?

No
As long as I'm not mistaken, there isn't a tooling solution for this and the best way seems to be that this is called out in the description of the metric. Currently it sort of is: https://github.com/open-telemetry/semantic-conventions/blob/c316b1fb8747b8384b738806cb5068a9bf839f8c/model/metrics/process-metrics.yaml#L19-L21

Should process.cpu.state be exhaustive?

No
The metric is being merged into a single cpu.state metric, where the list of possible values will be wider than the 3 expected to be used in the process metrics. See #1026.

No Decision Yet

Should the other attributes in this comment be made exhaustive?

I'm tempted to say no since these metrics have now been moved to the registry and it makes less sense to be prescriptive even though these metrics will likely only have these values when used in process metrics.

network.io.direction value transmit -> send

I think I still agree with my reasoning in the comment above, but I'm starting to lean towards not rocking the boat and leaving that where it is. I think I'm splitting hairs too much with it and am fine to leave it as is.

open-telemetry / semantic-conventions