open-telemetry / opentelemetry-collector-contrib

Contrib repository for the OpenTelemetry Collector
https://opentelemetry.io
Apache License 2.0

[k8sclusterreceiver] k8s.node.condition metric not aggregatable in current form #33760

Open sirianni opened 5 months ago

sirianni commented 5 months ago

Component(s)

receiver/k8scluster

What happened?

Description

The use of -1 for ConditionUnknown greatly hinders usability of the k8s.node.condition metric.

For example, it's not possible to get a simple count of ready nodes in a k8s cluster, because each -1 subtracts from the sum: two ready nodes and one unknown node sum to 1, not 2. Such a count would be useful for writing an alert that compares k8s.daemonset.ready_nodes to sum(k8s.node.condition{condition="ready"}).

This is another example of the Splunk team continuing to push the antipattern of using the metric value to encode enumerations. While this may be usable in the Splunk backend, it simply doesn't work well in most other metric systems (Datadog, New Relic, Prometheus, etc.).

This metric should instead be modeled like the kube_node_status_condition metric from kube-state-metrics, which includes status as an attribute, following the OpenMetrics StateSet pattern. This allows queries of the form:

sum by(condition) (kube_node_status_condition{condition="ready", status="true"})
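To illustrate why the StateSet shape aggregates cleanly, here is a rough sketch that expands each node condition into one 0/1 sample per status and then sums them. The helper names (`to_stateset`, `sum_by_status`) and the node names are hypothetical and are not part of the collector or kube-state-metrics.

```python
from collections import defaultdict

# The three statuses a Kubernetes node condition can take.
STATUSES = ("true", "false", "unknown")


def to_stateset(node, condition, current_status):
    """Expand one node condition into one 0/1 sample per possible status."""
    return [
        ({"node": node, "condition": condition, "status": s},
         1 if s == current_status else 0)
        for s in STATUSES
    ]


def sum_by_status(samples, condition):
    """Sum sample values per status for one condition, dropping the node label."""
    totals = defaultdict(int)
    for labels, value in samples:
        if labels["condition"] == condition:
            totals[labels["status"]] += value
    return dict(totals)


samples = []
for node, status in [("node-a", "true"), ("node-b", "true"), ("node-c", "unknown")]:
    samples.extend(to_stateset(node, "ready", status))

ready_counts = sum_by_status(samples, "ready")
print(ready_counts)  # {'true': 2, 'false': 0, 'unknown': 1}
```

Because every sample is 0 or 1, a plain sum filtered on the status attribute is a node count, and it stays correct after the node label is aggregated away.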

Collector version

v0.103.0

github-actions[bot] commented 5 months ago

Pinging code owners:

TylerHelmuth commented 5 months ago

> For example, it's not possible to get a simple count of ready nodes in a k8s cluster

Isn't it possible if you filter by the Condition attribute or by value == 1? I agree that the -1 makes aggregation across all dimensions inaccurate.

sirianni commented 5 months ago

> > For example, it's not possible to get a simple count of ready nodes in a k8s cluster
>
> Isn't it possible if you filter by the Condition attribute or by value == 1?

It's not possible to "filter by value" in many systems (e.g. Datadog). You can only aggregate values (sum, min, max, avg). I suspect one reason is that values can be pre-aggregated over time and space, so you lose the ability to filter at the native ingestion granularity. For example, if I pre-aggregate away the k8s.node.name label, what happens? What if I'm viewing this metric at hourly granularity, so the query is served from a pre-aggregated rollup table?

Filtering by the condition attribute doesn't work either, because the same attribute value, condition: ready, is attached to both the true (1) and unknown (-1) samples.

github-actions[bot] commented 3 months ago

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.
