kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Flux custom metrics monitoring broken in 2.12 #2386

Open foslage opened 1 month ago

foslage commented 1 month ago

We are using Flux CD and have set up custom metrics for monitoring. The config was copied from the flux2-monitoring-example repository.

What happened:

After updating to 2.12.0 we are missing several gotk_resource_info metrics.

The remaining gotk_resource_info metrics are exclusively for the custom resource helmreleases.helm.toolkit.fluxcd.io:

gotk_resource_info{customresource_group="helm.toolkit.fluxcd.io",customresource_kind="HelmRelease", ...}
...

What you expected to happen:

We should also see metrics for other custom resources, like this:

gotk_resource_info{customresource_group="helm.toolkit.fluxcd.io",customresource_kind="HelmRelease", ...}
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmChart", ...}
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository", ...}
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="GitRepository", ...}
...

That's how it was in 2.11 and downgrading to 2.11.0 restores these metrics.
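To compare the two versions quickly, a scrape of the /metrics output can be checked for which kinds still report under a given metric name. The following is a minimal sketch, not part of kube-state-metrics: the helper names, the sample line, and its `name="demo"` label are hypothetical, and the parser only handles simple `name{labels} value` lines.

```python
import re
from collections import defaultdict

# Matches one exposition-format sample line of the form: name{labels} value
SAMPLE_RE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>.*)\}\s+\S+')
# Matches one label pair inside the braces: key="value"
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def kinds_by_metric(text):
    """Map each metric name to the set of customresource_kind values seen."""
    kinds = defaultdict(set)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if not match:
            continue  # skip samples without labels
        labels = dict(LABEL_RE.findall(match.group("labels")))
        kind = labels.get("customresource_kind")
        if kind:
            kinds[match.group("name")].add(kind)
    return dict(kinds)


def missing_kinds(text, metric, expected):
    """Return the expected kinds that do not appear under the given metric."""
    seen = kinds_by_metric(text).get(metric, set())
    return sorted(set(expected) - seen)


# Hypothetical scrape resembling the 2.12.0 behaviour described above.
SAMPLE_2_12 = '''
gotk_resource_info{customresource_group="helm.toolkit.fluxcd.io",customresource_kind="HelmRelease",name="demo"} 1
'''
EXPECTED = ["HelmRelease", "HelmChart", "HelmRepository", "GitRepository"]

print(missing_kinds(SAMPLE_2_12, "gotk_resource_info", EXPECTED))
# prints ['GitRepository', 'HelmChart', 'HelmRepository']
```

Running the same check against a 2.11.0 scrape should return an empty list if all four kinds are present.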

Workaround & possible cause:

It seems the issue arises because the kube-state-metrics config wants to compile all CRD metrics into a single metric name (gotk_resource_info), which is no longer possible in 2.12.

If we use a dedicated metric name for each CRD type, e.g. gotk_resource_info for helmreleases.helm.toolkit.fluxcd.io, gotk_resource_info2 for helmcharts.source.toolkit.fluxcd.io, and so on, the metrics show up correctly:

gotk_resource_info{customresource_group="helm.toolkit.fluxcd.io",customresource_kind="HelmRelease", ...}
gotk_resource_info2{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmChart", ...}
gotk_resource_info3{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository", ...}
gotk_resource_info4{customresource_group="source.toolkit.fluxcd.io",customresource_kind="GitRepository", ...}
...

Since I don't see any mention of this in the release notes, I assume this is not by design. If it were, KSM should output an error message when a config with non-unique names is supplied.
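As an illustration of the kind of check meant here, the sketch below scans an already-parsed custom resource state config for metric names shared by more than one resource. It is not KSM code: the function name is hypothetical, the config is trimmed to the fields needed, and the naming rule (prefix, underscore, name, with "kube_customresource" as the default prefix) is my understanding of how KSM builds names and should be treated as an assumption.

```python
from collections import defaultdict


def shared_metric_names(config):
    """Return {full metric name: [kinds]} for names used by several resources."""
    users = defaultdict(list)
    for resource in config["spec"]["resources"]:
        kind = resource["groupVersionKind"]["kind"]
        # Assumption: an unset prefix falls back to "kube_customresource",
        # and an empty prefix means the bare metric name is used.
        prefix = resource.get("metricNamePrefix", "kube_customresource")
        for metric in resource.get("metrics", []):
            full = f"{prefix}_{metric['name']}" if prefix else metric["name"]
            users[full].append(kind)
    return {name: kinds for name, kinds in users.items() if len(kinds) > 1}


# Trimmed, hypothetical config in the shape of the Flux example:
# two resources that both emit gotk_resource_info.
CONFIG = {
    "spec": {
        "resources": [
            {
                "groupVersionKind": {"group": "helm.toolkit.fluxcd.io", "kind": "HelmRelease"},
                "metricNamePrefix": "gotk",
                "metrics": [{"name": "resource_info"}],
            },
            {
                "groupVersionKind": {"group": "source.toolkit.fluxcd.io", "kind": "HelmChart"},
                "metricNamePrefix": "gotk",
                "metrics": [{"name": "resource_info"}],
            },
        ]
    }
}

for name, kinds in shared_metric_names(CONFIG).items():
    print(f"warning: {name} is defined for multiple resources: {', '.join(kinds)}")
```

With the numbered names from the workaround (resource_info, resource_info2, ...) the function returns an empty dict, so no warning is printed.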

Environment:

dgrisonnet commented 1 month ago

/assign @rexagod
/triage accepted

rexagod commented 1 month ago

I haven't taken a deep look, but does https://github.com/fluxcd/flux2-monitoring-example/issues/32#issuecomment-2059346695 resolve this? If not, let's continue over at https://github.com/kubernetes/kube-state-metrics/issues/2366 so that all discussion of this issue stays in one place, if that makes sense.