Open k15r opened 9 months ago
/triage accepted /assign @CatherineF-dev @rexagod
I came here to open the same issue just to find it's already here.
This issue effectively kills the ability to use `kind: "*"` or `version: "*"` when there are multiple items under the `metrics` of that resource.
Here you can find some example manifests and steps to reproduce the issue: https://gist.github.com/bergerx/adad24dcd7cc360e1f36fbb98407b27b
git clone git@gist.github.com:adad24dcd7cc360e1f36fbb98407b27b.git ksm-2223
minikube start
kubectl apply \
-f ksm-2223/crd-bar.example.com.yaml \
-f ksm-2223/crd-foo.example.com.yaml
kubectl apply \
-f ksm-2223/cr-bar.yaml \
-f ksm-2223/cr-foo.yaml
go run main.go --custom-resource-state-only --custom-resource-state-config-file ksm-2223/custom-resource-config-file.yaml --kubeconfig ~/.kube/config
And here is the output:
$ curl localhost:8080/metrics
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 1.699031755e+09
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 508820
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_group="example.com",customresource_kind="Foo",customresource_version="v1",name="myfoo"} 1.699031755e+09
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Foo",customresource_version="v1",name="myfoo"} 508819
Prometheus-compatible parsers will throw an error like the following on line 8 of the output:
second TYPE line for metric name ... or TYPE reported after samples
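The rule such parsers enforce can be illustrated with a minimal sketch (this is not the actual Prometheus parser, just a simplified check of the same constraint): a second `# TYPE` line for a metric name that has already been declared is invalid in the text exposition format.

```python
# Minimal sketch of the constraint Prometheus-compatible parsers enforce:
# only one "# TYPE" line may exist per metric name in a single exposition.
def find_duplicate_type_lines(exposition):
    """Return metric names that declare "# TYPE" more than once."""
    seen = set()
    duplicates = []
    for line in exposition.splitlines():
        if line.startswith("# TYPE "):
            name = line.split()[2]
            if name in seen:
                duplicates.append(name)
            seen.add(name)
    return duplicates

# Reduced version of the broken output shown above.
broken = """\
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_kind="Bar"} 1.0
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_kind="Bar"} 508820
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_kind="Foo"} 1.0
"""
print(find_duplicate_type_lines(broken))  # -> ['cr_creationtimestamp']
```

A strict parser fails at exactly the point this check flags: the second `# TYPE cr_creationtimestamp` line.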
In the example above it's a single resource definition in the custom-resource-state-config file, but the same issue also happens if the same metric name is used for different GVKs, which I believe is also a valid scenario. For example, we used to have this item under `.spec.resources` repeated for multiple CRDs:
- groupVersionKind:
    group: our.internal.group # we have a copy of this whole thing for each internal group
    kind: "*"
    version: "*"
  labelsFromPath:
    name: [metadata, name]
    namespace: [metadata, namespace]
  metricNamePrefix: "cr"
  metrics:
    - name: status
      each:
        type: Gauge
        gauge:
          path: [status, conditions]
          labelsFromPath:
            type: [type]
          valueFrom: [status]
https://github.com/kubernetes/kube-state-metrics/pull/1810 seems to be related.
I can reproduce this issue (metric values are not put together for one same metric) using https://github.com/kubernetes/kube-state-metrics/issues/2223#issuecomment-1792850276.
$ curl localhost:8089/metrics
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 1.700534671e+09
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 391
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_group="example.com",customresource_kind="Foo",customresource_version="v1",name="myfoo"} 1.700534671e+09
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Foo",customresource_version="v1",name="myfoo"} 392
QQ: I think the issue is that KSM doesn't group samples of the same metric together. Is that correct? cc @bergerx @k15r
I think the issue is here https://github.com/kubernetes/kube-state-metrics/blob/main/internal/store/builder.go#L210
availableStores[gvrString] = func(b *Builder) []cache.Store {
	return b.buildCustomResourceStoresFunc(
		f.Name(),
		f.MetricFamilyGenerators(),
		f.ExpectedType(),
		f.ListWatch,
		b.useAPIServerCache,
	)
}
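If that is the cause, the mechanics would look roughly like this sketch (not KSM's actual code, just an illustration of the suspected behavior): each per-GVR store serializes its own metric families independently, so two stores that happen to share a family name each emit their own `# TYPE` header.

```python
# Sketch of the suspected cause: stores serialize independently, so a
# metric family name shared across stores gets one header per store.
def write_store(family_name, sample_lines):
    """Render one store's output, headers included (as each store does)."""
    return "\n".join([f"# TYPE {family_name} gauge"] + sample_lines)

# Two stores (e.g. one per GVR matched by a wildcard) sharing a family name.
stores = [
    ("cr_resourceversion", ['cr_resourceversion{customresource_kind="Bar"} 391']),
    ("cr_resourceversion", ['cr_resourceversion{customresource_kind="Foo"} 392']),
]
output = "\n".join(write_store(name, samples) for name, samples in stores)
print(output.count("# TYPE cr_resourceversion gauge"))  # -> 2
```

Concatenating store outputs without deduplicating family headers produces exactly the duplicate `# TYPE` blocks seen in the scrapes above.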
@CatherineF-dev Thanks for taking care of this issue.
In my opinion, there are multiple issues shown in your output.
First, it creates duplicate entries for the same metric:
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
It must look like this instead, since "Only one TYPE line may exist for a given metric name":
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
Second, the metric values differ:
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 391
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Foo",customresource_version="v1",name="myfoo"} 392
Here it displays both 391 and 392 for the same metric in a single scrape. It is not clear which one to use: for clients trying to parse this text exposition format, there is no way to identify the correct value.
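The expected behavior can be sketched as follows (again not KSM's actual code, just an illustration): merge samples from all stores by metric family name before writing, so each family gets exactly one `# TYPE` header per scrape.

```python
# Sketch of the expected behavior: group samples by family name first,
# then emit exactly one header per family followed by all its samples.
from collections import OrderedDict

def render_merged(stores):
    """stores: list of lists of (family_name, sample_lines) pairs."""
    families = OrderedDict()
    for store in stores:
        for name, samples in store:
            families.setdefault(name, []).extend(samples)
    lines = []
    for name, samples in families.items():
        lines.append(f"# TYPE {name} gauge")
        lines.extend(samples)
    return "\n".join(lines)

stores = [
    [("cr_resourceversion", ['cr_resourceversion{customresource_kind="Bar"} 391'])],
    [("cr_resourceversion", ['cr_resourceversion{customresource_kind="Foo"} 392'])],
]
merged = render_merged(stores)
print(merged.count("# TYPE cr_resourceversion gauge"))  # -> 1
```

Both samples survive, but under a single `# TYPE` declaration, which is what Prometheus-compatible parsers require.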
Could you please share an ETA (if any) for this bug? We are affected by it for Vertical Pod Autoscaler metrics when multiple containers run in the same pod (kube-state-metrics CRS is configured according to the doc in this PR).
Hi @k15r, could you provide detailed steps to reproduce this issue?
The first issue I want to fix is this:
$ curl localhost:8089/metrics
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 1.701828773e+09
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 909919
# HELP cr_creationtimestamp
# TYPE cr_creationtimestamp gauge
cr_creationtimestamp{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 1.701828773e+09
# HELP cr_resourceversion
# TYPE cr_resourceversion gauge
cr_resourceversion{customresource_group="example.com",customresource_kind="Bar",customresource_version="v1",name="mybar"} 909919
Could you try https://github.com/kubernetes/kube-state-metrics/pull/2257 to see whether the duplicate metric entries caused by repeatedly adding and deleting CustomResourceDefinitions are fixed?
I was just trying this feature on v2.10.1 with `type: StateSet`, and I think I see this issue, or something very similar, or maybe a different issue.
With a config such as:
containers:
- args:
  - --port=8080
  - --resources=certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments
  - --telemetry-port=8081
  - --custom-resource-state-config
  - |
      spec:
        resources:
        - groupVersionKind:
            group: "cluster.x-k8s.io"
            version: "v1beta1"
            kind: "Machine"
          metrics:
          - name: "cunningr"
            help: "Phase of Machines"
            each:
              type: StateSet
              stateSet:
                labelName: phase
                path: ["status","phase"]
                list: ['Provisioned', 'Pending', 'Running', 'Deleting', 'Failed']
Each of my `Machine` instances seems to get its own set of metric series:
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Deleting"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Failed"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Pending"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Provisioned"} 1
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Running"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Deleting"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Failed"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Pending"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Provisioned"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Running"} 1
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Deleting"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Failed"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Pending"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Provisioned"} 1
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Running"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Deleting"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Failed"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Pending"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Provisioned"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Running"} 1
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Deleting"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Failed"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Pending"} 0
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Provisioned"} 1
kube_customresource_cunningr{customresource_group="cluster.x-k8s.io",customresource_kind="Machine",customresource_version="v1beta1",phase="Running"} 0
I would have expected those to be aggregated into a single gauge metric for each state?
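Note that the series in that output carry no per-object label, so every `Machine` emits an identical label set, making the repeated blocks indistinguishable. A hedged sketch of a config that adds the object name as a label via `labelsFromPath` (the same field used in the wildcard config earlier in this thread) would at least make the series unique per Machine, though it does not address the duplicate `# TYPE` headers themselves:

```yaml
spec:
  resources:
  - groupVersionKind:
      group: "cluster.x-k8s.io"
      version: "v1beta1"
      kind: "Machine"
    labelsFromPath:
      name: [metadata, name]   # distinguishes series from different Machines
    metrics:
    - name: "cunningr"
      help: "Phase of Machines"
      each:
        type: StateSet
        stateSet:
          labelName: phase
          path: ["status", "phase"]
          list: ['Provisioned', 'Pending', 'Running', 'Deleting', 'Failed']
```

With a `name` label, each Machine contributes one series per state as StateSet intends, rather than colliding on an identical label set.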
What happened:
This is part of our kube-state-metrics custom resource configuration:
After adding and deleting the corresponding CRD and a CR of its kind, this is part of the /metrics response of kube-state-metrics. As you can see, there are multiple entries for the same metric (# HELP and # TYPE are mentioned three times), and within a single metric block, lines are duplicated.
What you expected to happen:
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know?:
For an advanced version of this bug, create the following configuration in the cluster. This configuration puts the metrics of two different CRs into the same metric (kube_customresource_module_status). Now, if you create both CRDs and a matching CR, and repeatedly create and remove one of the CRDs, you will get output similar to this (here the sample CRD was deleted):
Environment:
- kubectl version: v1.26.7