kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0
5.33k stars 1.99k forks

CustomResourceDefinitions status fields cause spam of errors that cannot be fixed #2482

Open roeoo opened 3 weeks ago

roeoo commented 3 weeks ago

What happened: Spam of errors that look like this:

"kube_customresource_phase" err="[status,phase]: expected value for path to be string, got <nil>"

What you expected to happen:

No errors should be logged. Status fields are not guaranteed to exist when a resource is created. The behavior is also inconsistent with known types, where a default value is used.

How to reproduce it (as minimally and precisely as possible):

Create cr-config.yaml:

kind: CustomResourceStateMetrics
spec:
  resources:
  - groupVersionKind:
      group: samplecontroller.k8s.io
      kind: "Foo"
      version: v1alpha1
    labelsFromPath:
      name: [metadata, name]
      namespace: [metadata, namespace]
    metricNamePrefix: "cr"
    metrics:
    - name: replicas
      each:
        type: Gauge
        gauge:
          path: [status, availableReplicas]
          nilIsZero: true
    - name: test
      each:
        type: StateSet
        stateSet:
          labelName: phase
          path: [status, phase]
          list:
            - Pending
            - Provisioning
            - Provisioned
            - Running
            - Deleting
            - Deleted
            - Failed
            - Unknown

Create a CRD with a status field and a valid object. Do not run a controller (this is one of several possible scenarios).

kubectl apply -f https://raw.githubusercontent.com/kubernetes/sample-controller/master/artifacts/examples/crd.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes/sample-controller/master/artifacts/examples/example-foo.yaml

Run:

go run main.go --custom-resource-state-only --custom-resource-state-config-file cr-config.yaml --kubeconfig ~/.kube/config

The error repeats for every instance of a resource, and there can be thousands of such resources.

registry_factory.go:685] "cr_test" err="[status,phase]: expected value for path to be string, got <nil>"

Anything else we need to know?:

I believe this is a general problem for all CRDs and all status fields. Since there can be many differing objects, the error isn't helpful enough. It might be more useful to log this only in verbose mode, together with the resource name and kind.

Environment: kind or any other Kubernetes cluster

roeoo commented 3 weeks ago

Actually, nilIsZero might be a better solution, since it gives users some control. It would set all states to zero by default, similar to gauges.

roeoo commented 2 weeks ago

After some more thinking, nilIsZero makes sense only when used with a singular gauge where all labels are known upfront. For generated labels, there is no way to create a sensible metric when some labels are missing.

Thus a series that does not exist should be treated as a normal case, not an error.
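The two behaviors discussed in these comments can be sketched in Go as follows. This is an illustration under assumptions, not the kube-state-metrics implementation: `sample`, `lookup`, and `stateSetSamples` are hypothetical names, and the `nilIsZero` parameter for state sets is the proposal from this thread rather than an existing option. A missing value yields no series (or all zeros when `nilIsZero` is set), while a value of the wrong type is still an error.

```go
package main

import "fmt"

// sample is one series of a StateSet metric: a state label and 0 or 1.
type sample struct {
	State string
	Value float64
}

// lookup walks nested maps along path and returns nil when any segment is missing.
func lookup(obj map[string]any, path ...string) any {
	var cur any = obj
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil
		}
		cur = m[key]
	}
	return cur
}

// stateSetSamples returns one sample per known state.
// A missing value yields no samples (or all zeros when nilIsZero is set)
// instead of an error; a non-string value is still reported as an error.
func stateSetSamples(obj map[string]any, path, states []string, nilIsZero bool) ([]sample, error) {
	v := lookup(obj, path...)
	if v == nil && !nilIsZero {
		return nil, nil // normal case: the status field is not populated yet
	}
	current, ok := v.(string)
	if v != nil && !ok {
		return nil, fmt.Errorf("%v: expected value for path to be string, got %T", path, v)
	}
	out := make([]sample, 0, len(states))
	for _, s := range states {
		val := 0.0
		if v != nil && s == current {
			val = 1
		}
		out = append(out, sample{State: s, Value: val})
	}
	return out, nil
}

func main() {
	states := []string{"Pending", "Running", "Failed"}
	path := []string{"status", "phase"}

	// fresh has no status yet, as when no controller is running.
	fresh := map[string]any{"metadata": map[string]any{"name": "example-foo"}}
	running := map[string]any{"status": map[string]any{"phase": "Running"}}

	for _, obj := range []map[string]any{fresh, running} {
		samples, err := stateSetSamples(obj, path, states, false)
		fmt.Println(samples, err)
	}
}
```

Keeping the type-mismatch error while dropping the nil error separates a real misconfiguration from an object whose status simply has not been written yet.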

dashpole commented 1 week ago

/assign @rexagod
/triage accepted