strimzi / strimzi-kafka-operator

Apache Kafka® running on Kubernetes
https://strimzi.io/
Apache License 2.0

[Enhancement]: Monitoring of custom resources #10276

Open sebastiangaiser opened 2 months ago

sebastiangaiser commented 2 months ago

Related problem

In https://github.com/strimzi/strimzi-kafka-operator/issues/9802 there was a discussion about how to monitor the state of custom resources (CRs) such as KafkaTopic, KafkaUser, and others.

Suggested solution

There was a suggestion to use kube-state-metrics (ksm) to monitor the state of each CR. For deploying kube-state-metrics, I would personally recommend the kube-state-metrics Helm chart, which is already part of kube-prometheus-stack (as a sub-chart).
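As a rough illustration of the Helm route, the values below sketch how the kube-state-metrics sub-chart of kube-prometheus-stack might be configured to watch a Strimzi CR. The value keys (`rbac.extraRules`, `customResourceState`) are assumed from the kube-state-metrics chart and may differ between chart versions. The API version `v1beta2`, the metric prefix and the label names are assumptions made for this example, not something decided in this issue.

```yaml
# values.yaml for kube-prometheus-stack (sketch, assumptions noted above)
kube-state-metrics:
  rbac:
    # ksm needs permission to list and watch the custom resources
    extraRules:
      - apiGroups: ["kafka.strimzi.io"]
        resources: ["kafkatopics"]
        verbs: ["list", "watch"]
  customResourceState:
    enabled: true
    config:
      kind: CustomResourceStateMetrics
      spec:
        resources:
          - groupVersionKind:
              group: kafka.strimzi.io
              version: v1beta2          # assumed API version
              kind: KafkaTopic
            metricNamePrefix: strimzi_kafka_topic   # hypothetical prefix
            metrics:
              - name: resource_info
                help: "Information about a KafkaTopic resource"
                each:
                  type: Info
                  info:
                    labelsFromPath:
                      topic_name: [status, topicName]
                labelsFromPath:
                  name: [metadata, name]
                  exported_namespace: [metadata, namespace]
```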

This approach relies on the status object of each Kubernetes resource, which ksm reads to generate the metrics.
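For context, this is roughly what ksm would read from a KafkaTopic. The resource below is a hand-written sketch: the names, the timestamp and the field values are made up for illustration, and the exact set of status fields depends on the Strimzi version.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic                     # illustrative name
  labels:
    strimzi.io/cluster: my-cluster   # illustrative cluster name
spec:
  partitions: 3
  replicas: 3
status:
  # ksm can turn the Ready condition into a metric value
  conditions:
    - type: Ready
      status: "True"
      lastTransitionTime: "2024-07-25T10:00:00Z"   # illustrative timestamp
  observedGeneration: 1
  topicName: my-topic
```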

Alternatives

No response

Additional context

Flux also handles this via kube-state-metrics. Flux documentation: https://fluxcd.io/flux/monitoring/custom-metrics/ Flux example repository: https://github.com/fluxcd/flux2-monitoring-example

scholzj commented 1 month ago

Discussed on the community call on 10.7.2024: Looks good to Jakub & Jakub 😉. But should be discussed on the next call with hopefully more maintainers present.

im-konge commented 1 month ago

Discussed on the community call on 25.07.2024: I will check the installation and how we can get the metrics and we will discuss it more next time.

im-konge commented 1 month ago

I did some investigation, but I don't yet have a list of all the options that we can use. If that makes sense, I would keep it for another community call (as I may not be able to join today). Sorry for the inconvenience.

sebastiangaiser commented 1 month ago

@im-konge thank you for investigating.

> I don't have a list of all options that we can use

What do you mean by that? If you need input feel free to ping me.

im-konge commented 1 month ago

No problem.

> What do you mean by that? If you need input feel free to ping me.

IIRC the questions on the community call were about how we will continue with this, the installation options etc. That's what I meant by that sentence.

So I have now tried it out some more, and yes, one option is to use the Helm chart as you showed in the PR (#10277). Alternatively, because we mainly use plain YAML files in our examples, we can go with a Deployment as described in https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/extend/customresourcestate-metrics.md .

From here, we need to pass the config to the app, which can be done using --custom-resource-state-config and passing the multiline YAML configuration as shown in https://github.com/kubernetes/kube-state-metrics/blob/main/docs/metrics/extend/customresourcestate-metrics.md#configuration:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: kube-state-metrics
        args:
          - --custom-resource-state-config
          # in YAML files, | allows a multi-line string to be passed as a flag value
          # see https://yaml-multiline.info
          -  |
              spec:
                resources:
                  - groupVersionKind:
                      group: myteam.io
                      version: "v1"
                      kind: Foo
                    metrics:
                      - name: active_count
                        help: "Count of active Foo"
                        each:
                          type: Gauge
                          ...
```

or, which at least from my point of view looks better in terms of maintainability, we can create a ConfigMap with the config itself: https://gist.github.com/im-konge/cbbe1212dfd7a194e6e7b1f421d0d0ed

and then mount it into the Deployment. Once it is mounted, we just need to use --custom-resource-state-config-file with the path to the mounted config.yaml file.
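To make the ConfigMap option concrete, here is a minimal sketch of how the two pieces could fit together. It is not the content of the gist linked above. The ConfigMap name, the volume name, the mount path, the metric prefix and the `v1beta2` API version are all assumptions chosen for this example. The Deployment part is only a fragment meant to be merged into an existing kube-state-metrics Deployment, so the image, selector and other fields are left out.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: strimzi-kube-state-metrics-config   # hypothetical name
  namespace: kube-system
data:
  config.yaml: |
    kind: CustomResourceStateMetrics
    spec:
      resources:
        - groupVersionKind:
            group: kafka.strimzi.io
            version: v1beta2                 # assumed API version
            kind: KafkaTopic
          metricNamePrefix: strimzi_kafka_topic   # hypothetical prefix
          metrics:
            - name: ready
              help: "Ready condition of the KafkaTopic (1 = True, 0 = False)"
              each:
                type: Gauge
                gauge:
                  # one series per entry in status.conditions
                  path: [status, conditions]
                  labelsFromPath:
                    type: [type]
                  valueFrom: [status]
              labelsFromPath:
                name: [metadata, name]
                exported_namespace: [metadata, namespace]
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  template:
    spec:
      containers:
      - name: kube-state-metrics
        args:
          # path must match the mountPath plus the ConfigMap key below
          - --custom-resource-state-config-file=/etc/customresourcestate/config.yaml
        volumeMounts:
          - name: custom-resource-state-config
            mountPath: /etc/customresourcestate
            readOnly: true
      volumes:
        - name: custom-resource-state-config
          configMap:
            name: strimzi-kube-state-metrics-config
```

With this layout the metric definitions can be changed by editing only the ConfigMap. The ksm service account would still need RBAC rules allowing it to list and watch the Strimzi resources.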

I guess this follows the way we already provide some other metrics in our examples (keeping the metrics configuration in a ConfigMap).

And thanks @sebastiangaiser for the examples, it really helped when trying these deployment options :).

sebastiangaiser commented 1 month ago

You're welcome, thank you for investigating and explaining :D Feel free to hijack/supersede my PR. I'm really looking forward to seeing this topic solved.

scholzj commented 2 days ago

Discussed on the community call on 5.9.2024: We did not really have much time to look into this in detail, but there seem to be many open points and unknowns. So this should probably go through a proposal. The proposal should cover things such as:

Ideally, this should all be covered by the proposal.

@sebastiangaiser Would you be interested in putting together the proposal for this? If not, someone else might have a look into it.