kyma-project / telemetry-manager

Manager for the Kyma telemetry module
https://kyma-project.io/#/telemetry-manager/user/README
Apache License 2.0

Telemetry module status as metric input to enable dashboarding and alerting on it #728

Open a-thaler opened 8 months ago

a-thaler commented 8 months ago

Update: The scope was adjusted so that this epic exposes only data about the telemetry module, not about all modules. As soon as it is clarified how to get the dynamic list of modules and their status, a follow-up issue (https://github.com/kyma-project/telemetry-manager/issues/1389) covers support for all modules. The implementation here will already be generic enough to support more modules.

Problem

Every module in Kyma must report a status in some way that users can inspect. A module can already expose custom metrics on its components and mark them as scrapable with the prometheus.io/scrape annotation, so that users have a chance to get insights. With that approach, modules can expose advanced metrics, but users need to know these metrics and be able to define thresholds in order to define alerts. For the less advanced scenario, it would be helpful to have metrics that are harmonized across all modules and have a very simple threshold like "error" or "no error". That simple metric should be available even if a module does not yet care about metric exposure. Users need a way to collect these metrics so that they can define a unified dashboard and alert rules in their backend.

Criteria

Idea

Every module currently must reflect its current state in the status of its module CR by having a "state". It is also recommended to have more advanced "conditions" with reasons in the status, as for example in telemetry:

  status:
    conditions:
    - lastTransitionTime: "2024-01-18T09:45:25Z"
      message: Fluent Bit DaemonSet is ready
      reason: FluentBitDaemonSetReady
      status: "True"
      type: LogComponentsHealthy
    - lastTransitionTime: "2024-01-17T21:09:22Z"
      message: Trace gateway Deployment is ready
      reason: TraceGatewayDeploymentReady
      status: "True"
      type: TraceComponentsHealthy
    - lastTransitionTime: "2024-01-16T14:44:54Z"
      message: One or more referenced Secrets are missing
      reason: MetricPipelineReferencedSecretMissing
      status: "False"
      type: MetricComponentsHealthy
    state: Warning

The state of the module is also reflected in the Kyma CR itself, together with the overall Kyma state, as shown in this shortened example:

  status:
    activeChannel: fast
    conditions:
    - lastTransitionTime: "2024-01-18T12:22:14Z"
      message: not all modules are in ready state
      reason: Ready
      status: "False"
      type: Modules
    modules:
    - channel: experimental
      fqdn: kyma-project.io/module/telemetry
      name: telemetry
      state: Warning
      version: 1.7.0-dev
      resource:
        apiVersion: operator.kyma-project.io/v1alpha1
        kind: Telemetry
        metadata:
          name: default
          namespace: kyma-system
    state: Warning

Reflecting that status information through custom module metrics would require additional effort and a harmonized approach (metric syntax and semantics) across all modules, which will be very hard to achieve.

Instead, we could offer a dedicated input for a MetricPipeline that provides metrics for the Kyma state itself and the state of all modules, based on the Kyma CR, plus metrics representing the individual module conditions. The metrics will be gauges with simple values of 0 or 1 for easy alerting. The relation to the module CRs in use is already available via the Kyma status.

An example pipeline can look like this:

apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: icke
spec:
  input:
    kyma:
      enabled: true

Example metrics can look like this:

kyma_status_state{version="v1beta2", state="running"|"warning"|"error"} = 1
kyma_status_modules_state{version="v1beta2", state="running"|"warning"|"error"} = 1
kyma_telemetry_status_conditions{version="v1alpha1", type="LogComponentsHealthy", reason="Running"} = 1

Items:

a-thaler commented 8 months ago

A simple test using kube-state-metrics (KSM) showed that metrics can be emitted in a consistent way across all modules. The following kube-state-metrics configuration was used:

```yaml
customResourceState:
  enabled: true
  config:
    kind: CustomResourceStateMetrics
    spec:
      resources:
        - groupVersionKind:
            group: "operator.kyma-project.io"
            kind: "Kyma"
            version: "v1beta2"
          labelsFromPath:
            name: [metadata, name]
            namespace: [metadata, namespace]
          metrics:
            - name: kyma_status_state
              help: "current state of kyma"
              each:
                type: StateSet
                stateSet:
                  labelName: state
                  path: [status,state]
                  list: [Error, Processing, Ready, Deleting, Warning]
            - name: kyma_status_modules_state
              help: "current module states"
              each:
                type: StateSet
                stateSet:
                  labelName: state
                  valueFrom: [state]
                  path: [status, modules]
                  labelsFromPath:
                    module: [name]
                  list: [Error, Processing, Ready, Deleting, Warning]
        - groupVersionKind:
            group: "operator.kyma-project.io"
            kind: "*"
            version: "*"
          labelsFromPath:
            name: [metadata, name]
            namespace: [metadata, namespace]
          metrics:
            - name: module_status_conditions
              help: "conditions of Module CR"
              each:
                type: Gauge
                gauge:
                  path: [status, conditions]
                  labelsFromPath:
                    type: [type]
                    reason: [reason]
                  valueFrom: [status]
```

Running KSM with that config exposed the following metrics:

```text
# HELP kube_customresource_module_status_conditions conditions of Module CR
# TYPE kube_customresource_module_status_conditions gauge
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="ApplicationConnector",customresource_version="v1alpha1",name="applicationconnector-sample",namespace="kyma-system",reason="Verified",type="Installed"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="BtpOperator",customresource_version="v1alpha1",name="btpoperator",namespace="kyma-system",reason="ReconcileSucceeded",type="Ready"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Eventing",customresource_version="v1alpha1",name="eventing",namespace="kyma-system",reason="Available",type="NATSAvailable"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Eventing",customresource_version="v1alpha1",name="eventing",namespace="kyma-system",reason="Deployed",type="PublisherProxyReady"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Eventing",customresource_version="v1alpha1",name="eventing",namespace="kyma-system",reason="Ready",type="WebhookReady"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Keda",customresource_version="v1alpha1",name="default",namespace="kyma-system",reason="Verified",type="Installed"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="NATS",customresource_version="v1alpha1",name="eventing-nats",namespace="kyma-system",reason="Available",type="StatefulSet"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="NATS",customresource_version="v1alpha1",name="eventing-nats",namespace="kyma-system",reason="Deployed",type="Available"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Serverless",customresource_version="v1alpha1",name="default",namespace="kyma-system",reason="Configured",type="Configured"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Serverless",customresource_version="v1alpha1",name="default",namespace="kyma-system",reason="Installed",type="Installed"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Telemetry",customresource_version="v1alpha1",name="default",namespace="kyma-system",reason="FluentBitDaemonSetReady",type="LogComponentsHealthy"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Telemetry",customresource_version="v1alpha1",name="default",namespace="kyma-system",reason="MetricPipelineReferencedSecretMissing",type="MetricComponentsHealthy"} 0
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Telemetry",customresource_version="v1alpha1",name="default",namespace="kyma-system",reason="TraceGatewayDeploymentReady",type="TraceComponentsHealthy"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta1",name="default",namespace="kyma-system",reason="Ready",type="ModuleCatalog"} 1
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta1",name="default",namespace="kyma-system",reason="Ready",type="Modules"} 0
kube_customresource_module_status_conditions{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta1",name="default",namespace="kyma-system",reason="Ready",type="SKRWebhook"} 1
# HELP kube_customresource_kyma_status_state current state of kyma
# TYPE kube_customresource_kyma_status_state stateset
kube_customresource_kyma_status_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",name="default",namespace="kyma-system",state="Ready"} 0
kube_customresource_kyma_status_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",name="default",namespace="kyma-system",state="Warning"} 1
# HELP kube_customresource_kyma_status_modules_state current module states
# TYPE kube_customresource_kyma_status_modules_state stateset
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="api-gateway",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="api-gateway",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="api-gateway",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="api-gateway",name="default",namespace="kyma-system",state="Ready"} 1
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="api-gateway",name="default",namespace="kyma-system",state="Warning"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="application-connector",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="application-connector",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="application-connector",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="application-connector",name="default",namespace="kyma-system",state="Ready"} 1
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="application-connector",name="default",namespace="kyma-system",state="Warning"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="btp-operator",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="btp-operator",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="btp-operator",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="btp-operator",name="default",namespace="kyma-system",state="Ready"} 1
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="btp-operator",name="default",namespace="kyma-system",state="Warning"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="eventing",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="eventing",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="eventing",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="eventing",name="default",namespace="kyma-system",state="Ready"} 1
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="eventing",name="default",namespace="kyma-system",state="Warning"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="istio",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="istio",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="istio",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="istio",name="default",namespace="kyma-system",state="Ready"} 1
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="istio",name="default",namespace="kyma-system",state="Warning"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="keda",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="keda",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="keda",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="keda",name="default",namespace="kyma-system",state="Ready"} 1
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="keda",name="default",namespace="kyma-system",state="Warning"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="nats",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="nats",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="nats",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="nats",name="default",namespace="kyma-system",state="Ready"} 1
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="nats",name="default",namespace="kyma-system",state="Warning"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="serverless",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="serverless",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="serverless",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="serverless",name="default",namespace="kyma-system",state="Ready"} 1
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="serverless",name="default",namespace="kyma-system",state="Warning"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="telemetry",name="default",namespace="kyma-system",state="Deleting"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="telemetry",name="default",namespace="kyma-system",state="Error"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="telemetry",name="default",namespace="kyma-system",state="Processing"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="telemetry",name="default",namespace="kyma-system",state="Ready"} 0
kube_customresource_kyma_status_modules_state{customresource_group="operator.kyma-project.io",customresource_kind="Kyma",customresource_version="v1beta2",module="telemetry",name="default",namespace="kyma-system",state="Warning"} 1
```

Here, we could also use a gauge instead of a stateset, so that the states are not differentiated and there is just an aggregated "error" or "no error".
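To make the difference concrete, here is a minimal Python sketch of the two encodings. It is not kube-state-metrics code; the function names are hypothetical, and treating only Ready as "no error" is an assumption made for illustration.

```python
# States as listed in the kube-state-metrics configuration above.
STATES = ["Error", "Processing", "Ready", "Deleting", "Warning"]


def as_stateset(current):
    """Stateset encoding: one sample per known state, 1 only for the current one."""
    return {state: int(state == current) for state in STATES}


def as_gauge(current, healthy=("Ready",)):
    """Aggregated encoding: a single sample, 1 for "no error", 0 otherwise.

    Which states count as healthy is an assumption of this sketch.
    """
    return int(current in healthy)


if __name__ == "__main__":
    print(as_stateset("Warning"))
    print(as_gauge("Warning"))
```

With the stateset, an alert has to pick the state label it cares about; with the gauge, a single comparison against 1 is enough, at the cost of losing the distinction between Warning, Error, and the transitional states.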

A very simple dashboard in Cloud Logging based on this data:

[Screenshot 2024-01-19 at 13 21 46]

a-thaler commented 8 months ago

In the otel-collector community, the counterpart to KSM is the k8sclusterreceiver, which already has good metric coverage. However, there is no general solution yet for scraping CRD-specific metrics comparable to KSM. If we go with the outlined idea, we need to decide whether to deploy KSM just for that use case or to implement a custom receiver for now. We could start writing a generic receiver and try to contribute it as well.

github-actions[bot] commented 6 months ago

This issue has been automatically marked as stale due to the lack of recent activity. It will soon be closed if no further activity occurs. Thank you for your contributions.

chrkl commented 4 months ago

The following extension for the MetricPipeline input section was proposed in the developed concept:

apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: sample
spec:
  input:
    kyma:
      enabled: true
      modules:
        - telemetry

Enabling the input should produce the following metrics:

- kyma.module.status.state with the attributes state and name, which has the value 1 if the module state is Ready
- kyma.module.status.condition with the attributes reason, status, name, and type, which has the value 1 if the status of the corresponding condition is True

The name attribute of both metrics indicates the module name.
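The mapping from a module status to these two metrics can be sketched in Python as follows. This is only an illustration of the semantics described above, not the telemetry-manager implementation; the function `module_metrics` and the dict layout of a data point are hypothetical.

```python
def module_metrics(name, status):
    """Build data points for one module from its CR status (a plain dict)."""
    state = status.get("state", "")
    points = [{
        "metric": "kyma.module.status.state",
        "attributes": {"name": name, "state": state},
        "value": int(state == "Ready"),  # 1 only if the module is Ready
    }]
    for cond in status.get("conditions", []):
        points.append({
            "metric": "kyma.module.status.condition",
            "attributes": {
                "name": name,
                "type": cond["type"],
                "reason": cond["reason"],
                "status": cond["status"],
            },
            "value": int(cond["status"] == "True"),  # 1 only if the condition is True
        })
    return points


# Shortened version of the Telemetry status shown earlier in this issue.
telemetry_status = {
    "state": "Warning",
    "conditions": [
        {"type": "LogComponentsHealthy", "reason": "FluentBitDaemonSetReady", "status": "True"},
        {"type": "MetricComponentsHealthy", "reason": "MetricPipelineReferencedSecretMissing", "status": "False"},
    ],
}

if __name__ == "__main__":
    for point in module_metrics("telemetry", telemetry_status):
        print(point["metric"], point["attributes"], point["value"])
```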

a-thaler commented 3 months ago

The conceptual phase is finished and we will start working on the topic. Target is Q3/24.

a-thaler commented 1 month ago

One problem that came up while putting the final pieces together is the RBAC settings. To access all modules dynamically, the manager would require "list" permissions on all resources (those originating from CRDs, not standard K8S types) at ClusterRole scope.

Furthermore, it is currently unclear what the future of the module status is, where to retrieve the list of available modules, and where to find their status. Until that is sorted out, we will continue with the feature by focusing only on the telemetry module. Here, the contract is under our control and the RBAC is fully fine (it does not require any wildcard).
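The two RBAC shapes can be contrasted with a small Python sketch that uses plain dicts mirroring the apiGroups, resources, and verbs fields of a Kubernetes PolicyRule. The concrete values are assumptions for illustration, in particular the API group used for the wildcard rule, the resource name `telemetries`, and the verb lists; they are not taken from the actual manifests.

```python
# Assumed rule for listing every module CR dynamically: wildcard on resources.
all_modules_rule = {
    "apiGroups": ["operator.kyma-project.io"],
    "resources": ["*"],
    "verbs": ["list"],
}

# Assumed rule when only the telemetry module is supported: no wildcard needed.
telemetry_only_rule = {
    "apiGroups": ["operator.kyma-project.io"],
    "resources": ["telemetries"],
    "verbs": ["get", "list", "watch"],
}


def uses_wildcard(rule):
    """Return True if any of the rule's fields contains the "*" wildcard."""
    return any("*" in rule.get(key, []) for key in ("apiGroups", "resources", "verbs"))


if __name__ == "__main__":
    print(uses_wildcard(all_modules_rule))
    print(uses_wildcard(telemetry_only_rule))
```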

a-thaler commented 3 weeks ago

We agreed on the following points:

With that, the following items need to be done to finish that epic: