A simple test using kube-state-metrics (KSM) showed that metrics can be emitted consistently across all modules. For that, the following kube-state-metrics configuration was used:
Running KSM with that config exposed the following metrics:
Here, we could also use a gauge instead of a StateSet, so that the individual states are not differentiated and there is just an aggregated "error" or "no error" value.
A very simple dashboard in Cloud Logging based on that data:
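To make the StateSet-versus-gauge trade-off concrete, here is a minimal Python sketch of the two metric shapes. It is not the KSM implementation: the state list in `MODULE_STATES` is an assumption, and `stateset_series` and `error_gauge` are hypothetical helpers. A StateSet emits one series per known state with value 1 for the current state, whereas the aggregated gauge collapses everything into a single 0/1 series per module.

```python
# Assumed set of module states; the authoritative list comes from the Kyma
# module contract, not from this sketch.
MODULE_STATES = ("Ready", "Processing", "Warning", "Error", "Deleting")


def stateset_series(module: str, state: str) -> dict[tuple[str, str], int]:
    """StateSet shape: one series per known state, 1 for the current state, 0 otherwise."""
    return {(module, s): int(s == state) for s in MODULE_STATES}


def error_gauge(module: str, state: str) -> int:
    """Aggregated gauge shape: a single series per module, 1 for "error", 0 for "no error".

    Assumption: only the "Error" state counts as an error. The module name is
    accepted only to mirror stateset_series; in a real exporter it would become
    the series label.
    """
    return int(state == "Error")
```

With the StateSet shape, an alert has to select the series for a specific state (for example the one labeled `Error`); with the aggregated gauge, a plain `> 0` threshold is enough, at the cost of losing which state the module is in.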
In the otel-collector community, the receiver analogous to KSM is the k8sclusterreceiver, which already has good metric coverage. However, there is no general solution yet for scraping CRD-specific metrics comparable to KSM. If we go with the outlined idea, we need to decide whether to deploy KSM just for that use case or to implement a custom receiver for now. We could start writing a generic receiver for that and try to contribute it upstream as well.
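One way to read "generic receiver" is a receiver driven purely by configuration, similar in spirit to KSM's path-based custom resource configuration: each entry names a CR kind, a field path, and the value that counts as healthy. The following Python sketch illustrates that idea under those assumptions; `MetricSpec`, `lookup`, and `collect` are hypothetical names and do not correspond to an existing otel-collector receiver API.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class MetricSpec:
    """Hypothetical config entry for a generic CR metrics receiver."""

    kind: str  # custom resource kind the entry applies to
    metric: str  # name of the emitted gauge
    path: tuple[str, ...]  # path to the observed field, e.g. ("status", "state")
    healthy_value: str  # field value mapped to 1; any other value maps to 0
    label: str = "value"  # attribute that carries the raw field value


def lookup(obj: dict[str, Any], path: tuple[str, ...]) -> Optional[Any]:
    """Walk nested dicts along path; return None if any element is missing."""
    current: Any = obj
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def collect(
    objects: list[dict[str, Any]], specs: list[MetricSpec]
) -> list[tuple[str, dict[str, str], int]]:
    """Apply every matching spec to every object and emit 0/1 gauges."""
    out: list[tuple[str, dict[str, str], int]] = []
    for obj in objects:
        name = lookup(obj, ("metadata", "name")) or ""
        for spec in specs:
            if obj.get("kind") != spec.kind:
                continue
            raw = lookup(obj, spec.path)
            if raw is None:
                continue  # field absent: emit nothing instead of a misleading 0
            attrs = {"name": str(name), spec.label: str(raw)}
            out.append((spec.metric, attrs, int(raw == spec.healthy_value)))
    return out
```

Because the mapping lives in configuration, supporting another module would mean adding a `MetricSpec` entry rather than writing module-specific code. A missing field yields no series instead of a 0, so that "unknown" is not reported as "unhealthy".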
The following extension of the MetricPipeline `input` section was proposed in the concept:
```yaml
apiVersion: telemetry.kyma-project.io/v1alpha1
kind: MetricPipeline
metadata:
  name: sample
spec:
  input:
    kyma:
      enabled: true
      modules:
        - telemetry
```
Enabling the input should produce the following metrics:
- `kyma.module.status.state` with the attributes `state` and `name`, which has the value 1 if the module state is `Ready`, and 0 otherwise.
- `kyma.module.status.condition` with the attributes `reason`, `status`, `name`, and `type`, which has the value 1 if the status of the corresponding condition is `True`, and 0 otherwise.
The `name` attribute of both metrics indicates the module name.
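The metric semantics above can be sketched in a few lines of Python. This is an illustration under assumptions, not the telemetry-manager implementation: the module status is modeled with hypothetical `ModuleStatus` and `Condition` dataclasses, conditions follow the Kubernetes convention where `status` is `"True"`, `"False"`, or `"Unknown"`, and each metric is returned as a `(name, attributes, value)` tuple.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    type: str
    status: str  # "True", "False" or "Unknown", following the Kubernetes condition convention
    reason: str


@dataclass
class ModuleStatus:
    name: str  # module name, e.g. "telemetry"
    state: str  # value of status.state in the module CR
    conditions: list[Condition] = field(default_factory=list)


Metric = tuple[str, dict[str, str], int]  # (metric name, attributes, gauge value)


def module_metrics(module: ModuleStatus) -> list[Metric]:
    """Derive the two proposed gauges from a module CR status."""
    metrics: list[Metric] = [
        (
            "kyma.module.status.state",
            {"name": module.name, "state": module.state},
            int(module.state == "Ready"),  # 1 only for the Ready state
        )
    ]
    for cond in module.conditions:
        metrics.append(
            (
                "kyma.module.status.condition",
                {"name": module.name, "type": cond.type, "status": cond.status, "reason": cond.reason},
                int(cond.status == "True"),  # 1 only if the condition holds
            )
        )
    return metrics
```

For a telemetry module in state `Warning` with one `True` and one `False` condition, this yields one `kyma.module.status.state` series with value 0 and two `kyma.module.status.condition` series with values 1 and 0.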
The conceptual phase is finished, and we will start working on the topic. The target is Q3/24.
One problem that surfaced while putting the final pieces together is the RBAC configuration. To access all modules dynamically, the manager would require `list` permissions on all resources (those originating from CRDs, not standard Kubernetes types) granted cluster-wide via a ClusterRole.
Furthermore, it is currently unclear what the future of the module status will be, where to retrieve the list of available modules from, and where to find their status. Until that is sorted out, we will continue with the feature by focusing only on the telemetry module. There, we control the contract, and the RBAC setup is fine because it does not require any wildcard.
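The difference between the two RBAC options can be illustrated with a simplified model of how Kubernetes matches a request against a policy rule, where `*` in a rule field matches any value. This Python sketch ignores `resourceNames`, subresources, and non-resource URLs; `PolicyRule`, `allows`, and the two rule constants are hypothetical, and the Telemetry CRD group `operator.kyma-project.io` with plural `telemetries` is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyRule:
    """Simplified Kubernetes RBAC rule (resourceNames and nonResourceURLs are ignored)."""

    api_groups: tuple[str, ...]
    resources: tuple[str, ...]
    verbs: tuple[str, ...]


def _matches(allowed: tuple[str, ...], value: str) -> bool:
    # In RBAC, "*" in a rule field matches any value.
    return "*" in allowed or value in allowed


def allows(rule: PolicyRule, api_group: str, resource: str, verb: str) -> bool:
    """Return True if the rule grants verb on resource in api_group."""
    return (
        _matches(rule.api_groups, api_group)
        and _matches(rule.resources, resource)
        and _matches(rule.verbs, verb)
    )


# Dynamic discovery of all module CRs needs a wildcard rule in a ClusterRole.
WILDCARD_RULE = PolicyRule(api_groups=("*",), resources=("*",), verbs=("list",))

# Telemetry-only scope (assumed group and plural name of the Telemetry CRD).
TELEMETRY_RULE = PolicyRule(
    api_groups=("operator.kyma-project.io",), resources=("telemetries",), verbs=("list",)
)
```

Note that RBAC has no notion of "resources that originate from CRDs". A rule with `*` for API groups and resources, such as `WILDCARD_RULE`, also matches built-in types, so `allows(WILDCARD_RULE, "apps", "deployments", "list")` returns `True`, while `TELEMETRY_RULE` only matches the Telemetry CR.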
We agreed on the following points:
With that, the following items need to be done to finish that epic:
Rolled out with 1.25.0
Problem
Every module in Kyma must report a status in some way that users can introspect. A module can already expose custom metrics on its components and mark them as scrapable with the `prometheus.io/scrape` annotation, so that users can gain insights. With that approach, modules can expose advanced metrics about the module, but users need to know these metrics and define thresholds in order to create alerts. For less advanced scenarios, it would be helpful to have metrics that are harmonized across all modules and have a very simple threshold such as "error" or "no error". Such a simple metric should be available even if a module does not yet take care of exposing metrics. Users need a way to collect these metrics so that they can build a unified dashboard and define alert rules in their backend.
Criteria
Idea
Every module must currently reflect its state in the status of its module CR by providing a `state` field. It is recommended to also provide more advanced `conditions` with reasons in the status, as done, for example, by the telemetry module:
The state of each module is also reflected in the Kyma CR, together with the overall Kyma state, as shown in the shortened example:
Reflecting that status information via custom module metrics would require additional effort and a harmonized approach (metric syntax and semantics) across all modules, which would be very hard to achieve.
Instead, we could offer a dedicated input for a MetricPipeline that provides metrics for the Kyma state itself and the state of all modules, based on the Kyma CR, plus metrics representing the individual module conditions. The metrics will be gauges with simple values of 0 or 1 for easy alerting. The relation to the used module CRs is already available via the Kyma status.
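A dedicated input based on the Kyma CR could work roughly as follows. The Python sketch treats the Kyma CR as an unstructured dict and assumes its status carries an overall `state` plus a `modules` list whose entries have `name` and `state` fields. That layout, the optional `selected` module filter (mirroring a `modules` list in the pipeline input), the function name `kyma_cr_metrics`, and the metric name `kyma.status.state` for the overall Kyma state are all assumptions, since the proposal does not fix them.

```python
from typing import Any, Optional

# Hypothetical metric name for the overall Kyma state; the proposal does not define one.
KYMA_STATE_METRIC = "kyma.status.state"
MODULE_STATE_METRIC = "kyma.module.status.state"


def kyma_cr_metrics(
    kyma: dict[str, Any], selected: Optional[set[str]] = None
) -> list[tuple[str, dict[str, str], int]]:
    """Emit 0/1 gauges from a Kyma CR given as an unstructured dict.

    selected restricts which modules are reported; None means all modules.
    """
    status = kyma.get("status", {})
    kyma_state = status.get("state", "")
    metrics = [(KYMA_STATE_METRIC, {"state": kyma_state}, int(kyma_state == "Ready"))]
    for module in status.get("modules", []):
        name = module.get("name", "")
        if selected is not None and name not in selected:
            continue  # module not requested by the pipeline input
        state = module.get("state", "")
        metrics.append((MODULE_STATE_METRIC, {"name": name, "state": state}, int(state == "Ready")))
    return metrics
```

Passing `{"telemetry"}` as `selected` restricts the module series to the telemetry module, which matches the decision to focus on that module first. Condition metrics would still have to be read from the individual module CRs.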
An example pipeline can look like this:
Example metrics can look like this:
Items: