giantswarm / roadmap

Giant Swarm Product Roadmap
https://github.com/orgs/giantswarm/projects/273
Apache License 2.0

Update Grafana Cloud recording rules to work with CAPI clusters #3634

Open QuantumEnigmaa opened 3 months ago

QuantumEnigmaa commented 3 months ago

Some dashboards accessible from Grafana Cloud, such as the Clusters and Customers dashboards, are missing all data for CAPI clusters.

After investigating a bit, I found that this is because the metrics used in the recording rules sent to Grafana Cloud (and thus used in those dashboards) are not present on CAPI clusters, which have their own equivalents.

For example, vintage clusters expose the cluster_service_cluster_info metric, while on CAPI clusters the equivalent is capi_cluster_info. However, there is currently no way to get the CAPI cluster release version, because capi_cluster_info has no release label.

We thus need to update the Grafana Cloud recording rules in the prometheus-rules repo to cover both vintage and CAPI clusters.
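A minimal sketch of what a rule covering both cluster types could look like, not the actual change to prometheus-rules. The group name and record name are hypothetical, and it assumes the two info metrics carry labels that the dashboards can treat alike, which is not confirmed here. As noted above, the CAPI series would still have no release label.

```yaml
groups:
  - name: grafana-cloud.cluster-info.sketch   # hypothetical group name
    rules:
      # Hypothetical record name; the real rules use their own naming.
      - record: aggregation:giantswarm:cluster_info_sketch
        # `or` keeps every vintage series and adds each CAPI series whose
        # label set is not already present on the vintage side.
        expr: |
          cluster_service_cluster_info
          or
          capi_cluster_info
```

The `or` set operator is used because a given cluster is expected to expose only one of the two metrics, so the result is simply the union of both series sets under a single recorded name.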

QuantumEnigmaa commented 3 months ago

I didn't find an actual equivalent for the cluster_operator_cluster_create_transition and cluster_operator_cluster_update_transition metrics. The closest CAPI metrics are capi_cluster_created and capi_cluster_status_condition_last_transition_time, but they don't really match.

QuentinBisson commented 3 months ago

@giantswarm/team-turtles would you have any idea how those could be mapped to capi metrics?

weseven commented 3 months ago

> @giantswarm/team-turtles would you have any idea how those could be mapped to capi metrics?

As far as I know, there's no metric in the CAPI controllers that holds the time spent creating or upgrading a cluster. I guess cluster creation time could be computed as the time during which capi_cluster_status_phase{phase="Provisioning"} == 1 held for the cluster. Updates are a bit dicey: we would need to check the status subresource of the CAPI resources (cluster, kubeadmcontrolplane, machinepools) to be sure.

Not sure if @nprokopic or @njuettner have better ideas.
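The Provisioning-phase idea above could be approximated with a subquery, sketched below under stated assumptions: the group name, record name, the 7d lookback and the 1m resolution are all hypothetical choices, and only the capi_cluster_status_phase metric and its phase label come from this thread.

```yaml
groups:
  - name: capi-cluster-creation.sketch   # hypothetical group name
    rules:
      # Hypothetical record name. The value approximates the seconds a
      # cluster spent in the Provisioning phase within the last 7 days.
      - record: aggregation:giantswarm:capi_cluster_provisioning_seconds_sketch
        # The subquery evaluates the comparison once per minute over 7d;
        # count_over_time counts the evaluations where it held, and
        # multiplying by 60 converts that count to seconds.
        expr: |
          count_over_time(
            (capi_cluster_status_phase{phase="Provisioning"} == 1)[7d:1m]
          ) * 60
```

This is only accurate to the subquery resolution, it counts any repeated Provisioning periods inside the window together, and the series disappears once the cluster has not been in Provisioning for 7 days. It does not address update durations.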

QuantumEnigmaa commented 2 months ago

After meeting with @njuettner, we came up with this solution:

@njuettner, please don't hesitate to correct me if I wrote something wrong :)