ovh / public-cloud-roadmap

Agile roadmap for OVHcloud Public Cloud services. Discover the features our product teams are working on, comment and influence our backlog.
https://www.ovhcloud.com/en/public-cloud/

Monitoring ETCD quota #105

Open Grounz opened 3 years ago

Grounz commented 3 years ago

In some cases the etcd quota is exceeded, as explained here: https://docs.ovh.com/gb/en/kubernetes/etcd-quota-error/. Once this quota is reached, deploying new elements to the cluster or removing existing ones no longer works. So in production it is mandatory to monitor this.

As a Kubernetes SRE team, we want to monitor the etcd storage quota, maybe through exporter metrics for Prometheus. We would then set up alerts on these metrics, know when the quota is about to be exceeded, and investigate without needing to solicit the OVH support teams.

Today we use the kube-prometheus-stack operator to monitor our K8s clusters, and it works well.

Grounz commented 3 years ago

Hi, is that clear?

mhurtrel commented 3 years ago

Hello @Grounz, yes, the issue is clear. I will check with the team whether there is currently a way to get this information and document it here; otherwise I will add this to the backlog. I agree this is definitely a need.

matheyal commented 2 years ago

Hi @mhurtrel and the OVHcloud team,

Do you have any news on this request?

I've also been experiencing issues with this quota being reached without any way to see it coming. It would be really nice to have a way to monitor this metric.

mhurtrel commented 2 years ago

Hi @matheyal and sorry for the delay on this. We had to tackle other priorities and I will come back here soon to define an ETA.

ddelpha commented 2 years ago

Hello @mhurtrel our team recently experienced the same issue as @matheyal and we are looking for a solution so that our platform does not lock up with an error status again. Any assistance is appreciated.

arcalys commented 2 years ago

Same here. We encountered this twice on our cluster, and it caused downtime.

mhurtrel commented 2 years ago

Hi @arcalys @ddelpha @matheyal @Grounz, I confirm I will get this feature prioritized in the first half of 2022, but I can't share a precise ETA yet.

ddelpha commented 2 years ago

@mhurtrel nice :)

thank you

arcalys commented 2 years ago

Thanks for the info @mhurtrel =)

lilvinz commented 2 years ago

We have hit this issue in production as well. First half of 2022 is nearly over now. Any news / ETA? Thanks!

mhurtrel commented 2 years ago

Hi @lilvinz and thanks for the heads up. This feature will be released this summer, between June and August. Sorry for the delay.

jMonsinjon commented 2 years ago

Hello, we have planned to deploy this feature by the end of November. It takes some time to design how to give you access to data stored in our "management" perimeter.

More information will follow.

mhurtrel commented 1 year ago

You can now consult the etcd storage quota and usage for each cluster through the API endpoint: https://api.ovh.com/console/#/cloud/project/{serviceName}/kube/{kubeId}/metrics/etcdUsage~GET

This information will soon be added to the control panel, and we are exploring options to send proactive alerts to users approaching the maximum usage.
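
For readers who want to try the endpoint above, here is a minimal Python sketch that builds the API path and computes how full the quota is. It is only an illustration under assumptions: it supposes the response is a JSON object with `quota` and `usage` fields expressed in bytes (check this in the API console), and that the third-party `ovh` Python client is installed and configured with credentials. The helper names `etcd_usage_path`, `usage_ratio` and `fetch_etcd_usage` are hypothetical.

```python
from typing import Mapping


def etcd_usage_path(service_name: str, kube_id: str) -> str:
    """Build the API path of the etcd usage endpoint for one cluster."""
    return f"/cloud/project/{service_name}/kube/{kube_id}/metrics/etcdUsage"


def usage_ratio(payload: Mapping[str, float]) -> float:
    """Return usage / quota. Assumes both fields exist and share one unit."""
    quota = payload["quota"]
    if quota <= 0:
        raise ValueError("quota must be positive")
    return payload["usage"] / quota


def fetch_etcd_usage(service_name: str, kube_id: str) -> Mapping[str, float]:
    """Query the OVHcloud API. Requires the `ovh` package and valid credentials."""
    import ovh  # imported lazily so the helpers above work without it

    client = ovh.Client()  # reads endpoint and keys from ovh.conf or the environment
    return client.get(etcd_usage_path(service_name, kube_id))


if __name__ == "__main__":
    # Hypothetical payload used instead of a live API call.
    sample = {"quota": 400_000_000, "usage": 300_000_000}
    print(f"etcd quota used: {usage_ratio(sample):.0%}")
```

A ratio like this could be compared against a warning threshold of your choice (for example 0.8) until proactive alerts are available on the OVHcloud side.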

rverchere commented 1 year ago

OK, so now we just need to write a Prometheus exporter that retrieves these values and exposes them as metrics on our cluster :smile:
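
As a rough idea of what such an exporter involves, here is a minimal sketch using only the Python standard library. It is not taken from any existing exporter: the metric names (`ovh_kube_etcd_quota_bytes`, `ovh_kube_etcd_usage_bytes`), the `kube_id` label and the `fetch` callable are hypothetical, and the payload is assumed to carry `quota` and `usage` fields in bytes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Callable, Mapping


def render_metrics(kube_id: str, payload: Mapping[str, float]) -> str:
    """Render quota and usage in the Prometheus text exposition format."""
    lines = []
    for field in ("quota", "usage"):
        name = f"ovh_kube_etcd_{field}_bytes"  # hypothetical metric name
        lines.append(f"# TYPE {name} gauge")
        lines.append(f'{name}{{kube_id="{kube_id}"}} {payload[field]}')
    return "\n".join(lines) + "\n"


def make_handler(kube_id: str, fetch: Callable[[], Mapping[str, float]]):
    """Build a handler class that calls `fetch` on every scrape."""

    class MetricsHandler(BaseHTTPRequestHandler):
        def do_GET(self) -> None:
            body = render_metrics(kube_id, fetch()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; version=0.0.4")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return MetricsHandler


def serve(kube_id: str, fetch: Callable[[], Mapping[str, float]], port: int = 8000) -> None:
    """Serve metrics until interrupted; `fetch` would wrap the OVHcloud API call."""
    HTTPServer(("", port), make_handler(kube_id, fetch)).serve_forever()


if __name__ == "__main__":
    # Print one rendering with a hypothetical payload instead of starting a server.
    print(render_metrics("my-cluster", {"quota": 400_000_000, "usage": 300_000_000}))
```

In this sketch every scrape triggers one `fetch` call, so a real exporter would probably cache the value for a short time to avoid hitting the OVHcloud API on each Prometheus scrape.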

fkalinowski commented 1 year ago

Hi @mhurtrel, when you said "This information will soon be added to the control panel", do you have an ETA to share? We want to decide whether to implement a scraper against the OVH API or wait for the integration into the control plane. In the second case (control plane), since we already scrape the ApiMetricServer embedded in Kubernetes, we would collect it immediately and in the "standard" format...

mhurtrel commented 1 year ago

Hi @fkalinowski, unfortunately by "control panel" I meant the web UI (aka Manager), not the Kubernetes API. I don't have plans for a Kubernetes API integration yet, so indeed you should develop a scraper for the OVHcloud REST API.

rverchere commented 1 year ago

FYI, I started a small Prometheus exporter that retrieves the etcd quota usage. It's at a very early but working stage.

See https://github.com/rverchere/ovh-mks-exporter