kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

High Availability setup documentation request #2081

Open · aisbaa opened this issue 1 year ago

aisbaa commented 1 year ago

What would you like to be added:

I would love to ask for a recommendation on a High Availability setup for kube-state-metrics.

Why is this needed:

We use some kube-state-metrics metrics to calculate our service SLI. Unfortunately, during host rebalancing or host upgrades we get gaps in the SLI metrics because kube-state-metrics gets restarted/rescheduled (becomes unavailable).

Describe the solution you'd like

It would be great to have a paragraph in the README describing an HA setup, similar to the scaling recommendations.

Additional context

We're using kube-state-metrics bundled by the kube-prometheus project (which runs a single instance of kube-state-metrics).

k8s-ci-robot commented 1 year ago

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
dgrisonnet commented 1 year ago

Running kube-state-metrics in HA is pretty much the same as for any other application. You need the following:

- 2 replicas
- anti-affinity rules on hostname, preferably hard anti-affinity to prevent SPOF
- a rolling update strategy with maxUnavailable set to 1
- a PDB with minAvailable: 1

For example, that's what I did in kube-prometheus for prometheus-adapter, and it should be fairly similar for kube-state-metrics (a minimal manifest sketch of these points follows below):

- https://github.com/prometheus-operator/kube-prometheus/pull/1095
- https://github.com/prometheus-operator/kube-prometheus/pull/1136
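Not part of the original comment, but a minimal manifest sketch of the four points above (the names, namespace, labels, and image tag are illustrative, not taken from this thread):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  replicas: 2
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      affinity:
        podAntiAffinity:
          # hard anti-affinity: never schedule both replicas on the same node
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchLabels:
                app.kubernetes.io/name: kube-state-metrics
            topologyKey: kubernetes.io/hostname
      containers:
      - name: kube-state-metrics
        image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-state-metrics
  namespace: kube-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
```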

@aisbaa do you wanna give it a try at creating the doc?

/help

k8s-ci-robot commented 1 year ago

@dgrisonnet: This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.

In response to [this](https://github.com/kubernetes/kube-state-metrics/issues/2081):

> Running kube-state-metrics in HA is pretty much the same as any other applications. You need the following:
> - 2 replicas
> - anti-affinity rules on hostname, preferably hard anti-affinity to prevent SPOF
> - rolling update strategy with maxUnavailable set to 1
> - a PDB with minAvailable: 1
>
> For example that's what I did in kube-prometheus for prometheus-adapter and it should be fairly the same for kube-state-metrics:
> - https://github.com/prometheus-operator/kube-prometheus/pull/1095
> - https://github.com/prometheus-operator/kube-prometheus/pull/1136
>
> @aisbaa do you wanna give it a try at creating the doc?
>
> /help

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
aisbaa commented 1 year ago

Sounds like something I would like to tackle next July. Then I could ask/help/tackle the same in the kube-prometheus project.

whitepiratebaku commented 1 year ago

@dgrisonnet I wonder whether the architecture you shared would be an issue for counter-type metrics?

dgrisonnet commented 1 year ago

@whitepiratebaku do you have an example in mind of a counter metric that could break?

Normally, since both replicas base themselves on the apiserver data, they will always be in sync: the counters take their values from the apiserver.

leoskyrocker commented 1 year ago

@dgrisonnet since kube-state-metrics stores the data fetched from the apiserver in memory, isn't there a chance that the value in memory has not been updated in one of the KSM replicas, and it ends up with a lower counter?

Also, prometheus would scrape from both replicas, and we can also end up with out-of-order insertions (?)

aisbaa commented 1 year ago

Also, prometheus would scrape from both replicas, and we can also end up with out-of-order insertions (?)

This is interesting, is there a way to detect this?

We have been running an HA setup for 3 months now. It definitely helped with missing metrics during restarts. For the small set of metrics that we care about, we apply avg. Though it might be that we're missing something.

leoskyrocker commented 1 year ago

To be clear, I don't think out-of-order insertion is too big of a problem, as iiuc it'll be rejected by Prometheus, and you just don't have as many data points as you'll get from running multiple replicas of KSM.

As for averaging the values, that's interesting. There are some problems I see, but it could work if you only special-case a few metrics from KSM.

The problem is that not all metrics can be averaged, e.g. kube_cronjob_next_schedule_time stores a timestamp. Another example: if you're monitoring a pod with replica status = failed, then averaging the 0/1 value leads to a very weird interpretation of the alerts.

With these issues, I'm just not very sure that KSM can be HA-ed as easily as mentioned in the original suggestion (?)
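Purely as an illustration (not something proposed in the thread), deduplication across replicas could pick a different aggregation per metric family instead of averaging everything; the record names and group name below are made up:

```yaml
groups:
- name: ksm-ha-dedup-examples
  rules:
  # timestamp metrics: take the max across replicas instead of averaging them
  - record: dedup:kube_cronjob_next_schedule_time
    expr: max without (instance) (kube_cronjob_next_schedule_time)
  # 0/1 state metrics: max means "reported as ready by at least one replica"
  - record: dedup:kube_pod_container_status_ready
    expr: max without (instance) (kube_pod_container_status_ready)
```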

aisbaa commented 1 year ago

To be clear, I don't think out-of-order insertion is too big of a problem, as iiuc it'll be rejected by Prometheus, and you just don't have as many data points as you'll get from running multiple replicas of KSM.

Got it. During scraping we don't strip the instance label, which contains the target IP address (the KSM pod IP address). That results in 2 physical time series per logical metric, e.g.:

kube_pod_container_status_ready{container="kube-state-metrics", instance="10.16.173.10:8081", job="ha-kube-state-metrics", namespace="default", pod="benchmark-5cb4b9bcbc-rkwql", uid="a565e71b-046d-4406-814c-b2503ab371ad"}
kube_pod_container_status_ready{container="kube-state-metrics", instance="10.17.18.172:8081", job="ha-kube-state-metrics", namespace="default", pod="benchmark-5cb4b9bcbc-rkwql", uid="a565e71b-046d-4406-814c-b2503ab371ad"}

So out-of-order insertion should not be an issue with this setup.

The problem is that not all metrics can be averaged

That might be the case. I have to admit that I didn't put much effort into figuring out whether averaging would work for all metrics. My assumption is that the average of 2 physical metrics with the same value should work as deduplication:

avg without(instance) (kube_pod_container_status_ready)

What do you think? There must be something I missed.

leoskyrocker commented 1 year ago

My assumption is that the average of 2 physical metrics with the same value should work as deduplication.

That is true when "both are with the same value", i.e. both values are 1 or both values are 0.

However, in my HA test run, sometimes one KSM replica would return 0 and the other would return 1. This depends on when the metric is calculated. While they'll be eventually consistent, it is pretty weird during the transition time. For example, this behavior can lead to alerts firing if someone is not carefully crafting the alert evaluation query.

Does this make sense to you?

(For better HA mode, I'm wondering if KSM should have some built-in leader election and ensure there's only one instance running at a time.)

aisbaa commented 1 year ago

However, in my HA test run, sometimes one KSM replica would return 0 and the other would return 1.

It does make sense. I've only seen that during the transition, which is super short: no longer than a single Prometheus scrape interval, which is 30s in our case.

A workaround for it could be running 3 KSM instances. Then avg and round might solve the issue:

round(avg without (instance) (kube_pod_container_status_ready))

... this behavior can lead to alerts firing if someone is not carefully crafting the alert evaluation query.

Agreed that you need to be more careful when writing alerts if you have multiple KSM instances. Though I haven't seen evidence that the transition state would last more than 1 scrape interval, so alert evaluation is likely not an issue.
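For illustration only, "being careful" could mean pairing the deduplicating expression with a `for` duration longer than one scrape interval, so a brief disagreement between replicas doesn't fire the alert (the alert name, group name, and 5m duration are made up):

```yaml
groups:
- name: ksm-ha-alert-example
  rules:
  - alert: PodContainerNotReady
    # avg+round collapses the per-instance series; `for` rides out the short
    # window where one KSM replica still reports the old value
    expr: round(avg without (instance) (kube_pod_container_status_ready)) == 0
    for: 5m
```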

leoskyrocker commented 1 year ago

Thanks @aisbaa. While it works in a way if we carefully craft things, at a few layers we found that it leads to surprises for us, so we decided not to pursue this approach.

For example, the community Grafana dashboards don't really have this assumption baked in, people who view the raw metrics could get confused, and it requires us to put strict constraints on the scrape interval and the alert evaluation interval.

It was nice you brought up a workaround for those who need this tho, so thanks for that and all the discussion!

leoskyrocker commented 1 year ago

For example, the community Grafana dashboards don't really have this assumption baked in, people who view the raw metrics could get confused, and it requires us to put strict constraints on the scrape interval and the alert evaluation interval.

For better HA mode, I'm wondering if KSM should have some built-in leader election and ensure there's only one instance running at a time.

@dgrisonnet Would it make sense to open up an issue to introduce leader election for HA in kube-state-metrics?

nalshamaajc commented 8 months ago

FWIW, this issue addresses a similar situation and describes how leader election behaved while using both methods (trying to consolidate things in one place): issue 611

aisbaa commented 7 months ago

I'm fairly biased towards this approach, because there's very little complexity involved in the setup and barely any new code is required. The main trade-off is that you need more storage space and possibly memory on the Prometheus side. A rough HA setup could be:

  1. Run multiple KSM instances (2-3; 2 could be fine because each instance just exposes k8s data and they do not have to form a consensus with each other).
  2. Scrape metrics from all of them and add a prefix to each metric (e.g. kube_pod_container_status_ready -> ha_partial_kube_pod_container_status_ready); see the scrape-config sketch after this list.
  3. Use a bunch of recording rules to produce the final metric from the partial metrics. This part is fairly tedious; hopefully it could be generated. Example recording rule:
# this is the new code + additional manifests (a fragment of a Prometheus rules file)
- record: ha_kube_pod_container_status_ready
  expr: round(avg without (instance) (ha_partial_kube_pod_container_status_ready))
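A sketch of step 2, assuming a plain Prometheus scrape config (the job name and targets reuse examples from earlier in the thread; the regex and renaming rule are illustrative):

```yaml
scrape_configs:
- job_name: ha-kube-state-metrics
  static_configs:
  # the two KSM pod IPs; in practice these would come from service discovery
  - targets: ["10.16.173.10:8081", "10.17.18.172:8081"]
  metric_relabel_configs:
  # rename every scraped kube_* series to its ha_partial_* "partial" counterpart
  - source_labels: [__name__]
    regex: "(kube_.*)"
    target_label: __name__
    replacement: "ha_partial_${1}"
```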

At Uber we're using steps 1 and 2 of this approach in the devpod project to ensure we have HA for the small set of metrics we use for uptime SLI calculation.

P.S. Having said that, HA with leader election does sound cool.

aisbaa commented 7 months ago

Might be a small one, but good to add to the pile. I noticed that during a KSM restart it is possible to get logical duplicates if the deployment strategy is set to rolling update. I believe that this approach would also solve that issue.

[screenshot: ksm-duplicates-during-restart]