aisbaa opened this issue 1 year ago
This issue is currently awaiting triage.
If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.
The triage/accepted label can be added by org members by writing /triage accepted in a comment.
Running kube-state-metrics in HA is pretty much the same as running any other application. You need the following:
For example, that's what I did in kube-prometheus for prometheus-adapter, and it should be fairly similar for kube-state-metrics:
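For reference, here is a minimal sketch of the pieces such an HA setup usually involves: multiple replicas, pod anti-affinity so the replicas land on different nodes, and a PodDisruptionBudget. The labels, replica count, image tag, and PDB values below are illustrative assumptions, not copied from kube-prometheus.

```yaml
# Illustrative sketch only, not the actual kube-prometheus manifests:
# two replicas spread across nodes, with at least one kept available
# during voluntary disruptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
  template:
    metadata:
      labels:
        app.kubernetes.io/name: kube-state-metrics
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app.kubernetes.io/name: kube-state-metrics
              topologyKey: kubernetes.io/hostname
      containers:
        - name: kube-state-metrics
          image: registry.k8s.io/kube-state-metrics/kube-state-metrics:v2.10.0  # version is an example
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-state-metrics
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: kube-state-metrics
```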
@aisbaa do you wanna give creating the doc a try?
/help
@dgrisonnet: This request has been marked as needing help from a contributor.
Please ensure that the issue body includes answers to the following questions:
For more details on the requirements of such an issue, please see here and ensure that they are met.
If this request no longer meets these requirements, the label can be removed by commenting with the /remove-help command.
Sounds like something I would like to tackle next July. Then I could ask/help/tackle the same in the kube-prometheus project.
@dgrisonnet I wonder whether the architecture you shared would be an issue for counter-type metrics?
@whitepiratebaku do you have an example in mind of a counter metric that could break?
Normally, since both replicas base themselves on the apiserver data, they will always be in sync: the counters take their value from the apiserver.
@dgrisonnet since kube-state-metrics stores the data fetched from the apiserver in memory, isn't there a chance that the value in memory has not been updated in one of the KSM replicas, and it ends up with a lower counter?
Also, prometheus would scrape from both replicas, and we can also end up with out-of-order insertions (?)
> Also, prometheus would scrape from both replicas, and we can also end up with out-of-order insertions (?)
This is interesting, is there a way to detect this?
We have been running an HA setup for 3 months now. It definitely helped with missing metrics during restarts. For the small set of metrics that we care about we apply avg. Though it might be that we're missing something.
To be clear, I don't think out-of-order insertion is too big of a problem: as far as I understand it'll be rejected by Prometheus, and you just don't get as many data points as you otherwise would from running multiple replicas of KSM.
Averaging the values is interesting... There are some problems I see, but it could work if you only specially handle a few metrics from KSM.
The problem is that not all metrics can be averaged, e.g. kube_cronjob_next_schedule_time stores a timestamp.
Another example: if you're monitoring a pod with replica status = failed, then averaging the 0/1 value leads to a very weird interpretation of the alerts.
With these issues, I'm just not very sure that KSM can be HA-ed as easily as mentioned in the original suggestion (?)
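As a hedged illustration of what "specially handling a few metrics" could look like, one could pick an aggregation per metric instead of averaging everything. The rule names and the choice of max are my assumptions, not something proposed in the thread.

```yaml
# Sketch only: per-metric deduplication rules for a handful of KSM metrics.
groups:
  - name: ksm-ha-dedup-example
    rules:
      # Boolean 0/1 status: "any replica reports ready" counts as ready.
      - record: dedup:kube_pod_container_status_ready
        expr: max without (instance) (kube_pod_container_status_ready)
      # Timestamp-valued metric: both replicas report the same apiserver
      # value, so max simply deduplicates instead of averaging a timestamp.
      - record: dedup:kube_cronjob_next_schedule_time
        expr: max without (instance) (kube_cronjob_next_schedule_time)
```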
> To be clear, I don't think out-of-order insertion is too big of a problem: as far as I understand it'll be rejected by Prometheus, and you just don't get as many data points as you otherwise would from running multiple replicas of KSM.
Got it, during scrape we don't strip the instance label, which contains the target IP address (the KSM pod IP address). That results in 2 physical time series per logical metric, e.g.:
kube_pod_container_status_ready{container="kube-state-metrics", instance="10.16.173.10:8081", job="ha-kube-state-metrics", namespace="default", pod="benchmark-5cb4b9bcbc-rkwql", uid="a565e71b-046d-4406-814c-b2503ab371ad"}
kube_pod_container_status_ready{container="kube-state-metrics", instance="10.17.18.172:8081", job="ha-kube-state-metrics", namespace="default", pod="benchmark-5cb4b9bcbc-rkwql", uid="a565e71b-046d-4406-814c-b2503ab371ad"}
So out-of-order insertion should not be an issue with this setup.
> The problem is that not all metrics can be averaged
That might be the case. I have to admit that I didn't put much effort into figuring out whether averaging would work for all metrics. My assumption is that the average of 2 physical metrics with the same value should work as deduplication:
avg without(instance) (kube_pod_container_status_ready)
What do you think? There must be something I missed.
> My assumption is that the average of 2 physical metrics with the same value should work as deduplication.
That is true when "both are with the same value", i.e. both values are 1 or both values are 0.
However, in my HA test run, sometimes one KSM replica would return 0 and the other would return 1. This depends on when the metric is calculated. While they'll be eventually consistent, it is pretty weird during the transition time. For example, this behavior can lead to alerts firing if someone is not carefully crafting the alert evaluation query.
Does this make sense to you?
(For better HA mode, I'm wondering if KSM should have some built-in leader election and ensure there's only one instance running at a time.)
> However, in my HA test run, sometimes one KSM replica would return 0 and the other would return 1.
It does make sense. I've seen that only during the transition, which is super short, no longer than a single Prometheus scrape, which is 30s in our case.
A workaround for it could be running 3 KSM instances. Then avg and round might solve the issue:
round(avg without (instance) (kube_pod_container_status_ready))
> ... this behavior can lead to alerts firing if someone is not carefully crafting the alert evaluation query.
Agree that you need to be more careful when writing alerts when you have multiple KSM instances. Though I haven't seen evidence that the transition state would last more than 1 scrape, so alert evaluation is likely not an issue.
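Purely as an illustration of what "careful alert crafting" could mean here (not something proposed in the thread), a hedged sketch of an alert rule that tolerates the short transition window by aggregating across replicas and requiring the condition to hold for several scrape intervals; the alert name, threshold, and duration are made up:

```yaml
# Sketch only: an alert that won't fire on a transient 0/1 disagreement
# between two KSM replicas.
groups:
  - name: ksm-ha-alert-example
    rules:
      - alert: ContainerNotReady
        # Aggregate across replicas so a single lagging replica doesn't flip the value.
        expr: round(avg without (instance) (kube_pod_container_status_ready)) == 0
        # Require the condition to persist well past one scrape interval.
        for: 5m
        labels:
          severity: warning
```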
Thanks @aisbaa. While it works in a way if we carefully craft things, in a few places we found that it leads to surprises for us, so we decided not to pursue this approach.
For example, the community Grafana dashboards don't really have this assumption baked in, people who view the raw metrics could get confused, and it requires us to put strict constraints on the scrape interval and the alert evaluation interval.
It was nice you brought up a workaround for those who need this tho, so thanks for that and all the discussion!
@dgrisonnet Would it make sense to open up an issue to introduce leader election for HA in kube-state-metrics?
FWIW, issue #611 addresses a similar situation and describes how leader election behaved while using both methods (trying to consolidate things in one place).
I'm fairly biased towards this approach, because there's very little complexity involved in the setup and barely any new code is required. The main trade-off is that you need more storage space and possibly more memory on the Prometheus side. A rough HA setup could be:
1. Run two or more kube-state-metrics replicas.
2. Rename the metrics at scrape time so they are marked as partial (e.g. kube_pod_container_status_ready -> ha_partial_kube_pod_container_status_ready).
3. Add recording rules that deduplicate the partial series (this is the new code + additional manifests):
   record: ha_kube_pod_container_status_ready
   expr: round(avg without (instance) (ha_partial_kube_pod_container_status_ready))
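Not part of the original comment, but a hedged sketch of what steps 2 and 3 could look like in Prometheus configuration. The job name, service discovery role, and renaming regex are assumptions; the ha_partial_ prefix and the recording rule come from the steps above.

```yaml
# Step 2 (sketch): prefix every KSM metric at scrape time so the raw series
# are clearly marked as partial. This goes in the Prometheus scrape config.
scrape_configs:
  - job_name: ha-kube-state-metrics
    kubernetes_sd_configs:
      - role: endpoints
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "(kube_.*)"
        target_label: __name__
        replacement: "ha_partial_${1}"

# Step 3 (sketch): deduplicate the partial series with recording rules.
# These live in a separate rule file, shown here for illustration only.
groups:
  - name: ksm-ha-recording-rules
    rules:
      - record: ha_kube_pod_container_status_ready
        expr: round(avg without (instance) (ha_partial_kube_pod_container_status_ready))
```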
At Uber we're using steps 1 and 2 of this approach in the devpod project to ensure we have HA for the small set of metrics we use for uptime SLI calculation.
P.S. Having said that, HA with leader election does sound cool.
Might be a small one, but good to add to the pile. Noticed that during a KSM restart it is possible to get logical duplicates if the deployment strategy is set to rolling update. I believe that this approach would also solve this issue.
What would you like to be added:
I would love to ask for a recommendation on a High Availability setup for kube-state-metrics.
Why is this needed:
We use some kube-state-metrics metrics to calculate our service SLI. Unfortunately, during host rebalancing or host upgrades we get gaps in the SLI metrics because kube-state-metrics gets restarted/rescheduled (becomes unavailable).
Describe the solution you'd like
It would be great to have a paragraph in the README describing an HA setup, similar to the scaling recommendations.
Additional context
We're using kube-state-metrics bundled by kube-prometheus project (which does run a single instance of kube-state-metrics).