observatorium / thanos-receive-controller

Kubernetes controller to automatically configure Thanos receive hashrings
Apache License 2.0

Probe Statefulset Pods until healthy upon scale up #71

Closed metalmatze closed 3 years ago

metalmatze commented 3 years ago

This is crucial so that we don't send requests to receivers too early. In combination with https://github.com/thanos-io/thanos/pull/3845 for quicker reloading of the Thanos Receive routes, I was able to achieve > 99% availability (both errors and latency) while scaling up from 3 replicas to 20.

(benchmark chart: 2-receiver-route-improve-controller)
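For illustration, here is a minimal client-go sketch of the idea in this PR's title: after a scale-up, poll the StatefulSet's pods and only report them once they are Ready, so the hashring never contains endpoints that cannot take requests yet. This is not the PR's actual implementation; the namespace, label selector, and timings are placeholders.

```go
// A minimal sketch (not the PR's actual implementation) of probing StatefulSet
// pods for readiness before exposing them, using client-go. Namespace, label
// selector, and timings are placeholders.
package controller

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// podReady reports whether the pod's Ready condition is True.
func podReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady && c.Status == corev1.ConditionTrue {
			return true
		}
	}
	return false
}

// waitForReadyPods polls until every pod matching the selector is Ready and
// returns their names; only then would the hashring ConfigMap be updated.
func waitForReadyPods(ctx context.Context, client kubernetes.Interface, namespace, selector string) ([]string, error) {
	var ready []string
	err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
		pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		ready = ready[:0]
		for i := range pods.Items {
			if podReady(&pods.Items[i]) {
				ready = append(ready, pods.Items[i].Name)
			}
		}
		// Keep polling while some pods are still starting up.
		return len(pods.Items) > 0 && len(ready) == len(pods.Items), nil
	})
	if err != nil {
		return nil, fmt.Errorf("waiting for ready pods: %w", err)
	}
	return ready, nil
}
```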

metalmatze commented 3 years ago

Looking into linting failures.

jmichalek132 commented 3 years ago

One question about this: instead of this approach, why not just look at the Endpoints of a Service pointing to the thanos-receive StatefulSet?
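A rough sketch of that alternative, assuming a headless Service sits in front of the thanos-receive StatefulSet: Kubernetes only lists a pod's address under the Service's Endpoints object once its readiness probe passes, so the ready addresses could be read straight from there. The Service name and namespace below are hypothetical.

```go
// Sketch of the Endpoints-based alternative: read the Endpoints object of a
// (hypothetical) headless Service in front of the thanos-receive StatefulSet;
// kubelet only lists an address there once the pod's readiness probe passes.
package controller

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyReceiveAddresses returns the addresses Kubernetes currently considers
// ready for the given Service, which could feed the hashring directly.
func readyReceiveAddresses(ctx context.Context, client kubernetes.Interface, namespace, service string) ([]string, error) {
	ep, err := client.CoreV1().Endpoints(namespace).Get(ctx, service, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, subset := range ep.Subsets {
		// subset.Addresses holds only ready endpoints; not-ready pods live in
		// subset.NotReadyAddresses and are skipped here.
		for _, a := range subset.Addresses {
			addrs = append(addrs, a.IP)
		}
	}
	return addrs, nil
}
```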

metalmatze commented 3 years ago

> One question about this: instead of this approach, why not just look at the Endpoints of a Service pointing to the thanos-receive StatefulSet?

Hm. Interesting. That might just work the same, yeah. I feel like this current approach gives us a little bit more control though. WDYT?

jmichalek132 commented 3 years ago

> One question about this: instead of this approach, why not just look at the Endpoints of a Service pointing to the thanos-receive StatefulSet?

> Hm. Interesting. That might just work the same, yeah. I feel like this current approach gives us a little bit more control though. WDYT?

The behavior would be slightly different, especially if you were to also remove the instance from the ConfigMap once it's removed from the Endpoints. But it would also respect other things such as failureThreshold, periodSeconds, etc. Honestly, I would prefer that, but I am not sure what consequences it would have on Thanos Receive behavior.
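To make that trade-off concrete, this is roughly what those knobs look like on the receive pods' readiness probe; with the Endpoints-based approach, hashring membership would follow whatever values are configured here. The path and port are assumptions (Thanos exposes /-/ready on its HTTP port), not values taken from this repository.

```go
// Illustrative readiness probe for a receive pod, expressed as the client-go
// struct (recent k8s.io/api versions embed ProbeHandler; older ones used
// Handler). Path and port are placeholders, not this repo's defaults.
package controller

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

var receiveReadinessProbe = &corev1.Probe{
	ProbeHandler: corev1.ProbeHandler{
		HTTPGet: &corev1.HTTPGetAction{
			Path: "/-/ready",            // Thanos readiness endpoint
			Port: intstr.FromInt(10902), // HTTP port (placeholder)
		},
	},
	PeriodSeconds:    5, // how often kubelet probes
	FailureThreshold: 3, // consecutive failures before the pod is marked NotReady
}
```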

metalmatze commented 3 years ago

I'll think about it a bit more. If you were to hack a bit on this code to make it somewhat work, I'm happy to run it with the benchmark suite again so we get some numbers to compare :+1:

metalmatze commented 3 years ago

Seems like this fails now because the current master is broken too.

jmichalek132 commented 3 years ago

I will find some time for it over the weekend. Is that okay?

spaparaju commented 3 years ago
(screenshot: Screenshot 2021-03-24 at 5 06 32 PM)

For testing this PR (the current tests run against a fake Kubernetes client), I maxed out the cluster resources by scaling up the thanos-receive pods (on minikube each thanos-receive pod starts with 512 MB of memory). Under these maxed-out conditions, the thanos-receive pods never reach the requested .spec.replicas, and the Observatorium hashring should only contain the URLs of replicas that are in Ready status. The screenshot above shows the result of this test, where the hashring is out of sync with the replicas that are actually Ready.
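As a hedged sketch of what that would mean for the controller: build hashring endpoints from .status.readyReplicas rather than .spec.replicas, so pods that never become Ready on an under-resourced cluster stay out of the ring. All names here are illustrative, and deriving endpoints from ordinals 0..readyReplicas-1 only holds for the default OrderedReady pod management policy.

```go
// Sketch: derive hashring endpoints from the StatefulSet's readyReplicas
// instead of spec.replicas. Names and the FQDN format are illustrative.
package controller

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// readyEndpoints builds pod FQDNs for the first readyReplicas ordinals only,
// assuming ordinals become Ready in order (the OrderedReady default).
func readyEndpoints(ctx context.Context, client kubernetes.Interface, namespace, name, serviceName string, port int) ([]string, error) {
	sts, err := client.AppsV1().StatefulSets(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	endpoints := make([]string, 0, sts.Status.ReadyReplicas)
	for i := int32(0); i < sts.Status.ReadyReplicas; i++ {
		endpoints = append(endpoints,
			fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local:%d", name, i, serviceName, namespace, port))
	}
	return endpoints, nil
}
```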

metalmatze commented 3 years ago

I think you're talking about a slightly different problem. It's more of an addition to this existing PR, which we should be able to handle in a follow-up PR building on top of this one. What do you think?

metalmatze commented 3 years ago

I'd be happy to go forward with this and take care of other improvements in future PRs.

christopherzli commented 5 months ago

Just curious, this would not work when multiple share the same hashring, right?