metalmatze closed this pull request 3 years ago.
Looking into linting failures.
One question about this: why not, instead of this, just look at the endpoints of a service pointing to the thanos-receive StatefulSet?
Hm. Interesting. That might just work the same, yeah. I feel like this current approach gives us a little bit more control though. WDYT?
The behavior would be slightly different, especially if you were to also remove the instance from the configmap once it's removed from the endpoints. But it would also respect other things such as failureThreshold, periodSeconds, etc. Honestly, I would prefer that, but I'm not sure what consequences it would have on Thanos Receive's behavior.
I'll think about it a bit more. If you were to hack a bit on this code to make it somewhat work, I'm happy to run it with the benchmark suite again so we get some numbers to compare :+1:
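For reference, here is a rough sketch of what the endpoints-based alternative could look like with client-go, collecting only the addresses whose readiness probe has succeeded. The namespace, service name, and helper function are hypothetical and not part of this PR:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// readyReceiveAddresses returns the addresses the Endpoints object currently
// reports as ready for the given Service, i.e. pods that passed their
// readiness probe (respecting failureThreshold, periodSeconds, and so on).
func readyReceiveAddresses(ctx context.Context, client kubernetes.Interface, namespace, service string) ([]string, error) {
	ep, err := client.CoreV1().Endpoints(namespace).Get(ctx, service, metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	var addrs []string
	for _, subset := range ep.Subsets {
		// Addresses only contains endpoints whose readiness probe succeeded;
		// NotReadyAddresses is deliberately ignored.
		for _, a := range subset.Addresses {
			addrs = append(addrs, a.IP)
		}
	}
	return addrs, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Namespace and service name are placeholders for illustration only.
	addrs, err := readyReceiveAddresses(context.Background(), client, "observatorium", "thanos-receive")
	if err != nil {
		panic(err)
	}
	fmt.Println(addrs)
}
```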
Seems like this fails now because the current master is broken too.
I will find some time for it over the weekend. Is that okay?
For testing this PR (as the current tests are against a fake K8S client), I have maxed out cluster resources (on minikube each Thanos-receive pod starts with 512 MB memory) by scaling up Thanos-receive pods. With these cluster maxed out conditions, thanos-receive pods would not reach the indicated .spec.replicas and the Obs. hash-ring would need to reflect the URLs which are pointing only 'replicas in Ready status'. Here is the screenshot of the this test result where hash-ring is out of sync with the 'replicas in ready status'.
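As a side note, a minimal sketch of how a controller could count only pods in Ready status (rather than trusting .spec.replicas) when building the hashring; the package name, label selector, and helper names are assumptions for illustration, not the controller's actual code:

```go
package readiness

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// isPodReady reports whether the pod's Ready condition is True,
// i.e. its readiness probe is currently passing.
func isPodReady(pod corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// readyPodNames lists the pods matching the selector and keeps only those
// that are Ready; in a resource-starved cluster this can be fewer than
// the StatefulSet's .spec.replicas.
func readyPodNames(ctx context.Context, client kubernetes.Interface, namespace, labelSelector string) ([]string, error) {
	pods, err := client.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: labelSelector})
	if err != nil {
		return nil, err
	}
	var ready []string
	for _, p := range pods.Items {
		if isPodReady(p) {
			ready = append(ready, p.Name)
		}
	}
	return ready, nil
}
```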
I think you're talking about a slightly different problem. More of an addition to this existing PR which we should be able to handle in another PR building on top of this. What do you think?
I'd be happy to go forward with this and take care of other improvements in future PRs.
Just curious: this would not work when multiple share the same hashring, right?
This is crucial so that requests are not sent to receivers too early. In combination with https://github.com/thanos-io/thanos/pull/3845 for quicker reloading of Thanos Receive routes, I was able to achieve >99% availability in errors and latency while scaling up from 3 replicas to 20.