observatorium / thanos-receive-controller

Kubernetes controller to automatically configure Thanos receive hashrings
Apache License 2.0
96 stars 46 forks source link

Update Hash ring only with the in-ready-status replicas of Statefulset #70

Closed spaparaju closed 3 years ago

spaparaju commented 3 years ago

Currently Hash ring is getting updated based on the .spec of the statefulset that is being watched. There are few edge cases where Hash ring is updated pre maturely (eg: replicas take some time to come up) / Hash ring is updated with incorrect replicas if the scaling of statefulset would not succeed (eg: not enough resources on the cluster). Downside of this behaviour is that requests to statefulsets like Thanos-default-receive result in temporary (with successful scaling up of a statefulset) / permanent (with unsuccessful scaling up of a statefulset) HTTP 500s.

This fix update Hash ring only with the replicas of the statefulset which are in 'Ready' status.

spaparaju commented 3 years ago

As this PR updates hash ring based on replicas in the 'Ready state', understandably tests are failing as the current tests are performed against a 'fake cluster'.

bwplotka commented 3 years ago

And let's think about the typical scenario of single node getting down for a while (restart,crash). We might not want to retrigger whole hashring update and causing cascading error 🤗

Maybe there is something in between we could do?

bwplotka commented 3 years ago

cc @brancz

metalmatze commented 3 years ago

As part of an effort to make auto scaling more reliable I happened to work on the same stuff. PR incoming.

metalmatze commented 3 years ago

I've commented on that topic on the CNCF Slack in #thanos-dev. Trying not to repeat the findings please allow me to simply link you there: https://cloud-native.slack.com/archives/CL25937SP/p1616512341031300

metalmatze commented 3 years ago

I've opened my attempt at solving the problem in https://github.com/observatorium/thanos-receive-controller/pull/71 Seems like we've accidentally work on this at the same time. Sorry.

spaparaju commented 3 years ago

closing this PR in favor of https://github.com/observatorium/thanos-receive-controller/pull/75 to address the scenario of Hashring contain endpoints of replicas in ready status even if scaling up statefulsets do not reach intentended # of replicas