spaparaju closed this pull request 3 years ago
Since this PR updates the hash ring based on replicas in the 'Ready' state, the tests are understandably failing: the current tests run against a 'fake cluster', where replicas never actually report Ready status.
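For reference, a minimal sketch of why that happens, using client-go's fake clientset: nothing actually runs in a fake cluster, so no controller ever reconciles the StatefulSet status, and `ReadyReplicas` stays at zero unless the test seeds it by hand. The object names below are illustrative, not the controller's actual test code.

```go
package main

import (
	"context"
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes/fake"
)

func int32Ptr(i int32) *int32 { return &i }

func main() {
	// In a fake clientset, status.readyReplicas is whatever the test sets it
	// to -- zero by default. A hashring built only from Ready replicas would
	// therefore stay empty unless the fixture fakes the status explicitly.
	client := fake.NewSimpleClientset(&appsv1.StatefulSet{
		ObjectMeta: metav1.ObjectMeta{Name: "thanos-receive-default", Namespace: "thanos"},
		Spec:       appsv1.StatefulSetSpec{Replicas: int32Ptr(3)},
		Status:     appsv1.StatefulSetStatus{ReadyReplicas: 3}, // seed Ready status by hand
	})

	sts, err := client.AppsV1().StatefulSets("thanos").Get(
		context.TODO(), "thanos-receive-default", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("spec wants %d replicas, %d are 'Ready'\n",
		*sts.Spec.Replicas, sts.Status.ReadyReplicas)
}
```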
And let's think about the typical scenario of a single node going down for a while (restart, crash). We might not want to retrigger a whole hashring update and cause cascading errors 🤗
Maybe there is something in between we could do?
cc @brancz
As part of an effort to make autoscaling more reliable, I happened to work on the same stuff. PR incoming.
I've commented on that topic on the CNCF Slack in #thanos-dev. Rather than repeat the findings, please allow me to simply link you there: https://cloud-native.slack.com/archives/CL25937SP/p1616512341031300
I've opened my attempt at solving the problem in https://github.com/observatorium/thanos-receive-controller/pull/71. Seems like we've accidentally worked on this at the same time. Sorry.
Closing this PR in favor of https://github.com/observatorium/thanos-receive-controller/pull/75, which addresses the scenario where the hashring should contain the endpoints of replicas in Ready status even if scaling up the StatefulSet does not reach the intended number of replicas.
Currently the hash ring is updated based on the .spec of the StatefulSet being watched. There are a few edge cases where the hash ring is updated prematurely (e.g., replicas take some time to come up) or with incorrect replicas when scaling the StatefulSet does not succeed (e.g., not enough resources on the cluster). The downside of this behaviour is that requests to StatefulSets like thanos-default-receive result in temporary (when the scale-up eventually succeeds) or permanent (when the scale-up fails) HTTP 500s.
This fix updates the hash ring only with the replicas of the StatefulSet that are in 'Ready' status.
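For illustration, a minimal sketch of the idea, assuming the controller derives endpoints from the StatefulSet's stable ordinal pod names and its governing headless service (`endpointsForReadyReplicas`, the port, and the names used here are hypothetical, not the controller's actual API):

```go
package main

import (
	"fmt"

	appsv1 "k8s.io/api/apps/v1"
)

// endpointsForReadyReplicas returns one endpoint per Ready replica rather than
// one per spec.Replicas. It relies on the stable ordinal pod names a
// StatefulSet guarantees (<name>-0, <name>-1, ...) and assumes the default
// OrderedReady pod management, so that the Ready pods are the lowest ordinals.
func endpointsForReadyReplicas(sts *appsv1.StatefulSet, port int) []string {
	endpoints := make([]string, 0, sts.Status.ReadyReplicas)
	for i := int32(0); i < sts.Status.ReadyReplicas; i++ {
		endpoints = append(endpoints, fmt.Sprintf("%s-%d.%s.%s.svc.cluster.local:%d",
			sts.Name, i, sts.Spec.ServiceName, sts.Namespace, port))
	}
	return endpoints
}

func main() {
	// With spec.Replicas = 5 but only 3 Ready, the old behaviour would put
	// two endpoints that answer HTTP 500 into the hashring; this keeps three.
	sts := &appsv1.StatefulSet{
		Status: appsv1.StatefulSetStatus{ReadyReplicas: 3},
	}
	sts.Name, sts.Namespace, sts.Spec.ServiceName =
		"thanos-receive-default", "thanos", "thanos-receive"
	fmt.Println(endpointsForReadyReplicas(sts, 19291))
}
```

Note the OrderedReady assumption: with Parallel pod management, a non-Ready pod in the middle of the ordinal range would make a plain `ReadyReplicas` count insufficient, which ties into the single-node-restart concern raised above.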