Closed matej-g closed 2 years ago
Hi,
I have build an image from this PR, and unfortunately the controller doesnt seems to behave as expected. What I am expected: While scaling up my receive STS, the controller updates de configmaps only when replicas are ready
My architecture looks like this: thanos distributor -> n* thanos receive ingester. The hashring file is present only on the distributor and is managed by the thanos receive controller, my replication.factor is at 3.
My controller config:
containers:
- args:
- --configmap-name=thanos-receive-base
- --configmap-generated-name=thanos-receive-generated
- --file-name=hashrings.json
- --namespace=$(NAMESPACE)
- --allow-only-ready-replicas=true
I did a scale up from 15 to 18 replicas, so directly 3 replicas in one update and my service went down with the errors below:
level=debug ts=2022-09-01T12:15:31.639283401Z caller=handler.go:555 component=receive component=receive-handler tenant=lgcy msg="failed to handle request" err="2 errors: replicate write request for endpoint thanos-test-receiver-16.thanos-test-receiver-headless.meta-monitoring-test:10901: quorum not reached: target not available; replicate write request for endpoint thanos-test-receiver-15.thanos-test-receiver-headless.meta-monitoring-test:10901: quorum not reached: target not available"
The configmap generated by the controller has been update with all the pods including the not ready ones and also the not created ones.
Did I miss something in my test? Thank in advance.
My controller's serviceaccount didnt have read permission on the pod resources. After I have edited the the controller role it is working fine:
- apiGroups:
- ""
resources:
- "pods"
verbs:
- "get"
- "watch"
- "list"
So I have tried again and this Pr doesn't work. I have build a image based on https://github.com/observatorium/thanos-receive-controller/pull/89 and it works as expected. I have noticed than in your PR some part was missing, for example https://github.com/michael-burt/thanos-receive-controller/blob/allow-only-ready-replicas/main.go#L595
So I have tried again and this Pr doesn't work. I have build a image based on #89 and it works as expected. I have noticed than in your PR some part was missing, for example https://github.com/michael-burt/thanos-receive-controller/blob/allow-only-ready-replicas/main.go#L595
Hey @lud97x thanks for taking the time to try this out! So after you adjusted the service account permissions, did it work?
The reason for the difference is stated in the README update I have added as part of this PR. I removed that part because scaling on every pod (un)readiness could potentially lead to a frequent hashring changes, see the explanation in the README in my PR.
Based on work done in #89
This change adds a flag
--allow-only-ready-replicas
that changes the behavior of controller on a scale up - if enabled, the controller will first wait on all replicas to be ready before adding them to the hashring. The feature is documented as well.