observatorium / thanos-receive-controller

Kubernetes controller to automatically configure Thanos receive hashrings
Apache License 2.0

Generated Hashring to contain only the statefulset replicas in the Ready status #75

Closed · spaparaju closed this 2 years ago

spaparaju commented 3 years ago

Currently, when the StatefulSet is scaled up (from a lower to a higher number of replicas), pods are checked for the Ready condition before their respective endpoints are added to the hashring ConfigMap. However, in subsequent runs of the thanos-receive-controller loop, the full StatefulSet spec.replicas value (the intended number of replicas) is used for hashring generation. This can be a problem if the scale-up fails to reach the intended number of replicas.

This PR: generates the hashring from only the StatefulSet replicas that are in the Ready status.
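
A minimal sketch of the readiness filter this PR describes, using simplified, illustrative code rather than the controller's actual implementation (the endpoint format below is an assumption):

```go
package controller

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podIsReady reports whether the pod's Ready condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// readyEndpoints builds hashring endpoints from the pods that are actually
// Ready, instead of emitting spec.replicas entries unconditionally.
func readyEndpoints(pods []*corev1.Pod, port int) []string {
	var endpoints []string
	for _, pod := range pods {
		if podIsReady(pod) {
			// Illustrative address only; the real controller may derive
			// stable addresses for StatefulSet pods differently.
			endpoints = append(endpoints, fmt.Sprintf("%s:%d", pod.Status.PodIP, port))
		}
	}
	return endpoints
}
```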

spaparaju commented 3 years ago

cc @metalmatze @squat

spaparaju commented 3 years ago

@metalmatze can I get a review for this PR? Thanks!

metalmatze commented 3 years ago

I'm still not 100% sure this is really needed. :thinking: The scale-up checks that each individual pod is ready before continuing with the scale-up, so at the end of the iteration all the Pods are ready? What exactly am I missing?

spaparaju commented 3 years ago

This fix covers a specific case: updating the hashring when a scale-up does not reach the desired number of replicas. Without this fix, successive runs of the operator's sync() loop update the hashring with the desired number of replicas (spec.replicas), even when some of those replicas are not Ready.

bwplotka commented 3 years ago

I would love to test this out locally first within e2e. @ianbillett could you take a look? (Not write a test, just check whether this makes sense.)

spaparaju commented 3 years ago

Thanks for updating this. I still think making hashring changes on every unready pod can cause a failover scenario.

For example:

  • If we have 7 replicas
  • The 7th suddenly starts crashing for some reason
  • The controller changes the hashring to contain only 6 replicas
  • That increases the load on the remaining 6 replicas, so e.g. the 6th starts crashing too (see the rough arithmetic after this list)
  • Now the controller changes the hashring to 5 replicas, which brings even more load onto those 5
  • The whole system is on fire for all tenants.
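
Rough arithmetic behind the scenario above, assuming writes are spread evenly across hashring members (a simplification): every replica removed from the ring multiplies the per-replica load by N/(N-1).

```go
package main

import "fmt"

func main() {
	// With 7 replicas each carries ~14.3% of the load; dropping to 6 pushes
	// that to ~16.7% (about 17% more per replica), and dropping to 5 pushes
	// it to 20% (about 40% more than with 7).
	for _, n := range []int{7, 6, 5} {
		fmt.Printf("%d ready replicas -> %.1f%% of the write load each\n", n, 100.0/float64(n))
	}
}
```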

This scenario (one of the replicas in Ready status going down and no longer able to serve traffic) is a possible case; limiting the incoming traffic (based on the count of replicas in Ready status) can be one solution. The role of the hashring controller (or of the plain Kube StatefulSet controller) is to keep in the hashring, and send traffic to, only the replica addresses that can serve traffic. WDYT? 🤔

I think we should wait to increase the replica count in the hashring until the first pod is ready, yes, but later unreadiness should be treated as a crash, no? The hashring would contain its first valid address only after the first StatefulSet pod is ready to serve traffic.

bwplotka commented 2 years ago

To sum up our offline discussion on our sync:

  • This change overall makes sense from the controller point of view. As mentioned above, from the network-proxy point of view it makes sense to expose the current situation in the hashring, allowing receive to deal with missing pods.
  • We don't have all the measures in place to stop cascading failures, so we should not merge this PR until we have them. We are missing a local per-receive rate limit based on current capacity, e.g. the resources we use. The current gubernator helps us, but it is very manual and only has statically set rate limiting for the whole system.

Next steps:

  1. Add a backoff rate limit per receive that can auto-tune based on resource limits (rough sketch below)
  2. Merge this PR when we do step 1
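
For illustration only, here is a rough sketch of what an auto-tuning per-receive rate limiter could look like. This is a hypothetical design, not the eventual Thanos implementation; the type names, the memory-based heuristic, and the 10% floor are all assumptions.

```go
package receivelimit

import (
	"runtime"

	"golang.org/x/time/rate"
)

// AutoTunedLimiter wraps a token-bucket limiter and rescales its rate based
// on how close the process is to a configured memory budget, so an unhealthy
// receive replica sheds load instead of falling over.
type AutoTunedLimiter struct {
	limiter     *rate.Limiter
	maxRate     rate.Limit // requests/sec allowed when resource usage is low
	memoryLimit uint64     // soft memory budget in bytes (e.g. taken from container limits)
}

func NewAutoTunedLimiter(maxRate rate.Limit, memoryLimit uint64) *AutoTunedLimiter {
	return &AutoTunedLimiter{
		limiter:     rate.NewLimiter(maxRate, int(maxRate)),
		maxRate:     maxRate,
		memoryLimit: memoryLimit,
	}
}

// Retune lowers the allowed rate proportionally to current heap usage.
// It is meant to be called periodically, e.g. from a ticker goroutine.
func (l *AutoTunedLimiter) Retune() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	utilization := float64(m.HeapAlloc) / float64(l.memoryLimit)
	scale := 1.0 - utilization
	if scale < 0.1 {
		scale = 0.1 // never drop below 10% of the configured maximum rate
	}
	l.limiter.SetLimit(rate.Limit(float64(l.maxRate) * scale))
}

// Allow reports whether one more remote-write request should be admitted now.
func (l *AutoTunedLimiter) Allow() bool {
	return l.limiter.Allow()
}
```
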
bwplotka commented 2 years ago

Added ticket: https://github.com/thanos-io/thanos/issues/4425

spaparaju commented 2 years ago

To sum up our offline discussion on our sync:

  • This change overall makes sense from the controller point of view. As mentioned above, from the network-proxy point of view it makes sense to expose the current situation in the hashring, allowing receive to deal with missing pods.
  • We don't have all the measures in place to stop cascading failures, so we should not merge this PR until we have them. We are missing a local per-receive rate limit based on current capacity, e.g. the resources we use. The current gubernator helps us, but it is very manual and only has statically set rate limiting for the whole system.

Next steps:

  1. Add backoff rate limit per receive that can auto-tune based on resource limits
  2. Merge this PR when we do step 1

Backoff rate limiting per receive makes sense. Thanks for creating an issue for tracking. Cascading failures and backoff rate limiting are independent of this PR, as cascading failures can happen even with the current master release of the receive-controller (a failed write to a not-ready pod will be retried by the client and passed on to a ready pod).

bwplotka commented 2 years ago

Still, I will be experimenting with different ideas here. My plan is also to create local test scenarios using e2e to mimic the controller and failure cases.

bwplotka commented 2 years ago

Early attempt already not working as expected: https://github.com/thanos-io/thanos/pull/4968

spaparaju commented 2 years ago

There are two parts to the problem of ingestion failures:

  1. A few replicas in the non-Ready status are sneaking into the hashring
  2. The need for a better, consistent hashring (a generic sketch of consistent hashing follows below)

This PR fixes (1) from the above list.
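
For context on item (2), a generic consistent-hashing sketch (a hash ring with virtual nodes): adding or removing one endpoint only remaps a small fraction of the keys. This is the textbook technique, not the actual hashring implementation in Thanos, and all names below are illustrative.

```go
package hashring

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring maps keys to endpoints so that membership changes only remap a small
// fraction of the keys, unlike a plain modulo-based assignment.
type Ring struct {
	hashes []uint32          // sorted virtual-node hashes
	owners map[uint32]string // virtual-node hash -> endpoint
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// New places virtualNodes points on the ring for every endpoint.
func New(endpoints []string, virtualNodes int) *Ring {
	r := &Ring{owners: map[uint32]string{}}
	for _, e := range endpoints {
		for i := 0; i < virtualNodes; i++ {
			h := hashKey(fmt.Sprintf("%s-%d", e, i))
			r.hashes = append(r.hashes, h)
			r.owners[h] = e
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Get returns the endpoint owning the key: the first virtual node clockwise
// from the key's hash, wrapping around at the end of the ring.
func (r *Ring) Get(key string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owners[r.hashes[i]]
}
```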

bill3tt commented 2 years ago

@spaparaju can we close this PR now that we have https://github.com/observatorium/thanos-receive-controller/pull/80?

spaparaju commented 2 years ago

Yes, Ian. I opened a new PR (#80) to fix this bug. Closing this PR.