observatorium / thanos-receive-controller

Kubernetes controller to automatically configure Thanos receive hashrings
Apache License 2.0

Generated Hashring to contain only the statefulset replicas in the Ready status #75

Closed · spaparaju closed this 2 years ago

spaparaju commented 3 years ago

Currently, when the StatefulSet is scaled up (from a lower to a higher number of replicas), pods are checked for the Ready condition before their respective endpoints are added to the hashring ConfigMap. However, in subsequent runs of the thanos-receive-controller loop, the full StatefulSet spec.replicas value (the intended number of replicas) is used for hashring generation. This can be a problem if the scale-up fails to reach the intended number of replicas.

This PR: generates the hashring from only the StatefulSet replicas that are in the Ready status.
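
A minimal sketch of the readiness filter this PR describes, using simplified, illustrative code rather than the controller's actual implementation (the endpoint format below is an assumption):

```go
package controller

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// podIsReady reports whether the pod's Ready condition is True.
func podIsReady(pod *corev1.Pod) bool {
	for _, c := range pod.Status.Conditions {
		if c.Type == corev1.PodReady {
			return c.Status == corev1.ConditionTrue
		}
	}
	return false
}

// readyEndpoints builds hashring endpoints from the pods that are actually
// Ready, instead of emitting spec.replicas entries unconditionally.
func readyEndpoints(pods []*corev1.Pod, port int) []string {
	var endpoints []string
	for _, pod := range pods {
		if podIsReady(pod) {
			// Illustrative address only; the real controller may derive
			// stable addresses for StatefulSet pods differently.
			endpoints = append(endpoints, fmt.Sprintf("%s:%d", pod.Status.PodIP, port))
		}
	}
	return endpoints
}
```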

spaparaju commented 3 years ago

cc @metalmatze @squat

spaparaju commented 3 years ago

@metalmatze can I get a review for this PR? Thanks!

metalmatze commented 3 years ago

I'm still not 100% sure this is really needed. :thinking: The scale-up checks that each individual pod is ready before continuing with the scale-up, so at the end of the iteration all the Pods are ready? What exactly am I missing?

spaparaju commented 3 years ago

This fix covers a specific case: updating the hashring when a scale-up does not reach the desired number of replicas. Without this fix, successive runs of the operator's sync() loop update the hashring with the desired number of replicas (spec.replicas), even when some of those replicas are not Ready.

bwplotka commented 3 years ago

I would love to test this out locally first within e2e. @ianbillett could you take a look? (Not write a test, just check whether this makes sense.)

spaparaju commented 3 years ago

Thanks for updating this. I still think making hashring changes on every unready pod can cause a failover scenario.

For example:

  • If we have 7 replicas
  • The 7th suddenly starts crashing for some reason
  • The controller changes the hashring to contain only 6 replicas
  • That increases the load on the remaining 6 replicas, so e.g. the 6th starts crashing too (see the rough arithmetic after this list)
  • Now the controller changes the hashring to 5 replicas, which brings even more load onto those 5
  • The whole system is on fire for all tenants.
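
Rough arithmetic behind the scenario above, assuming writes are spread evenly across hashring members (a simplification): every replica removed from the ring multiplies the per-replica load by N/(N-1).

```go
package main

import "fmt"

func main() {
	// With 7 replicas each carries ~14.3% of the load; dropping to 6 pushes
	// that to ~16.7% (about 17% more per replica), and dropping to 5 pushes
	// it to 20% (about 40% more than with 7).
	for _, n := range []int{7, 6, 5} {
		fmt.Printf("%d ready replicas -> %.1f%% of the write load each\n", n, 100.0/float64(n))
	}
}
```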

This scenario (one of the replicas in Ready status going down and no longer able to serve traffic) is a possible case; limiting the incoming traffic (based on the count of replicas in Ready status) can be one solution. The role of the hashring controller (or of the plain Kube StatefulSet controller) is to keep in the hashring, and send traffic to, only the replica addresses that can serve traffic. WDYT? 🤔

I think we should wait to increase the replica count in the hashring until the first pod is ready, yes, but later unreadiness should be treated as a crash, no? The hashring would contain its first valid address only after the first StatefulSet pod is ready to serve traffic.

bwplotka commented 2 years ago

To sum up our offline discussion on our sync:

  • This change overall makes sense from the controller point of view. As mentioned above, from the network-proxy point of view it makes sense to expose the current situation in the hashring, allowing receive to deal with missing pods.
  • We don't have all the measures in place to stop cascading failures, so we should not merge this PR until we have them. We are missing a local per-receive rate limit based on current capacity, e.g. the resources we use. The current gubernator helps us, but it is very manual and only has statically set rate limiting for the whole system.

Next steps:

  1. Add a backoff rate limit per receive that can auto-tune based on resource limits (rough sketch below)
  2. Merge this PR when we do step 1
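
For illustration only, here is a rough sketch of what an auto-tuning per-receive rate limiter could look like. This is a hypothetical design, not the eventual Thanos implementation; the type names, the memory-based heuristic, and the 10% floor are all assumptions.

```go
package receivelimit

import (
	"runtime"

	"golang.org/x/time/rate"
)

// AutoTunedLimiter wraps a token-bucket limiter and rescales its rate based
// on how close the process is to a configured memory budget, so an unhealthy
// receive replica sheds load instead of falling over.
type AutoTunedLimiter struct {
	limiter     *rate.Limiter
	maxRate     rate.Limit // requests/sec allowed when resource usage is low
	memoryLimit uint64     // soft memory budget in bytes (e.g. taken from container limits)
}

func NewAutoTunedLimiter(maxRate rate.Limit, memoryLimit uint64) *AutoTunedLimiter {
	return &AutoTunedLimiter{
		limiter:     rate.NewLimiter(maxRate, int(maxRate)),
		maxRate:     maxRate,
		memoryLimit: memoryLimit,
	}
}

// Retune lowers the allowed rate proportionally to current heap usage.
// It is meant to be called periodically, e.g. from a ticker goroutine.
func (l *AutoTunedLimiter) Retune() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	utilization := float64(m.HeapAlloc) / float64(l.memoryLimit)
	scale := 1.0 - utilization
	if scale < 0.1 {
		scale = 0.1 // never drop below 10% of the configured maximum rate
	}
	l.limiter.SetLimit(rate.Limit(float64(l.maxRate) * scale))
}

// Allow reports whether one more remote-write request should be admitted now.
func (l *AutoTunedLimiter) Allow() bool {
	return l.limiter.Allow()
}
```
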
bwplotka commented 2 years ago

Added ticket: https://github.com/thanos-io/thanos/issues/4425

spaparaju commented 2 years ago

To sum up our offline discussion on our sync:

  • This change overall makes sense from the controller point of view. As mentioned above, from the network-proxy point of view it makes sense to expose the current situation in the hashring, allowing receive to deal with missing pods.
  • We don't have all the measures in place to stop cascading failures, so we should not merge this PR until we have them. We are missing a local per-receive rate limit based on current capacity, e.g. the resources we use. The current gubernator helps us, but it is very manual and only has statically set rate limiting for the whole system.

Next steps:

  1. Add backoff rate limit per receive that can auto-tune based on resource limits
  2. Merge this PR when we do step 1

Backoff rate limiting per receive makes sense. Thanks for creating an issue for tracking. Cascading failures and backoff rate limiting are independent of this PR, as cascading failures can happen even with the current master release of the receive-controller (a failed write to a not-ready pod will be retried by the client and passed on to a ready pod).

bwplotka commented 2 years ago

Still, I will be experimenting with different ideas here. My plan is also to create local test scenarios using e2e to mimic the controller and failure cases.

bwplotka commented 2 years ago

Early attempt already not working as expected: https://github.com/thanos-io/thanos/pull/4968

spaparaju commented 2 years ago

There are two parts to the problem of ingestion failures:

  1. A few replicas in the non-Ready status are sneaking into the hashring
  2. The need for a better, consistent hashring (a generic sketch of consistent hashing follows below)

This PR fixes (1) from the above list.
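
For context on item (2), a generic consistent-hashing sketch (a hash ring with virtual nodes): adding or removing one endpoint only remaps a small fraction of the keys. This is the textbook technique, not the actual hashring implementation in Thanos, and all names below are illustrative.

```go
package hashring

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// Ring maps keys to endpoints so that membership changes only remap a small
// fraction of the keys, unlike a plain modulo-based assignment.
type Ring struct {
	hashes []uint32          // sorted virtual-node hashes
	owners map[uint32]string // virtual-node hash -> endpoint
}

func hashKey(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// New places virtualNodes points on the ring for every endpoint.
func New(endpoints []string, virtualNodes int) *Ring {
	r := &Ring{owners: map[uint32]string{}}
	for _, e := range endpoints {
		for i := 0; i < virtualNodes; i++ {
			h := hashKey(fmt.Sprintf("%s-%d", e, i))
			r.hashes = append(r.hashes, h)
			r.owners[h] = e
		}
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

// Get returns the endpoint owning the key: the first virtual node clockwise
// from the key's hash, wrapping around at the end of the ring.
func (r *Ring) Get(key string) string {
	if len(r.hashes) == 0 {
		return ""
	}
	h := hashKey(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owners[r.hashes[i]]
}
```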

bill3tt commented 2 years ago

@spaparaju can we close this PR now that we have https://github.com/observatorium/thanos-receive-controller/pull/80?

spaparaju commented 2 years ago

Yes, Ian. I opened a new PR (#80) to fix this bug. Closing this PR.