annotate all pods under our domain, not just those in the STS

tekicode commented 1 year ago

annotatePods() only affects pods which are running in RouterIngestor mode: Pods which declare a hashring file AND are also a member of that hashring. It works by iterating through every expected pod in the STS. This has a couple of problems:

If the size of the STS is larger than the number of running pods (failed pods) there will be failures to update those pods. This is a nuisance error.
If we're operating in a split setup (Router separately from Ingestor) the controller does not know to annotate the Router pods, it only annotates the discovered STS pods. This means the Routers are left to reload their CM on their fallback timeout, instead of immediately.
Only StatefulSets are affected, if using Routers as Deployments instead of STS, the controller will not annotate them with updates.

This change will discover any pods with the config.StatefulSetLabel value default: controller.receive.thanos.io=thanos-receive-controller LABEL and add a controller.receive.thanos.io/lastControllerUpdate annotation to each pod when the --annotate-pods-on-change flag is used and a new CM is generated.

It is not necessary to specify the controller.receive.thanos.io/hashring label when using this feature on pods which are not part of the hashring.

tekicode commented 1 year ago

The failures in the checks/Generate look unrelated jsonnet updates?

tekicode commented 1 year ago

@matej-g Could you review? Thank you!

saswatamcode commented 1 year ago

cc: @philipgough

squat commented 1 year ago

@tekicode if we go down this road, then we must change the configuration flag name from --statefulset-label [0] to something different, e.g. --label because it is misleading to users to suggest that we only use this to select statefulset pods. In this case, I would suggest we continue to accept --statefulset-label flag and deprecate it in favor of a new one that takes preference over it. This will allow existing users of the thanos-receive-controller to upgrade gracefully.

Separately, I am realizing now that the description of the --annotate-pods flag is also misleading. It indicates that we will annotate the pod with the hash of the latest configuration, but that is not true: we annotate the pods with a timestamp.

[0] https://github.com/observatorium/thanos-receive-controller/blob/master/main.go#L81

philipgough commented 1 year ago

@squat aside from agreeing with all your points I actually think the --annotate-pods flag is risky. It uses the kube client to do an update but it doesn't handle failures very well. Take an example where you might have a large hashring, you go to update a Pod with the annotation and you get a conflict. It just moves on and now you can have different replicas with different understandings of how the hashring looks causing issues with replication.

tekicode commented 1 year ago

@squat aside from agreeing with all your points I actually think the --annotate-pods flag is risky. It uses the kube client to do an update but it doesn't handle failures very well. Take an example where you might have a large hashring, you go to update a Pod with the annotation and you get a conflict. It just moves on and now you can have different replicas with different understandings of how the hashring looks causing issues with replication.

I'm not sure how to address this comment. Without using annotate-pods, kubelet will update the configmap mounted on each container at most once every minute. It's not particularly at the top of the minute, just one minute timeout. The reason for the annotate-pods is to speed that up. So, if it fails, the fallback would be the default behavior (which is not ideal, I'll admit)

I already see annotate-pods failures with the current code; so my changes here do not address the root behavior, only make it more useful.

tekicode commented 1 year ago

@tekicode if we go down this road, then we must change the configuration flag name from --statefulset-label [0] to something different, e.g. --label because it is misleading to users to suggest that we only use this to select statefulset pods. In this case, I would suggest we continue to accept --statefulset-label flag and deprecate it in favor of a new one that takes preference over it. This will allow existing users of the thanos-receive-controller to upgrade gracefully.

Separately, I am realizing now that the description of the --annotate-pods flag is also misleading. It indicates that we will annotate the pod with the hash of the latest configuration, but that is not true: we annotate the pods with a timestamp.

[0] https://github.com/observatorium/thanos-receive-controller/blob/master/main.go#L81

Hi thanks for the review.

Slightly concerned that by changing the flag as you suggest, we would indicate to users that the controller is capable of allowing non statefulset pods into the hashring; which it is not and would not be recommended.

The intention of my change here is for those of us using a split setup, with Statefulsets as endpoint targets, and Deployments working as autoscaling routers. These deployments are never in the endpoints list.

Either way, it sounds like something that is more solved with documentation; would we rather docs say "Do not use --label to select Deployments for the hashring, as --label is only meant for statefulset selection" or "--annotate-pods will annotate ANY pod with this label, not just statefulset pods"

IMO, I believe users would be more confused by trying to use --label and believing it would add Deployments to the endpoints list.

Naming things is hard, it's more important we see the controller do the functionality than picking a name. Let me know how we can move forward!

philipgough commented 1 year ago

I'm not sure how to address this comment. Without using annotate-pods, kubelet will update the configmap mounted on each container at most once every minute. It's not particularly at the top of the minute, just one minute timeout. The reason for the annotate-pods is to speed that up. So, if it fails, the fallback would be the default behavior (which is not ideal, I'll admit)

I already see annotate-pods failures with the current code; so my changes here do not address the root behavior, only make it more useful.

You are not sure how to address it in what way? I'm aware of the reason for the flag to exist in the first place. I was merely commenting within a conversation where we were discussing issues with the existing approach.

From a technical perspective, it can likely be addressed in some way by using an event driven approach and re-enqueuing Pods on failure to update, but that would make the controller significantly more complicated too.

squat commented 1 year ago

re-enqueuing Pods on failure to update

That would mean using controller-runtime tooling like the queue and caches etc. It would for sure make the controller more robust, turning it into a fully-fledged operator pretty much. Not a bad thing! It's just a trade-off: more work and complexity vs robustness and functionality.

squat commented 1 year ago

Another possible approach would be to eschew pod annotation entirely and instead instruct users to install third party tools like https://github.com/wave-k8s/wave or https://github.com/stakater/Reloader of they want "instantaneous" config updates.

tekicode commented 1 year ago

@squat @philipgough I unfortunately don't have the time to invest in making the controller more robust in the sync of the hashring file. Can we table that for another PR or issue?

@philipgough Addressing --statefulset-label vs --label I am going through right now to deprecate statefulset-label. What should the verbiage be for the use of the new flag?

The label Pods must have to be watched by the controller

The label StatefulSets must have to be watched by the controller

The label Workloads must have to be watched by the controller.

Each of these has a different implication for the end user in how the controller can function and I see issues with the language of all three.

Please let me know if this is an impasse. We've found use of this controller in our pre-production environments and it's Pretty Good(tm) just a few things here and there to make it better.

squat commented 1 year ago

@tekicode i agree with doing the Bigger Refactor in a different PR :) as for verbiage, I like The label workloads must have to be watched by the controller. (with a lowercase w).

squat commented 1 year ago

Also, I fixed the generated files in https://github.com/observatorium/thanos-receive-controller/pull/121 so that CI passes. Please rebase your PR on main :)

tekicode commented 1 year ago

@tekicode i agree with doing the Bigger Refactor in a different PR :) as for verbiage, I like The label workloads must have to be watched by the controller. (with a lowercase w).

Thank you for the review @squat Addressed this feedback with https://github.com/observatorium/thanos-receive-controller/pull/120/commits/519d41ca82e81bd46a5dcc735425857c927b5f84

Please let me know what you think!

tekicode commented 1 year ago

Changes in https://github.com/observatorium/thanos-receive-controller/pull/120/commits/45f615cde4438da4b381b290ae76d72b74768492 are due to a limit on statements allowed in main() according to the lint rules.

tekicode commented 1 year ago

@squat @philipgough any chance either of you could review? ty!

tekicode commented 1 year ago

@saswatamcode available to review? ty!

philipgough commented 1 year ago

Sorry, I will take a look at this one asap

tekicode commented 1 year ago

@philipgough Good to merge?

observatorium / thanos-receive-controller

annotate all pods under our domain, not just those in the STS #120