[refinery] Add option to run refinery as a statefulset

honeycombio / helm-charts

Helm repository and charts for Honeycomb

Apache License 2.0

30 stars 39 forks source link

[refinery] Add option to run refinery as a statefulset #237

Open kamalmarhubi opened 1 year ago

kamalmarhubi commented 1 year ago

Using a statefulset allows giving the pods a stable network identity. Setting setHostnameAsFQDN means that this stable network identity is what os.Hostname reports. Together, they allow using the hostname in the peer list, so that the peer list remains stable even as pods are rescheduled. This improves trace routing stability during refinery upgrades and Kubernetes cluster operations (upgrades / scale-downs), and even makes it possible to run refinery with an affinity preference for spot instances.

kamalmarhubi commented 1 year ago

We've been running a statefulset on spot instances in GKE for a couple of days and it looks like it's working out ok.

Main reason this is a draft: should I instead put the statefulset in a separate file and instead avoid repetition via a pod template helper + some mustMerge stuff?

TylerHelmuth commented 1 year ago

Main reason this is a draft: should I instead put the statefulset in a separate file and instead avoid repetition via a pod template helper + some mustMerge stuff?

Ya lets have separate deployment.yaml and statefulset.yaml and try to reuse what we can with helper template. A little bit of duplication is fine if it means simplicity..

kamalmarhubi commented 1 year ago

@TylerHelmuth I finally got around to this. I had to debug some nindent wrongness* in _pod.tpl. I think I got it right, but I only tried our setup, where it didn't result in any diff from the pre-split version (after rebasing onto current HEAD). I don't know if some other sets of values might have annoying nindent stuff wrong, but I hope not.

‌* I love text templating white-space–sensitive structured data 🙈

kamalmarhubi commented 1 year ago

@TylerHelmuth fyi I just rebased this on top of the current main so we could upgrade to refinery 2.x. We've been running refinery as a statefulset for slightly over 3 months at this point with no real issues.

TylerHelmuth commented 1 year ago

@kamalmarhubi thats awesome! I'm glad it is working well. There is some other 2.x cleanup we are prioritizing right now but we'll get back to this PR after.

kamalmarhubi commented 2 months ago

@TylerHelmuth bumpty bump on this PR. We're upgrading refinery and I'm wondering if I should rebase this once more or not.

It kind of sounds like the pub/sub changes in 2.7, specifically sharing peer membership, makes cluster membership changes less disruptive; is that right? I think I'd prefer stable pod identity, but I'd also like to stop maintaining a long-running fork of the official chart—whether by rebasing once more and merging, or dropping the statefulset option.

Let me know which works best for you!

TylerHelmuth commented 2 months ago

@kamalmarhubi you're right, and 2.8 should expand on those features. There is working ongoing in refinery to make scaling it up and down smoother. Let's wait for that work to complete and then revisit if statefulset is still something we'll need to support.

kamalmarhubi commented 2 months ago

@TylerHelmuth sounds good. How does the cluster respond to non-resize membership changes under 2.7? More precisely, how disruptive is it if the membership changes as the deployment's pods are cycled and come up with new names? I recall this was fairly ungood in the past.

TylerHelmuth commented 2 months ago

@kamalmarhubi in 2.7 peer membership updates are now pub/sub, so it happens quicker, but the refinery instances still have to rebalance when that happens.