rqlite / helm-charts

Helm charts for rqlite

Initial release of rqlite Helm chart #1

Closed: jtackaberry closed this 10 months ago

jtackaberry commented 10 months ago

Still in development, but fairly comprehensive already. Submitting a draft PR to begin the review process.

Remaining work items:

While the chart is under active development, I've included the develop branch in the chart release action so that 0.x versions get released; that way the release process itself can be tested before moving to 1.0.0.
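
For context, the release action is essentially the stock chart-releaser workflow with develop added as a trigger branch alongside master; a rough sketch of what I mean (the workflow in this repo may differ in the details):

```yaml
# .github/workflows/release.yaml -- sketch only, not necessarily this repo's exact workflow
name: Release Charts
on:
  push:
    branches:
      - master
      - develop   # publishes 0.x versions while the chart is still in development
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # chart-releaser needs full history/tags to compute releases
      - name: Configure Git
        run: |
          git config user.name "$GITHUB_ACTOR"
          git config user.email "$GITHUB_ACTOR@users.noreply.github.com"
      - name: Run chart-releaser
        uses: helm/chart-releaser-action@v1.6.0
        env:
          CR_TOKEN: "${{ secrets.GITHUB_TOKEN }}"
```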

There's a lot here, especially in values.yaml, so I'd be grateful for any corrections or suggestions on best practices.

The template files themselves are probably less important to review, but feel free. Helm (Go+sprig) template syntax can be a bit arcane (and not in the good way).

jtackaberry commented 10 months ago

@otoolep I also wanted to ask, is there a reason not to use -raft-cluster-remove-shutdown=true by default?

otoolep commented 10 months ago

> @otoolep I also wanted to ask, is there a reason not to use -raft-cluster-remove-shutdown=true by default?

There isn't a strong correlation between shutting down a node, and wanting to actually remove it from the cluster IMHO. Removing a node from a cluster changes the quorum requirements, and that's not something that should happen by default. It should only happen if someone explicitly requests this mode. I believe this is the way Consul works.

jtackaberry commented 10 months ago

> There isn't a strong correlation between shutting down a node, and wanting to actually remove it from the cluster IMHO.

Sorry, my question as phrased was a dumb one. I neglected to mention an important detail:

I was thinking about this specifically in the context of read-only nodes: I was contemplating how Horizontal Pod Autoscaling (HPA) would work with the readonly pool once I implement it, since supporting autoscaling for readonly replicas would be a nice capability.

HPA works basically equivalently to kubectl scale statefulset/rqlite --replicas=<n>. When scaling up, readonly nodes will join the cluster just fine. When scaling down, per https://rqlite.io/docs/clustering/read-only-nodes/#handling-failure, it will generate quite a lot of noise in the remaining nodes' logs (though AFAICT it's otherwise harmless?).

-raft-cluster-remove-shutdown=true would handle this for scale-down operations. Of course, it does mean, as you say, the readonly node will leave the cluster even on standard restarts. But is this a problem?
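
To make that concrete, the sort of HPA I have in mind for the readonly pool would look roughly like this; the StatefulSet name and the CPU target are assumptions, since none of this exists in the chart yet:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: rqlite-readonly            # hypothetical name for the readonly pool
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet              # HPA can target StatefulSets, not just Deployments
    name: rqlite-readonly
  minReplicas: 1
  maxReplicas: 7
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 75   # scale out when average CPU exceeds 75%
```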

jtackaberry commented 10 months ago

Although, in the context of voters, it does occur to me the chart should offer some means of implementing https://rqlite.io/docs/clustering/general-guidelines/#dealing-with-failure

More generally, if the user explicitly lowers the replicaCount chart value, it's probably not unreasonable to assume they do want to jettison those nodes. I could generate peers.json -- AFAICT everything needed to generate that file is known at chart deployment time (using FQDNs instead of IPs for address of course) -- but I think this could be problematic in the context of readonly replicas and autoscaling them.

Once I get the readonly stuff implemented I'll experiment with that.
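
For reference, the file in question is the peers.json recovery format from the rqlite docs; one way the chart could render it, sketched here as a ConfigMap with assumed node IDs (pod names), an assumed headless Service name, and the default raft port (4002), would be:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rqlite-peers        # hypothetical; not what the chart actually ships
data:
  peers.json: |
    [
      { "id": "rqlite-0", "address": "rqlite-0.rqlite-headless.default.svc.cluster.local:4002", "non_voter": false },
      { "id": "rqlite-1", "address": "rqlite-1.rqlite-headless.default.svc.cluster.local:4002", "non_voter": false },
      { "id": "rqlite-2", "address": "rqlite-2.rqlite-headless.default.svc.cluster.local:4002", "non_voter": false }
    ]
```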

Alternatively, as suggested by the docs, I could relegate peers.json to "break glass in case of emergency" situations and have the chart provide some other logic to support scaling voters down. This is trickier than it sounds, because it means that inside the pod there needs to be a way to dynamically determine whether it should leave the cluster on termination.

It's tricky but not impossible: I could have a custom preStop script that uses curl to call the K8s API to fetch the current number of replicas for the StatefulSet it belongs to, and if the current number is less than its node id, gracefully leave the cluster, otherwise don't. (Ok, I see curl isn't in the rqlite container, so maybe a sidecar that uses docker.io/curlimages/curl.) That's clever, but I'll need to chew on this for a bit to decide if it's too clever.
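
To sketch what I mean (purely hypothetical, not something the chart does): a curl sidecar whose preStop hook asks the Kubernetes API how many replicas the StatefulSet currently wants, and only tells the local rqlite node to leave the cluster if this pod's ordinal falls outside that count, i.e. it's being scaled away rather than merely restarted. The ServiceAccount permissions, the crude JSON parsing, and the assumption that the node ID equals the pod name are all mine:

```yaml
# Hypothetical preStop sidecar; assumes a ServiceAccount allowed to "get" the
# StatefulSet, and that the rqlite node ID is the pod name (e.g. rqlite-2).
- name: prestop-helper
  image: docker.io/curlimages/curl:8.6.0
  command: ["/bin/sh", "-c", "sleep 2147483647"]   # idle; only the preStop hook matters
  lifecycle:
    preStop:
      exec:
        command:
          - /bin/sh
          - -c
          - |
            NS=$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)
            TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
            POD=$(hostname)
            ORDINAL=${POD##*-}
            # Crude extraction of spec.replicas (the curl image has no jq); with the
            # pretty-printed response the first "replicas" line is the one under .spec.
            REPLICAS=$(curl -sSk -H "Authorization: Bearer $TOKEN" \
              "https://kubernetes.default.svc/apis/apps/v1/namespaces/$NS/statefulsets/rqlite?pretty=true" \
              | sed -n 's/.*"replicas": *\([0-9]*\).*/\1/p' | head -n 1)
            if [ -n "$REPLICAS" ] && [ "$ORDINAL" -ge "$REPLICAS" ]; then
              # This ordinal no longer fits in the desired replica count, so we're being
              # scaled away: ask rqlite (same pod network namespace) to remove this node.
              curl -s -XDELETE "http://localhost:4001/remove" -d "{\"id\": \"$POD\"}"
            fi
```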

jtackaberry commented 10 months ago

Last commit adds support for read-only nodes.

> -raft-cluster-remove-shutdown=true would handle this for scale-down operations. Of course, it does mean, as you say, the readonly node will leave the cluster even on standard restarts. But is this a problem?

I included -raft-cluster-remove-shutdown=true for the readonly pods as I couldn't see any obvious downside. With cursory testing, dynamic scaling up and down has behaved flawlessly. Although I'd like to do a bit more testing with heavy mutation load during scale events to ensure readonly nodes rejoining the cluster don't serve up stale data from their previous membership.

That is to say, I want to verify the readiness probe isn't reporting ready before the node finishes syncing data from the leader. (I did check this hastily and it seemed OK, but I wasn't driving any real load.) I know there's the freshness parameter, but as I understand it that's more a knob to tune the window of eventual consistency in the face of network blips than a way to deal with a node rejoining a cluster after a day (or whatever).
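
(For concreteness, by "readiness probe" I mean the usual HTTP check against rqlite's /readyz endpoint, something along these lines; the port and timings here are illustrative rather than the chart's exact values:)

```yaml
readinessProbe:
  httpGet:
    path: /readyz      # rqlite's readiness endpoint
    port: 4001         # default rqlite HTTP port
  periodSeconds: 5
  timeoutSeconds: 2
  failureThreshold: 3
```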

The only thing left on my TODO list is for the chart to provide a mechanism for generating peers.json as a way to recover from permanent node loss, per the procedure documented at https://rqlite.io/docs/clustering/general-guidelines/#dealing-with-failure

jtackaberry commented 10 months ago

> The only thing left on my TODO list is for the chart to provide a mechanism for generating peers.json as a way to recover from permanent node loss

This is now done, via the useStaticPeers chart option.

I think this is all fairly baked at this point, so I'm removing draft status from the PR. I'll defer the backup functionality to post 1.0, because it's technically possible now with extraArgs.
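
For anyone who needs backups before then, what I mean by "technically possible now with extraArgs" is roughly the following, assuming extraArgs is just a list of additional rqlited flags and using rqlite's -auto-backup option; the config file path is made up and the referenced file would still need to be mounted into the pod:

```yaml
# values.yaml fragment -- hypothetical, untested
extraArgs:
  - -auto-backup=/etc/rqlite/backup.json   # rqlite automatic-backup configuration file
```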

jtackaberry commented 10 months ago

Actually retagging as draft: it occurs to me that the procedure I documented for shrinking the number of voters has problems.

The procedure was to first use the API to remove the node, then stop the pods. The problem with this is it will take up to 20 seconds for the removed node(s) to fail their readiness probe, during which time they will still receive requests via the K8s Service, and return 503s.

For that to work, we'd need some way to first mark the nodes as soft-down, such that they fail the readiness probe and get removed from the K8s Service (and therefore stop taking new traffic), but continue to serve in-flight transactions. Then, once removed from the K8s Service, they can be removed from the cluster.

So instead I guess the right order is to first shut down the nodes, then remove them. The main problems with this are:

  1. If you want to shut down more than ceiling(N/2)-1 nodes, you need to do them in batches of at most ceiling(N/2)-1 nodes (e.g., with 5 voters, at most 2 at a time), or else the decom process will cause loss of quorum.
  2. The procedure will generate a lot of noise in the logs (nodes being shut down before being removed).
  3. I'm passing the value of replicaCount to -bootstrap-expect, which means that when replicaCount changes, all nodes in the cluster roll. This is really suboptimal for a decom procedure, where you really don't want the cluster churning six ways from Sunday.

I think 1 and 2 are tolerable. 3 requires a change to how I'm passing -bootstrap-expect. I can build on the trick I used for useStaticPeers, in which a ConfigMap value triggers the behavior rather than a change to the pod spec: changing the pod spec triggers a rollout restart, whereas changing the ConfigMap doesn't.
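
Roughly, the same trick applied to -bootstrap-expect would look like this; the names and paths are illustrative, not necessarily what the chart ends up doing:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rqlite-config
data:
  bootstrap-expect: "3"    # derived from replicaCount at template-render time
---
# Relevant fragment of the StatefulSet pod template. Because the value is read
# from the mounted ConfigMap at container startup rather than baked into the pod
# spec, changing replicaCount no longer changes the pod template, so no rollout
# restart. (Other rqlited flags and the data directory are omitted/assumed.)
containers:
  - name: rqlite
    image: rqlite/rqlite
    command: ["/bin/sh", "-c"]
    args:
      - exec rqlited -bootstrap-expect "$(cat /opt/rqlite/config/bootstrap-expect)" /rqlite/file/data
    volumeMounts:
      - name: config
        mountPath: /opt/rqlite/config
volumes:
  - name: config
    configMap:
      name: rqlite-config
```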

Edit: I believe I have this sorted now.

otoolep commented 10 months ago

Thanks for this. When it merges, I'll add some notes about it to rqlite.io.

jtackaberry commented 10 months ago

> Thanks for this. When it merges, I'll add some notes about it to rqlite.io.

Sounds great. I'll merge it later today or tomorrow after a bit more testing on different cloud providers.

jtackaberry commented 10 months ago

Verified basic deployment on EKS, LKE, and microk8s clusters. Looks good.

> Although I'd like to do a bit more testing with heavy mutation load during scale events to ensure readonly nodes rejoining the cluster don't serve up stale data from their previous membership.

This worked well. Ran a test where I:

  1. scaled the readonly pool up to 7 replicas
  2. set a numeric column in a table to 42
  3. scaled the readonly pool down to 1 replica
  4. began continuously looping updates to set that column to unixepoch("subsecond")
  5. had another client continuously pulling this value from the readonly K8s service endpoint with level=none
  6. scaled the readonly pool up to 4 replicas, then a couple of minutes later up to 7 replicas, reusing the same PVs from before with the older data
  7. verified no output from step 5 showed the stale value 42

I assume this shouldn't be a surprise to you but I just wanted to prove it to myself. :)

I'll update the chart version to 1.0.0 and merge to master in a bit.