Closed jtackaberry closed 10 months ago
@otoolep I also wanted to ask, is there a reason not to use -raft-cluster-remove-shutdown=true by default?
@otoolep I also wanted to ask, is there a reason not to use -raft-cluster-remove-shutdown=true by default?
There isn't a strong correlation between shutting down a node, and wanting to actually remove it from the cluster IMHO. Removing a node from a cluster changes the quorum requirements, and that's not something that should happen by default. It should only happen if someone explicitly requests this mode. I believe this is the way Consul works.
There isn't a strong correlation between shutting down a node, and wanting to actually remove it from the cluster IMHO.
Sorry, my question as-phrased was a dumb one. I neglected to mention an important detail:
I was thinking about this specifically in the context of read-only nodes. I was contemplating how Horizontal Pod Autoscaling (HPA) would work with the readonly pool once I implement it, because supporting autoscaling for readonly replicas would be a nice capability.
HPA works basically equivalently to kubectl scale statefulset/rqlite --replicas=<n>. When scaling up, readonly nodes will join the cluster just fine. When scaling down, per https://rqlite.io/docs/clustering/read-only-nodes/#handling-failure it will generate quite a lot of noise in the remaining nodes' logs (though AFAICT it's otherwise harmless?).
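To be concrete, this is the sort of thing I mean (a sketch only; the readonly StatefulSet name is a placeholder since I haven't implemented that pool yet):

```bash
# Manually scale a hypothetical readonly pool...
kubectl scale statefulset/rqlite-readonly --replicas=5

# ...or let HPA drive the same scale subresource automatically.
kubectl autoscale statefulset/rqlite-readonly --min=2 --max=10 --cpu-percent=80
```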
-raft-cluster-remove-shutdown=true would handle this for scale-down operations. Of course, it does mean, as you say, the readonly node will leave the cluster even on standard restarts. But is this a problem?
Although, in the context of voters, it does occur to me the chart should offer some means of implementing https://rqlite.io/docs/clustering/general-guidelines/#dealing-with-failure
More generally, if the user explicitly lowers the replicaCount chart value, it's probably not unreasonable to assume they do want to jettison those nodes. I could generate peers.json -- AFAICT everything needed to generate that file is known at chart deployment time (using FQDNs instead of IPs for address, of course) -- but I think this could be problematic in the context of readonly replicas and autoscaling them.
Once I get the readonly stuff implemented I'll experiment with that.
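For reference, what I'd be generating is something like the sketch below. The node IDs, DNS names, raft port, and target path are illustrative assumptions (the chart's actual naming may differ); the format follows the rqlite recovery docs linked above:

```bash
# Illustrative only: write a peers.json for a 2-voter cluster using FQDNs.
cat > /rqlite/file/data/raft/peers.json <<'EOF'
[
  {"id": "rqlite-0", "address": "rqlite-0.rqlite-headless.default.svc.cluster.local:4002", "non_voter": false},
  {"id": "rqlite-1", "address": "rqlite-1.rqlite-headless.default.svc.cluster.local:4002", "non_voter": false}
]
EOF
```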
Alternatively, as suggested by the docs, I relegate peers.json to "break glass in case of emergency" situations and the chart provides some other logic to support scaling voters down. This is trickier than it sounds, because it means inside the pod there needs to be a way for it to dynamically determine if it should leave the cluster on termination.

It's tricky but not impossible: I could have a custom preStop script that uses curl to call the K8s API to fetch the current number of replicas for the StatefulSet it belongs to, and if the current number is less than its node id, gracefully leave the cluster, otherwise don't. (Ok, I see curl isn't in the rqlite container, so maybe a sidecar that uses docker.io/curlimages/curl.) That's clever, but I'll need to chew on this for a bit to decide if it's too clever.
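To make it concrete, the preStop script in that sidecar would look roughly like this. Very much a sketch: the StatefulSet name, ports, crude JSON parsing, and the assumption that the rqlite node ID matches the pod hostname are all placeholders, and the remove request may need to be directed at the leader depending on rqlite version:

```bash
#!/bin/sh
# Sketch of a preStop hook for a pod being scaled away (run from a curl-capable sidecar).
# Derive this pod's ordinal from its hostname, e.g. "rqlite-3" -> 3.
ORDINAL="${HOSTNAME##*-}"

NS="$(cat /var/run/secrets/kubernetes.io/serviceaccount/namespace)"
TOKEN="$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
CACERT=/var/run/secrets/kubernetes.io/serviceaccount/ca.crt

# Ask the K8s API for the StatefulSet's *desired* replica count via the scale
# subresource. No jq in the curl image, so parse the compact JSON crudely.
DESIRED="$(curl -s --cacert "${CACERT}" -H "Authorization: Bearer ${TOKEN}" \
  "https://kubernetes.default.svc/apis/apps/v1/namespaces/${NS}/statefulsets/rqlite/scale" \
  | sed -n 's/.*"spec":{"replicas":\([0-9]*\)}.*/\1/p')"

# Ordinals are zero-based, so pod N only exists while replicas > N. If the
# desired count no longer covers this ordinal, the pod is going away for good,
# so ask rqlite to remove it from the cluster before termination.
if [ -n "${DESIRED}" ] && [ "${ORDINAL}" -ge "${DESIRED}" ]; then
  curl -s -XDELETE "http://localhost:4001/remove" -d "{\"id\": \"${HOSTNAME}\"}"
fi
```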
Last commit adds support for read-only nodes.
-raft-cluster-remove-shutdown=true would handle this for scale-down operations. Of course, it does mean, as you say, the readonly node will leave the cluster even on standard restarts. But is this a problem?
I included -raft-cluster-remove-shutdown=true for the readonly pods as I couldn't see any obvious downside. With cursory testing, dynamic scaling up and down has behaved flawlessly. Although I'd like to do a bit more testing with heavy mutation load during scale events to ensure readonly nodes rejoining the cluster don't serve up stale data from their previous membership.
That is to say, verify the readiness probe isn't reporting ready before the node completes syncing data from the leader. (I did check this hastily and it seemed ok but I wasn't driving any real load.) I know there's the freshness parameter but as I understand it that's more a knob to tune the window of eventual consistency in the face of network blips than it is to deal with a node rejoining a cluster after a day (or whatever).
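For clarity, the kind of read I have in mind is a none-consistency query with a freshness bound against a readonly node; the host and table below are placeholders:

```bash
# level=none read with a 1-second freshness bound against a readonly node.
curl -G 'http://rqlite-readonly:4001/db/query' \
  --data-urlencode 'level=none' \
  --data-urlencode 'freshness=1s' \
  --data-urlencode 'q=SELECT COUNT(*) FROM foo'
```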
The only thing left on my TODO list is for the chart to provide a mechanism for generating peers.json as a way to recover from permanent node loss, per the procedure documented at https://rqlite.io/docs/clustering/general-guidelines/#dealing-with-failure
The only thing left on my TODO list is for the chart to provide a mechanism for generating peers.json as a way to recover from permanent node loss
This is now done, via the useStaticPeers chart option.
I think this is all fairly baked at this point, so I'm removing draft status from the PR. I'll defer the backup functionality to post 1.0, because it's technically possible now with extraArgs.
Actually retagging as draft: it occurs to me that the procedure I documented for shrinking the number of voters has problems.
The procedure was to first use the API to remove the node, then stop the pods. The problem with this is it will take up to 20 seconds for the removed node(s) to fail their readiness probe, during which time they will still receive requests via the K8s Service, and return 503s.
For that to work, we'd need some way to first mark the nodes as soft-down, such that they fail the readiness probe and get removed from the K8s Service (and therefore stop taking new traffic), but continue to service in-flight transactions. Then, once removed from the K8s Service, they can be removed from the cluster.
So instead I guess the right order is to first shut down the nodes, then remove them. The main problems with this are:

1. …
2. …
3. I'm passing replicaCount to -bootstrap-expect, which means when the replicaCount changes, all nodes in the cluster roll. This is really suboptimal for a decom procedure where you really don't want the cluster churning six ways from Sunday.

I think 1 and 2 are tolerable. 3 requires a change to how I'm passing -bootstrap-expect. I can build on the trick I used for useStaticPeers (in which I use a ConfigMap value to trigger the behavior rather than changing the pod spec, where the latter case will trigger a rollout restart whereas the former won't).
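Roughly, the idea is to have the container's startup command read the value from a ConfigMap-mounted file rather than baking it into the pod spec. A minimal sketch, where the mount path is illustrative and the remaining rqlited arguments come from the chart as-is:

```bash
#!/bin/sh
# Read the expected cluster size from a file projected from a ConfigMap, so
# changing it edits the ConfigMap only (no pod-spec change, no rollout restart).
BOOTSTRAP_EXPECT="$(cat /opt/rqlite/config/bootstrap-expect)"
exec rqlited -bootstrap-expect "${BOOTSTRAP_EXPECT}" "$@"
```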
Edit: believe I have this sorted now.
Thanks for this. When it merges, I'll add some notes about it to rqlite.io.
Thanks for this. When it merges, I'll add some notes about it to rqlite.io.
Sounds great. I'll merge it later today or tomorrow after a bit more testing on different cloud providers.
Verified basic deployment on EKS, LKE, and microk8s clusters. Looks good.
Although I'd like to do a bit more testing with heavy mutation load during scale events to ensure readonly nodes rejoining the cluster don't serve up stale data from their previous membership.
This worked well. Ran a test where I drove continuous writes (inserting values like 42 and unixepoch("subsecond") timestamps) while scaling the readonly pool down and back up, and issued level=none reads against the readonly nodes to verify the expected values (e.g. 42) came back rather than stale data.
I assume this shouldn't be a surprise to you but I just wanted to prove it to myself. :)
I'll update the chart version to 1.0.0 and merge to master in a bit.
Still in development, but fairly comprehensive already. Submitting a draft PR to begin the review process.
Remaining work items:
- Backups (technically possible now with extraFiles and extraArgs, but I want to investigate higher level abstractions)

While going through active development, I decided to include the develop branch in the chart release action so that 0.x versions will get released, so that aspect can also be tested before moving to 1.0.0.

There's a lot here, especially in values.yaml, so I'd be grateful for any corrections or suggestions on best practices. The template files themselves are probably less important to review, but feel free. Helm (Go+sprig) template syntax can be a bit arcane (and not in the good way).