Open lesam opened 3 years ago
I wholeheartedly support this. :)
Another recent customer did not actually have a split brain. Instead, they had data corruption on two of their three meta nodes and needed to kill the bad nodes and refresh the cluster from the good one. The diagnosis is different, but the resolution is the same.
Diagnosis for the other issue (non-split-brain case):
One meta node appears to be good (it comes up successfully), but the other meta nodes have some problem that prevents them from coming up (bad machine environment, disk corruption, etc.), so the good node cannot assume leadership because it does not have enough votes.
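To make the "not enough votes" point concrete, here is a minimal sketch of the majority arithmetic, assuming the usual Raft-style rule that a leader needs votes from more than half of the configured members. The function names are made up for illustration and are not part of any InfluxDB tooling.

```python
def quorum(cluster_size: int) -> int:
    """Votes needed for a majority of the configured cluster members."""
    return cluster_size // 2 + 1


def can_elect_leader(cluster_size: int, healthy_nodes: int) -> bool:
    """True if the healthy nodes alone can reach a majority."""
    return healthy_nodes >= quorum(cluster_size)


print(quorum(3))               # 2
print(can_elect_leader(3, 1))  # False: one good node out of three is stuck
print(can_elect_leader(3, 2))  # True
```

With three configured meta nodes and only one healthy, the good node can never collect the two votes it needs, which is why it has to be brought up on its own before fresh nodes are re-joined.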
The solution when some meta nodes are 'bad' because they will not come up is the same as when some are 'bad' because of a split brain. The resolution steps are linked in the issue description, but basically: kill the bad nodes, make sure the good one(s) elect a leader, clean out the bad state, and re-join fresh nodes to the meta cluster to get the desired availability.
Also, when we formalize this doc, the step "Add the "-single-server" argument to that meta node temporarily" especially needs some explanation. For many customers this probably means overriding ExecStart in their systemd config, which can be tricky.
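As a rough sketch of what that ExecStart override could look like, here is a systemd drop-in file. The unit name (influxdb-meta.service), the binary path, and the config path are assumptions about a typical package install, so check them with `systemctl cat influxdb-meta` and copy the real ExecStart line from your own unit before adding the flag.

```ini
# /etc/systemd/system/influxdb-meta.service.d/override.conf
# Assumed: the packaged unit is influxdb-meta.service and currently runs
# "/usr/bin/influxd-meta -config /etc/influxdb/influxdb-meta.conf".
[Service]
# The empty ExecStart= clears the command inherited from the packaged unit.
# Without it, systemd refuses the unit for having more than one ExecStart.
ExecStart=
ExecStart=/usr/bin/influxd-meta -config /etc/influxdb/influxdb-meta.conf -single-server
```

Creating the drop-in with `systemctl edit influxdb-meta` reloads the unit definition automatically; if the file is written by hand, run `systemctl daemon-reload` before restarting the service. Since the flag is only meant to be temporary, remove the drop-in again (for example with `systemctl revert influxdb-meta`) and restart once the cluster is healthy.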
In InfluxDB Enterprise, we recently had an issue with a meta node split brain. This is an exceptional state that we believe is caused by incorrect steps while performing operational maintenance tasks on the meta cluster (e.g. incorrect procedures when migrating the meta nodes to new hosts).
Detailed steps for resolving it can be found here: https://github.com/influxdata/plutonium/issues/3677#issuecomment-957875104. It would be nice to publish these as a troubleshooting guide for InfluxDB Enterprise.