influxdata / docs-v2

Troubleshooting guide for meta node split brain #3340

Open lesam opened 3 years ago

lesam commented 3 years ago

In InfluxDB Enterprise, we recently had an issue with a meta node split brain. This is an exceptional state that we believe is caused by incorrect steps while performing operational maintenance tasks on the meta cluster (e.g. incorrect procedures when migrating the meta nodes to new hosts).

Detailed steps for resolving it can be found here: https://github.com/influxdata/plutonium/issues/3677#issuecomment-957875104 . It would be nice to publish these as a troubleshooting guide for InfluxDB Enterprise.

danatinflux commented 3 years ago

I wholeheartedly support this. :)

lesam commented 3 years ago

For another recent customer, the problem was not actually a split brain. Instead, they had data corruption on two of their three meta nodes and needed to kill the bad nodes and refresh the cluster from the good one. The diagnosis is different, but the resolution is the same.

lesam commented 3 years ago

Diagnosis for the other issue (non-split-brain case):

One meta node appears to be good (it comes up successfully), but the other meta nodes have some problem that prevents them from coming up (bad machine environment, disk corruption, etc.). As a result, the good node cannot assume leadership because it does not have enough votes.
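
To illustrate the "not enough votes" part: meta nodes use Raft consensus, where a leader needs votes from a strict majority of the configured members. Here is a minimal sketch of that arithmetic in Python. The function names are made up for illustration and are not part of InfluxDB.

```python
def votes_needed(cluster_size: int) -> int:
    """Strict majority of the configured meta nodes."""
    return cluster_size // 2 + 1


def can_elect_leader(cluster_size: int, healthy_nodes: int) -> bool:
    """True if the healthy nodes alone can form a majority."""
    return healthy_nodes >= votes_needed(cluster_size)


if __name__ == "__main__":
    # One good node out of three configured: 1 vote, 2 needed.
    print(can_elect_leader(3, 1))  # False
    # Two good nodes out of three: 2 votes, 2 needed.
    print(can_elect_leader(3, 2))  # True
```

So with three configured meta nodes and only one healthy, the healthy node keeps waiting for a second vote that never arrives.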

lesam commented 3 years ago

The solution when some meta nodes are good and some are 'bad' because they will not come up is the same as when some are 'bad' because of a split brain. The resolution steps are linked above, but basically: kill the bad nodes, make sure the good one(s) elect a leader, clean out the bad state, and re-join fresh nodes to the meta cluster to get the desired availability.

lesam commented 3 years ago

Also, when we formalize this doc, the step "Add the "-single-server" argument to that meta node temporarily" especially needs some explanation. For many customers this probably means overriding ExecStart in their systemd config, which can be tricky.
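
A rough sketch of what that override might look like, written as a small Python helper that renders a systemd drop-in. The tricky part is that a drop-in has to clear the existing ExecStart with an empty `ExecStart=` line before setting a new one. The binary and config paths below are assumptions for illustration only; copy the real command from the unit file on the host (for example via `systemctl cat` on the meta service).

```python
def render_override(exec_start: str, extra_arg: str = "-single-server") -> str:
    """Render a systemd drop-in that re-declares ExecStart with one extra argument."""
    return "\n".join([
        "[Service]",
        "ExecStart=",  # an empty assignment clears the ExecStart from the main unit
        f"ExecStart={exec_start} {extra_arg}",
        "",
    ])


if __name__ == "__main__":
    # Hypothetical original command: replace with the ExecStart from your own unit file.
    original = "/usr/bin/influxd-meta -config /etc/influxdb/influxdb-meta.conf"
    print(render_override(original))
```

The rendered text is what would go into the editor opened by `systemctl edit` for the meta service, followed by a restart of that service. Since the argument is only meant to be temporary, the drop-in should be removed again once the meta cluster is healthy.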