Closed zizon closed 2 years ago
Hi @zizon :wave:
I'm sorry to hear you went through an experience like this but, unfortunately, the Raft replication process worked as designed here. This scenario is similar to if you were to add a test database server to a production cluster with data replication enabled: your production data would be overwritten with test data.
In terms of how we can improve Nomad to avoid situations like this, preventing the agent region from changing when the node already has stored state could be a good safety check. I opened https://github.com/hashicorp/nomad/issues/14444 to track this enhancement.
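Until a check like that exists in the agent itself, an operator could approximate it in a deployment script. The following is only a minimal sketch of the idea, not how Nomad implements or plans to implement it. The function name `region_change_allowed` is hypothetical, and the location of the server's Raft state (`server/raft` under the agent's data directory) is an assumption that should be verified against your own installation.

```shell
#!/bin/sh
# Sketch of an operator-side guard against changing the region of a server
# that already holds Raft state.
#
# ASSUMPTION: the server keeps its Raft state under DATA_DIR/server/raft.
# Adjust the path if your layout differs.

# region_change_allowed DATA_DIR OLD_REGION NEW_REGION
# Returns 0 when it is safe to proceed, 1 when the region would change on a
# node that already has stored state.
region_change_allowed() {
  data_dir="$1"
  old_region="$2"
  new_region="$3"

  # No change requested, nothing to guard against.
  [ "$old_region" = "$new_region" ] && return 0

  raft_dir="$data_dir/server/raft"

  # A non-empty Raft directory means this node has stored state.
  if [ -d "$raft_dir" ] && [ -n "$(ls -A "$raft_dir" 2>/dev/null)" ]; then
    echo "refusing region change $old_region -> $new_region: state found in $raft_dir" >&2
    return 1
  fi

  return 0
}
```

A deployment script could call `region_change_allowed /opt/nomad/data A B || exit 1` before rewriting the agent configuration, where the data directory path and region names are placeholders.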
We can also improve our documentation to highlight that `region` (and potentially other) configuration fields should not be changed. Explaining how to use `nomad operator snapshot save` and `nomad operator snapshot restore` to back up data could be useful as well. https://github.com/hashicorp/nomad/pull/14443 adds this warning to `region`.
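As an illustration of the snapshot commands mentioned above, here is a small sketch that wraps them in two helper functions. The helper names `backup_state` and `restore_state` and the snapshot file naming are hypothetical. The sketch assumes the `nomad` binary is on the `PATH` and that `NOMAD_ADDR` (plus `NOMAD_TOKEN` if ACLs are enabled) already points at the intended cluster.

```shell
#!/bin/sh
# Sketch: take a snapshot of server state before touching server
# configuration, and restore it if something goes wrong.

# backup_state DIR
# Saves a timestamped snapshot into DIR and prints the snapshot path.
backup_state() {
  dir="$1"
  snap="$dir/nomad-$(date +%Y%m%dT%H%M%S).snap"

  mkdir -p "$dir" || return 1

  # Send the CLI's own output to stderr so that stdout carries only the path.
  nomad operator snapshot save "$snap" >&2 || return 1

  printf '%s\n' "$snap"
}

# restore_state FILE
# Restores cluster state from a previously saved snapshot file.
restore_state() {
  nomad operator snapshot restore "$1"
}
```

For example, `snap=$(backup_state /var/backups/nomad)` before a risky change, and `restore_state "$snap"` afterwards if state was lost. Restoring replaces the current server state with the snapshot's contents, so double-check which cluster the CLI is pointed at first.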
I linked back to this issue as you have provided great detail and information, but since this was not caused by a bug in Nomad I will go ahead and close it. Feel free to reach out if you have any other questions.
I agree that this is not a Raft bug, but there are bigger issues behind this design/usage flaw. I will go into more detail on #14444.
I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.
Nomad version
Output from `nomad version`
Nomad v1.2.6
Operating system and Environment details
Issue
Moving a Raft leader from region A to another region B, by changing the region field in the config file, causes region B's Raft state to be overwritten by A's, so all running state in region B is lost.
We encountered this severe disaster recently. As a brief summary, we had,
I will attach the associated logs for each at the bottom.
The root cause of this disaster involved a non-standard operation, but maybe we should revise the Raft restore procedure.
Reproduction steps
Expected Result
Actual Result
Job file (if appropriate)
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)