Prevent an agent from starting if it has data from another region

lgfa29 commented 2 years ago

Proposal

A Nomad region defines the data boundary of servers. Servers in the same region replicate their data among each other, with the leader being the source of truth, meaning that followers will always replace their own local state with the leader's state.

If a server starts in one region, but then its region configuration value is changed without removing the data from the previous region, this server may be elected the cluster leader and push its (incorrect) data onto the other servers, overwriting their local state.

Nomad should prevent an agent's region from changing if it doesn't match the local data. This could potentially be done in a few different ways. For example, Nomad could store the region value in ClusterMetadata when the cluster is initialized. This value would then be compared against the configuration value, and the agent would not start unless they match.
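
A minimal sketch of that comparison, written in Python for brevity rather than in Nomad's Go. The `ClusterMetadata` shape, its `region` field, and the `check_region` function are all hypothetical and not Nomad's actual API. The sketch also assumes that a missing stored region means the data directory has not recorded one yet, in which case there is nothing to compare.

```python
from dataclasses import dataclass
from typing import Optional


class RegionMismatchError(Exception):
    """Raised when the configured region differs from the one in local state."""


@dataclass
class ClusterMetadata:
    # Hypothetical stand-in for the metadata persisted with the server state.
    cluster_id: str
    region: Optional[str] = None  # None: no region was recorded


def check_region(stored: Optional[ClusterMetadata], configured_region: str) -> None:
    """Refuse to start when the local data belongs to a different region."""
    if stored is None or stored.region is None:
        # Fresh data directory, or state without a recorded region:
        # nothing to compare against, so startup may proceed.
        return
    if stored.region != configured_region:
        raise RegionMismatchError(
            f"local data belongs to region {stored.region!r}, "
            f"but the agent is configured for region {configured_region!r}"
        )


if __name__ == "__main__":
    check_region(ClusterMetadata("example-cluster-id", "global"), "global")
    try:
        check_region(ClusterMetadata("example-cluster-id", "us-east"), "global")
    except RegionMismatchError as err:
        print(f"refusing to start: {err}")
```

Failing at startup, before the server joins Raft, is the point of the check: a server that never starts cannot be elected leader and push the wrong region's state to its peers.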

Use-cases

Prevent accidental data loss in scenarios like the one described in https://github.com/hashicorp/nomad/issues/14429.

zizon commented 2 years ago

According to #14429, the fundamental flaw is that Raft starts without verifying its peers. Nomad uses Serf to discover Raft peers after it restores its state and starts Raft consensus.

It may be better to add or accept Raft peers only after verifying their configuration through the members discovered via Serf.
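
The idea of checking a discovered member's configuration before accepting it as a Raft peer could look roughly like the sketch below. It is written in Python for illustration only. The `Member` type, the `eligible_raft_peers` function, and the assumption that each discovered member advertises its region under a `region` tag are hypothetical, not a description of how Nomad or Serf behave today.

```python
from typing import Dict, List, NamedTuple


class Member(NamedTuple):
    # Hypothetical view of a member discovered through gossip.
    name: str
    tags: Dict[str, str]


def eligible_raft_peers(members: List[Member], local_region: str) -> List[Member]:
    """Keep only members whose advertised region matches the local one.

    Members with no region tag are rejected rather than trusted.
    """
    return [m for m in members if m.tags.get("region") == local_region]


if __name__ == "__main__":
    discovered = [
        Member("server-1.global", {"region": "global"}),
        Member("server-2.us-east", {"region": "us-east"}),
        Member("server-3", {}),
    ]
    for peer in eligible_raft_peers(discovered, "global"):
        print(peer.name)
```

A check like this only filters on what a member claims about itself, so it guards against misconfiguration rather than against a peer that lies about its region.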

If we do not verify that a peer is legitimate, this becomes an attack surface for Nomad, especially when RPC is not TLS-enabled, since anyone could mimic a Raft peer to trigger such an overwrite.

Coupling the verification with Serf, which has its own enforced token-based transport, would add a bit more trust, but protection would still be lacking.

Imagine a malicious user with job submission capability who manages to run a task on nodes where the token is exposed in plain text in config files. That user could still do various things a legitimate server can do. Even with TLS enabled, it may still be possible to obtain the keys this way.

lgfa29 commented 2 years ago

Hi @zizon 👋 ,

mTLS and Serf encryption are the core of Nomad's security model. A cluster that doesn't have them properly configured is exposed to several attack vectors and is, therefore, not considered production ready.

In your scenario, a server with bad data would only be able to join the cluster if it has the proper Serf encryption key and the Nomad server mTLS certificate setup. Exploits that require this level of cluster access are not part of our security model since an attacker with this information would have full access to the cluster and can cause a lot more damage.

But if you were able to cause data loss on a production-ready cluster where the agent joining did not have access to mTLS or Serf encryption, we would be very interested in learning more about it. Since this is a security concern please send any extra information to security@hashicorp.com.

Thanks!

hashicorp / nomad

Prevent an agent from starting if it has data from another region #14444