hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Restoring snapshot removes current Nomad Clients, and clients don't re-join the cluster automatically #20024

Closed: knanao closed this issue 8 months ago

knanao commented 8 months ago

Nomad version

Nomad v1.6.8 or above

Issue

Before v1.5.15, restoring a snapshot kept the current Nomad clients joined to the cluster. This meant that both the clients that existed when the snapshot was taken and the clients currently in the cluster were temporarily in the ready state. After failover_heartbeat_ttl (default 5m), the old clients transitioned to the down state and jobs were reallocated to the new clients. However, in v1.6.8, all new clients are removed when a snapshot is restored and are not re-registered to the cluster automatically. As a result, all jobs stayed pending after the old clients went down. This can be resolved by restarting the clients, but an unbalanced distribution of allocations across clients is then inevitable. I couldn't verify this on every version, but it reproduces on at least v1.6.8 and v1.7.5.

Reproduction steps

Expected Result

The new Nomad clients automatically join the restored cluster after the snapshot is restored, and all allocations are rescheduled onto them.

Actual Result

The new Nomad clients don't join the cluster, and the allocations are never rescheduled unless the Nomad clients are restarted.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

Feb 22 03:11:59 client-0 nomad[99209]:     2024-02-22T03:11:59.102Z [INFO]  client.consul: discovered following servers: servers=[192.168.2.8:4647, 192.168.2.5:4647, 192.168.2.4:4647]
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.278Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: rpc error: Permission denied" rpc=Node.UpdateStatus server=192.168.2.4:4647
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.278Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: rpc error: Permission denied" rpc=>
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.279Z [ERROR] client: error heartbeating. retrying: error="failed to update status: rpc error: rpc error: Permission denied" period=25.83388158s
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.280Z [DEBUG] client.consul: bootstrap contacting Consul DCs: consul_dcs=["dc1"]
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.283Z [INFO]  client.consul: discovered following servers: servers=[192.168.2.8:4647, 192.168.2.5:4647, 192.168.2.4:4647]
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.346Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: Permission denied" rpc=Node.GetClientAllocs server=192.168.2.8:4647
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.346Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: Permission denied" rpc=Node.GetCli>
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.346Z [ERROR] client: error querying node allocations: error="rpc error: Permission denied"
Feb 22 03:12:00 client-0 nomad[99209]:     2024-02-22T03:12:00.899Z [TRACE] consul.sync: execute sync: reason=periodic
tgross commented 8 months ago

Hi @knanao! Deleting the client datadir means that the client no longer has its node secret and has to recreate it from scratch. But the node ID potentially comes from a source of data on the host (a hash of the host ID), which means the client gets the same node ID with a different node secret. That leads to the permission errors.

You could get away with this prior to 1.6.0 because we didn't enforce the node secret as strongly as we should have. (See https://github.com/hashicorp/nomad/pull/16799)

tgross commented 8 months ago

Note that if you find yourself in this spot, you can purge the node via the Purge Node API: https://developer.hashicorp.com/nomad/api-docs/nodes#purge-node That makes the server forget about the client node; the client node can then be restarted and will safely rejoin with its new secret.
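A sketch of that recovery path, assuming the endpoint shape in the linked API docs and that NOMAD_ADDR / NOMAD_TOKEN are set in the environment (the node ID below is a placeholder):

```shell
# Find the stale node's ID.
nomad node status

# Purge the node through the HTTP API, then restart the Nomad client
# agent on that host so it re-registers with its new secret.
curl -X POST \
  -H "X-Nomad-Token: ${NOMAD_TOKEN}" \
  "${NOMAD_ADDR}/v1/node/f7476465-4d6e-c0de-26cf-a84c2f512274/purge"
```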