hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

keyring: replication tries to replicate rotated-away keys #19367

Closed tgross closed 1 day ago

tgross commented 7 months ago

In https://github.com/hashicorp/nomad/issues/19340 @sbihel reported a behavior where the followers would try to replicate keys that had been previously rotated out, and this would fail:

[WARN] nomad.keyring.replicator: failed to fetch key from current leader, trying peers: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2 error=
[ERROR] nomad.keyring.replicator: failed to fetch key from any peer: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2 error="rpc error: no such key \"128ba7c1-baa0-3bc6-c20f-833b97a1fbe2\" in keyring"
[ERROR] nomad.keyring.replicator: failed to fetch key from any peer: rpc error: no such key "128ba7c1-baa0-3bc6-c20f-833b97a1fbe2" in keyring: key=128ba7c1-baa0-3bc6-c20f-833b97a1fbe2

#19340 covered another critical bug and was automatically closed once the fix for that bug was merged. This issue is a follow-up.

tgross commented 7 months ago

The specific error here occurs when the server we're replicating the key from tries to read the key material from its own keyring. That key material is no longer present, so replication can't succeed. That's not an unexpected scenario by itself: we have to handle it when we bootstrap the keyring from one server to all the other servers, where some servers may get replication requests for keys they don't yet have.

But for what is effectively an "orphaned" key, we're in a messy spot. We can't guarantee that the key is safe to remove from the metadata, because the operator may have had a bad recovery process and may still need to restore the on-disk keyring to the servers. As a workaround, the operator can remove the key via nomad operator root keyring remove if they know it's truly orphaned. But fixing https://github.com/hashicorp/nomad/issues/19368 seems like an important step toward fixing this issue.

tgross commented 6 months ago

Ref https://github.com/hashicorp/nomad/issues/19669

tgross commented 1 day ago

I've done some testing and I believe this will be resolved by the work done in https://github.com/hashicorp/nomad/pull/23577. I'm going to close this issue out.