hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

client RPC fails to validate new ACL token #17834

Open tgross opened 1 year ago

tgross commented 1 year ago

We had an internal report of a problem where the Restart Allocation API would sporadically fail with the error "Unexpected response code: 403 (ACL token not found)". After some investigation, we learned that the ACL token was obtained from Vault's Nomad secrets engine shortly before the API call.

The Restart Allocation API is a "client RPC" that gets forwarded from the server to the client for execution locally. The handlers for these RPCs resolve the auth token as either an ACL token or a Workload Identity in resolveTokenValue (ref client/acl.go#L139-L178). That function checks a local cache first and, if the token isn't there (or the cache is expired), calls the ACL.WhoAmI RPC so the server can validate the token. This RPC is made with AllowStale = true, which means it can be served by followers without being forwarded to the leader. That was done to reduce leader load way back in https://github.com/hashicorp/nomad/commit/e9790c63b41428f3912838f9ff216d5f4307f7c6 which shipped in Nomad 0.6.3.

This appears to create a narrow race condition:

The only way this sequence of events seems plausible is if server B has just joined the cluster and is serving RPCs before it has completed its snapshot (see also https://github.com/hashicorp/nomad/issues/15560), or if one server is lagging far behind in replication.

The temporary workaround for our internal user is to ensure that the token written by Vault has had time to propagate before using it; the workload is being automated and is fairly aggressive about issuing a new token and then immediately using it. We're also asking the internal user to verify that the ACL token hasn't been revoked by Vault too early and check with the infra team that replication is healthy.

ncode commented 10 months ago

Adding to the report here. We've also experienced this problem with a Nomad global token issued from the primary DC via Vault; in this case we need to wait about 60 seconds for the token to replicate overseas, for example. In my case this could be expected behavior due to the latency involved, but I couldn't find any documentation about it.