hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Invalid Vault token (403) in a Nomad client after recycling Nomad servers #24256

Open adrianlop opened 1 week ago

adrianlop commented 1 week ago

Nomad version

v1.5.15+ent

Operating system and Environment details

Ubuntu 22.04 - AWS EC2 instances

Issue

It looks like we've hit a bug where a nomad client starts receiving 403s from Vault while we're in the middle of recycling the nomad servers (3-node cluster: we spin up 3 new servers, then slowly shut the old ones down one by one). This has happened twice already in our Production systems recently.
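
For anyone trying to reproduce this, a rough Go sketch (not part of the original report) that uses the Nomad API client (github.com/hashicorp/nomad/api) to watch server membership and Raft peers while the roll is in progress; the polling interval is arbitrary and `api.DefaultConfig()` reads NOMAD_ADDR/NOMAD_TOKEN from the environment:

```go
// watch_servers.go: poll Nomad server membership and Raft peers while the
// servers are being recycled. Illustrative only.
package main

import (
	"fmt"
	"log"
	"time"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	for {
		// Serf membership as the agent sees it (old and new servers mixed).
		members, err := client.Agent().Members()
		if err != nil {
			log.Printf("members: %v", err)
		} else {
			for _, m := range members.Members {
				fmt.Printf("member %s %s:%d status=%s\n", m.Name, m.Addr, m.Port, m.Status)
			}
		}

		// Raft configuration: which servers are voters and which one leads.
		raft, err := client.Operator().RaftGetConfiguration(nil)
		if err != nil {
			log.Printf("raft: %v", err)
		} else {
			for _, s := range raft.Servers {
				fmt.Printf("raft %s %s leader=%v voter=%v\n", s.Node, s.Address, s.Leader, s.Voter)
			}
		}

		time.Sleep(10 * time.Second)
	}
}
```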

Reproduction steps

clients:

client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" rpc=Node.UpdateStatus server=10.181.3.215:4647

client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647

client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647

client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" period=1.617448288s

client.consul: discovered following servers: servers=[10.181.3.134:4647, 10.181.3.215:4647, 10.181.2.84:4647, 10.181.1.241:4647, 10.181.1.177:4647, 10.181.2.12:4647]

client: missed heartbeat: req_latency=21.613428ms heartbeat_ttl=16.683772489s since_last_heartbeat=26.713400803s

agent: (view) vault.read(secret/data/service/xx/yy): vault.read(secret/data/service/xx/yy): Error making API request.

URL: GET https://vault.service.consul:8200/v1/secret/data/service/xx/yy
Code: 403. Errors:

servers:

[Oct 09, 2024 at 9:43:47.731 pm] nomad.autopilot: Promoting server: id=a0498eba-bc93-76d2-be12-5477c3db9dfe address=10.181.3.215:4647 name=nomad-server-10-181-3-215.global

[Oct 09, 2024 at 9:43:56.227 pm] nomad.heartbeat: node TTL expired: node_id=a05735cd-8fa4-28bf-99cf-d160f6f73922

Expected Result

Client should continue to operate normally when rolling nomad servers

Actual Result

Client is interrupted and receives 403s from Vault

tgross commented 1 week ago

Summarizing our internal discussion so far:

adrianlop commented 3 days ago

noticed that Nomad also reaches out to newly created Vault servers when they are still joining the cluster and aren't ready for requests:

agent: (view) vault.read(secret/data/service/xx/yy): vault.read(secret/data/service/xx/yy): Get "https://10.181.2.119:8200/v1/secret/data/service/xx/yy": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.181.2.119 (retry attempt 1 after "250ms")

tgross commented 3 days ago

noticed that Nomad also reaches out to newly created Vault servers when they are still joining the cluster and aren't ready for requests

Can you clarify what "they" is here? Are you saying Nomad clients/servers(?) aren't ready for requests or the Vault servers aren't ready for requests?

adrianlop commented 3 days ago

sorry Tim, I shouldn't have mentioned Vault here, it's adding confusion.

what we noticed is that Nomad servers (and this is happening to our Vault servers too, but that's a different issue) will join Consul and report themselves as healthy even though they're not yet ready to serve requests.

the nomad-clients are able to talk to this new server (via nomad.service.consul, which will include the newly created node) before it has finished initializing.
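
To make that concrete, a small Go sketch using the Consul API client shows roughly what the clients resolve via nomad.service.consul (assuming the default "nomad" service name and a local Consul agent); a freshly booted server shows up here as soon as its Consul checks pass, whether or not it has finished initializing:

```go
// list_nomad_servers.go: show which Nomad servers Consul currently advertises
// as passing behind nomad.service.consul. Illustrative only.
package main

import (
	"fmt"
	"log"

	capi "github.com/hashicorp/consul/api"
)

func main() {
	client, err := capi.NewClient(capi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// passingOnly=true mirrors what the DNS interface (nomad.service.consul) returns.
	entries, _, err := client.Health().Service("nomad", "", true, nil)
	if err != nil {
		log.Fatal(err)
	}
	for _, e := range entries {
		fmt.Printf("%s %s:%d checks=%d\n",
			e.Node.Node, e.Node.Address, e.Service.Port, len(e.Checks))
	}
}
```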

does this make sense @tgross?

tgross commented 2 days ago

what we noticed is that Nomad servers (and this is happening to our Vault servers too, but that's a different issue) will join Consul and report themselves as healthy even though they're not yet ready to serve requests.

Yes, that's in the same general category of problem as https://github.com/hashicorp/nomad/issues/15560 and https://github.com/hashicorp/nomad/issues/18267.

adrianlop commented 1 day ago

@tgross is there anything else we can do externally to avoid issues? I was trying to implement a workaround by adding a consul check for the nomad local agent and having it marked as critical for a few iterations (so that clients can't reach newly joined servers), but I just found out that success_before_passing in Consul doesn't actually work the way it is expected to (see https://github.com/hashicorp/consul/issues/10864)
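
For reference, that workaround would look roughly like this through the Consul API from Go; the service ID, port, and thresholds below are placeholders, and (per the linked issue) success_before_passing may not gate the way you'd hope:

```go
// register_gated_check.go: register a Nomad server service whose check must
// pass several times before Consul marks it passing. Illustrative only; the
// ID, port, and thresholds are placeholders, and per hashicorp/consul#10864
// SuccessBeforePassing may not behave the way you'd expect.
package main

import (
	"log"

	capi "github.com/hashicorp/consul/api"
)

func main() {
	client, err := capi.NewClient(capi.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	reg := &capi.AgentServiceRegistration{
		ID:   "nomad-server-gated",
		Name: "nomad",
		Port: 4647,
		Check: &capi.AgentServiceCheck{
			Name:     "nomad server RPC reachable",
			TCP:      "127.0.0.1:4647",
			Interval: "10s",
			Timeout:  "2s",
			// Stay critical until the check has passed 6 times (~1 minute),
			// so a freshly booted server isn't advertised immediately.
			SuccessBeforePassing: 6,
		},
	}
	if err := client.Agent().ServiceRegister(reg); err != nil {
		log.Fatal(err)
	}
}
```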

tgross commented 1 day ago

Unfortunately, even if you could get the Consul health check to work as you expect, that wouldn't help here. Consul is only used for discovery on client start, or if the client somehow loses all servers and has to start over. Once a client is connected to the cluster, it gets the list of servers from the servers' heartbeat responses, not from Consul. That list consists of the local Raft peers. The client periodically reshuffles its copy of the list (every 5m) to spread load.
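
A toy sketch of that client-side behavior (not Nomad's actual code), just to make the mechanism concrete: the list is replaced from whatever the heartbeat response reports and reshuffled on a timer, so Consul never re-enters the picture while the client stays connected:

```go
// serverlist.go: toy model of how a connected client tracks servers.
// Illustration of the mechanism described above, not Nomad's real code.
package main

import (
	"math/rand"
	"sync"
	"time"
)

type serverList struct {
	mu      sync.Mutex
	servers []string // "host:port" of the local Raft peers
}

// setFromHeartbeat replaces the list with whatever the servers reported in
// the latest heartbeat response.
func (l *serverList) setFromHeartbeat(peers []string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.servers = append([]string(nil), peers...)
}

// shuffle rebalances which server the client tries first.
func (l *serverList) shuffle() {
	l.mu.Lock()
	defer l.mu.Unlock()
	rand.Shuffle(len(l.servers), func(i, j int) {
		l.servers[i], l.servers[j] = l.servers[j], l.servers[i]
	})
}

func main() {
	list := &serverList{}
	// Pretend a heartbeat response listed the current Raft peers.
	list.setFromHeartbeat([]string{"10.181.3.215:4647", "10.181.2.84:4647", "10.181.1.241:4647"})

	// Reshuffle periodically (Nomad uses roughly a 5-minute interval).
	ticker := time.NewTicker(5 * time.Minute)
	defer ticker.Stop()
	for range ticker.C {
		list.shuffle()
	}
}
```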

Something that comes to mind in terms of fixing this, which might be smaller in scope than reworking server bring-up, is to have the list of servers we return to the client be not just the local peers but only those that autopilot says are ready. That'd need some investigation to verify feasibility.
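
Very roughly, the idea is something like the following when building the heartbeat response; isServerReady is a hypothetical stand-in for whatever readiness signal autopilot could expose, not an existing API:

```go
// Hypothetical sketch of filtering the peer list returned to clients so that
// only autopilot-ready servers are advertised. None of this is real Nomad code.
package main

import "fmt"

type peer struct {
	ID      string
	Address string
}

// isServerReady is a placeholder for an autopilot-backed readiness check.
func isServerReady(p peer) bool {
	// e.g. healthy, caught up on Raft, finished initial setup, etc.
	return true
}

// heartbeatServers returns the subset of Raft peers a client should be told
// about in its heartbeat response.
func heartbeatServers(peers []peer) []peer {
	ready := make([]peer, 0, len(peers))
	for _, p := range peers {
		if isServerReady(p) {
			ready = append(ready, p)
		}
	}
	return ready
}

func main() {
	peers := []peer{
		{ID: "a0498eba", Address: "10.181.3.215:4647"},
		{ID: "b1b2c3d4", Address: "10.181.2.84:4647"},
	}
	fmt.Println(heartbeatServers(peers))
}
```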

But in any case, short of net-splitting the new servers when they come up, no, there's no workaround currently. Using Workload Identity will help specifically for Vault, because then we don't go to the server for Vault tokens, but it doesn't help the general problem. This overall issue is a problem for all the Raft-based HashiCorp products, as it turns out, but Nomad is probably impacted the worst because of how much the client gets canonical status from the servers.