Open adrianlop opened 1 month ago
Summarizing our internal discussion so far:

- The client started talking to the new server (10.181.3.215), probably before it was ready to serve requests. That general problem is described in https://github.com/hashicorp/nomad/issues/15560 (and linked issues from there).
- It's not clear why it started talking to the new server. That could just have been a network glitch, but I'd expect to see that in the logs.
- The Vault 403s are coming from the template block (the agent: (view) prefix on the log lines is the clue there).

We noticed that Nomad also reaches out to newly created Vault servers when they are still joining the cluster and aren't ready for requests:
agent: (view) vault.read(secret/data/service/xx/yy): vault.read(secret/data/service/xx/yy): Get "https://10.181.2.119:8200/v1/secret/data/service/xx/yy": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.181.2.119 (retry attempt 1 after "250ms")
noticed that Nomad also reaches out to newly created Vault servers when they are still joining the cluster and aren't ready for requests
Can you clarify what "they" is here? Are you saying Nomad clients/servers(?) aren't ready for requests or the Vault servers aren't ready for requests?
sorry Tim, I shouldn't have mentioned Vault here, it's adding confusion.
what we noticed is that Nomad servers (and this is happening to our Vault servers too, but that's a different issue) will join Consul and report themselves as healthy even though they're not yet ready to serve requests.
The Nomad clients are then able to talk to this new server (via nomad.service.consul, which will include the newly created node) before it has finished initializing.
does this make sense @tgross?
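For illustration, here is a minimal sketch of how one could see that behavior from Consul's side: a passing-only health query for the nomad service will already return a freshly joined server. The service name "nomad" and the default local agent address are assumptions here, not details taken from the setup above.

```go
package main

import (
	"fmt"
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default http://127.0.0.1:8500).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Ask for instances of the "nomad" service whose checks are all passing.
	// Per the report above, a server that has just joined Consul can already
	// show up here even though it hasn't finished initializing as a Nomad server.
	entries, _, err := client.Health().Service("nomad", "", true, nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, e := range entries {
		fmt.Printf("%s %s:%d\n", e.Node.Node, e.Service.Address, e.Service.Port)
	}
}
```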
what we noticed is that Nomad servers (and this is happening to our Vault servers too, but that's a different issue) will join Consul and report themselves as healthy even though they're not yet ready to serve requests.
Yes, that's in the same general category of problem as https://github.com/hashicorp/nomad/issues/15560 and https://github.com/hashicorp/nomad/issues/18267.
@tgross is there anything else we can do externally to avoid issues?
I was trying to implement a workaround by adding a Consul check for the local Nomad agent and having it marked as critical for a few iterations (so that clients can't reach newly joined servers), but I just found out that success_before_passing
in Consul doesn't actually work the way it is expected to (see https://github.com/hashicorp/consul/issues/10864)
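For reference, the attempted workaround would look roughly like this with the Consul Go API client: a node-level check against the local Nomad agent that starts critical and only turns passing after several consecutive successes. The check name, interval, and health endpoint are illustrative assumptions, and as the Consul issue linked above notes, success_before_passing may not behave as documented.

```go
package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Node-level check against the local Nomad agent's health endpoint.
	// Because it is registered without a ServiceID, a critical status here
	// also drops the node's services out of passing-only queries such as
	// nomad.service.consul.
	err = client.Agent().CheckRegister(&consul.AgentCheckRegistration{
		ID:   "nomad-server-ready",
		Name: "Nomad server ready",
		AgentServiceCheck: consul.AgentServiceCheck{
			HTTP:     "http://127.0.0.1:4646/v1/agent/health",
			Interval: "10s",
			Timeout:  "2s",
			// Stay critical until several consecutive passing results; this is
			// the field that hashicorp/consul#10864 reports as not behaving as
			// documented.
			SuccessBeforePassing: 3,
			// Be explicit about starting in the critical state.
			Status: consul.HealthCritical,
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```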
Unfortunately, even if you could get the Consul health check to work as you expect, that wouldn't help here. Consul is only used for discovery on client start, or if the client loses all servers somehow and has to start over. Once a client is connected to the cluster, it gets the list of servers from the servers' responses to heartbeats, not from Consul. That list consists of the local Raft peers. The client periodically reshuffles its copy of the list (every 5m) to spread load.
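As a side note, the server list a client is currently using can be inspected (and overridden) through the client agent's /v1/agent/servers endpoint, which is handy when debugging which servers a client is pointed at. A small sketch with the Nomad Go API client, assuming it is run against a client agent on the default address:

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	// Talks to the agent on NOMAD_ADDR, or http://127.0.0.1:4646 by default.
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// On a client agent this returns the server addresses the client is
	// currently using (the list it received via heartbeats), not what is
	// registered in Consul.
	servers, err := client.Agent().Servers()
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range servers {
		fmt.Println(s)
	}
}
```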
Something that comes to mind in terms of fixing this, which might be smaller in scope than reworking server bring-up, is to have the list of servers we return to the client be not just the local peers but those that autopilot says are ready. That'd need some investigation to verify feasibility.
But in any case, short of net-splitting the new servers when they come up, no, there's no workaround currently. Using Workload Identity will help specifically for Vault, because then we don't go to the server for Vault tokens, but it doesn't help the general problem. This overall issue is a problem with all the Raft-based HashiCorp products, as it turns out, but Nomad is probably impacted the worst because of how much the client gets canonical status from the servers.
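One operational mitigation when recycling servers, sketched below under the assumption that the roll is driven by an external script, is to gate each old-server shutdown on the autopilot health endpoint reporting the whole cluster healthy. As noted above, this does not close the window where a client talks to a not-yet-ready server, but it avoids removing old servers while the new ones are still unhealthy at the Raft level.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Minimal view of the /v1/operator/autopilot/health response; only the fields
// needed for this gate are decoded.
type autopilotHealth struct {
	Healthy          bool
	FailureTolerance int
	Servers          []struct {
		Name    string
		Healthy bool
		Voter   bool
	}
}

func main() {
	// Address of any Nomad server's HTTP API; adjust for your environment.
	const url = "http://127.0.0.1:4646/v1/operator/autopilot/health"

	for {
		resp, err := http.Get(url)
		if err != nil {
			log.Fatal(err)
		}
		var health autopilotHealth
		if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()

		if health.Healthy {
			for _, s := range health.Servers {
				fmt.Printf("%s healthy=%v voter=%v\n", s.Name, s.Healthy, s.Voter)
			}
			break
		}
		log.Printf("cluster not yet healthy (failure tolerance %d), waiting...", health.FailureTolerance)
		time.Sleep(10 * time.Second)
	}
}
```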
Nomad version
v1.5.15+ent
Operating system and Environment details
Ubuntu 22.04 - AWS EC2 instances
Issue
It looks like we've hit a bug where a nomad client starts receiving 403s from Vault when we're in the middle of recycling the nomad servers (3 node cluster -> we spin up 3 new servers, and then slowly shut the old ones down one by one). This has happened twice already in our Production systems recently.
Reproduction steps
Recycle the Nomad servers as described above (spin up 3 new servers, then shut the old ones down one by one). While this is happening, the client logs show:
client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" rpc=Node.UpdateStatus server=10.181.3.215:4647
client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647
client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647
client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" period=1.617448288s
client.consul: discovered following servers: servers=[10.181.3.134:4647, 10.181.3.215:4647, 10.181.2.84:4647, 10.181.1.241:4647, 10.181.1.177:4647, 10.181.2.12:4647]
client: missed heartbeat: req_latency=21.613428ms heartbeat_ttl=16.683772489s since_last_heartbeat=26.713400803s
agent: (view) vault.read(secret/data/service/xx/yy): vault.read(secret/data/service/xx/yy): Error making API request.
URL: GET https://vault.service.consul:8200/v1/secret/data/service/xx/yy
Code: 403. Errors:
Observations:
- I don't think the "Promoting server" message means a leader election, since the rest of the logs indicate that another node acquires leadership later in the recycling process (5 minutes later).
- After that, the client is rejected by Vault with 403s for all requests for 8+ minutes (so, even after the re-election has happened).
- The new servers finish registering in Consul.
- After the 3 old servers have left the cluster, the client no longer receives 403s from Vault.
Expected Result
Client should continue to operate normally when rolling nomad servers
Actual Result
Client is interrupted and receives 403s from Vault