Open adrianlop opened 1 month ago
Summarizing our internal discussion so far:

- The client started talking to the new server (10.181.3.215), probably before it was ready to serve requests. That general problem is described in https://github.com/hashicorp/nomad/issues/15560 (and linked issues from there).
- It's not clear why it started talking to the new server. That could just have been a network glitch, but I'd expect to see that in the logs.
- The Vault 403s are coming from the template block (the agent: (view) prefix on the log lines is the clue there).

We noticed that Nomad also reaches out to newly created Vault servers when they are still joining the cluster and aren't ready for requests:
agent: (view) vault.read(secret/data/service/xx/yy): vault.read(secret/data/service/xx/yy): Get "https://10.181.2.119:8200/v1/secret/data/service/xx/yy": tls: failed to verify certificate: x509: certificate is valid for 127.0.0.1, not 10.181.2.119 (retry attempt 1 after "250ms")
noticed that Nomad also reaches out to newly created Vault servers when they are still joining the cluster and aren't ready for requests
Can you clarify what "they" is here? Are you saying Nomad clients/servers(?) aren't ready for requests or the Vault servers aren't ready for requests?
sorry Tim, I shouldn't have mentioned Vault here, it's adding confusion.
what we noticed is that Nomad servers (and this is happening to our Vault servers too, but that's a different issue) will join Consul and report themselves as healthy even though they're not yet ready to serve requests.
The Nomad clients are then able to talk to this new server (via nomad.service.consul, which will include the newly created node) before it has finished initializing.
does this make sense @tgross?
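For illustration, here is a minimal sketch of how one could see that behavior from Consul's side: a passing-only health query for the nomad service will already return a freshly joined server. The service name "nomad" and the default local agent address are assumptions here, not details taken from the setup above.

```go
package main

import (
	"fmt"
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	// Connect to the local Consul agent (default http://127.0.0.1:8500).
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Ask for instances of the "nomad" service whose checks are all passing.
	// Per the report above, a server that has just joined Consul can already
	// show up here even though it hasn't finished initializing as a Nomad server.
	entries, _, err := client.Health().Service("nomad", "", true, nil)
	if err != nil {
		log.Fatal(err)
	}

	for _, e := range entries {
		fmt.Printf("%s %s:%d\n", e.Node.Node, e.Service.Address, e.Service.Port)
	}
}
```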
what we noticed is that Nomad servers (and this is happening to our Vault servers too, but that's a different issue) will join Consul and report themselves as healthy even though they're not yet ready to serve requests.
Yes, that's in the same general category of problem as https://github.com/hashicorp/nomad/issues/15560 and https://github.com/hashicorp/nomad/issues/18267.
@tgross is there anything else we can do externally to avoid issues?
I was trying to implement a workaround by adding a Consul check for the local Nomad agent and having it marked as critical for a few iterations (so that clients can't reach newly joined servers), but I just found out that success_before_passing
in Consul doesn't actually work the way it is expected to (see https://github.com/hashicorp/consul/issues/10864)
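For reference, the attempted workaround would look roughly like this with the Consul Go API client: a node-level check against the local Nomad agent that starts critical and only turns passing after several consecutive successes. The check name, interval, and health endpoint are illustrative assumptions, and as the Consul issue linked above notes, success_before_passing may not behave as documented.

```go
package main

import (
	"log"

	consul "github.com/hashicorp/consul/api"
)

func main() {
	client, err := consul.NewClient(consul.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Node-level check against the local Nomad agent's health endpoint.
	// Because it is registered without a ServiceID, a critical status here
	// also drops the node's services out of passing-only queries such as
	// nomad.service.consul.
	err = client.Agent().CheckRegister(&consul.AgentCheckRegistration{
		ID:   "nomad-server-ready",
		Name: "Nomad server ready",
		AgentServiceCheck: consul.AgentServiceCheck{
			HTTP:     "http://127.0.0.1:4646/v1/agent/health",
			Interval: "10s",
			Timeout:  "2s",
			// Stay critical until several consecutive passing results; this is
			// the field that hashicorp/consul#10864 reports as not behaving as
			// documented.
			SuccessBeforePassing: 3,
			// Be explicit about starting in the critical state.
			Status: consul.HealthCritical,
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```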
Unfortunately, even if you could get the Consul health check to work as you expect, that wouldn't help here. Consul is only used for discovery on client start, or if the client loses all servers somehow and has to start over. Once a client is connected to the cluster, it gets the list of servers from the servers' responses to heartbeats, not from Consul. That list consists of the local Raft peers. The client periodically reshuffles its copy of the list (every 5m) to spread load.
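As a side note, the server list a client is currently using can be inspected (and overridden) through the client agent's /v1/agent/servers endpoint, which is handy when debugging which servers a client is pointed at. A small sketch with the Nomad Go API client, assuming it is run against a client agent on the default address:

```go
package main

import (
	"fmt"
	"log"

	nomad "github.com/hashicorp/nomad/api"
)

func main() {
	// Talks to the agent on NOMAD_ADDR, or http://127.0.0.1:4646 by default.
	client, err := nomad.NewClient(nomad.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// On a client agent this returns the server addresses the client is
	// currently using (the list it received via heartbeats), not what is
	// registered in Consul.
	servers, err := client.Agent().Servers()
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range servers {
		fmt.Println(s)
	}
}
```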
Something that comes to mind in terms of fixing this, which might be smaller in scope than reworking server bring-up, is to have the list of servers we return to the client be not just the local peers but those that autopilot says are ready. That'd need some investigation to verify feasibility.
But in any case, short of net-splitting the new servers when they come up, no, there's no workaround currently. Using Workload Identity will help specifically for Vault, because then we don't go to the server for Vault tokens, but it doesn't help the general problem. This overall issue is a problem with all the Raft-based HashiCorp products, as it turns out, but Nomad is probably impacted the worst because of how much the client gets canonical status from the servers.
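One operational mitigation when recycling servers, sketched below under the assumption that the roll is driven by an external script, is to gate each old-server shutdown on the autopilot health endpoint reporting the whole cluster healthy. As noted above, this does not close the window where a client talks to a not-yet-ready server, but it avoids removing old servers while the new ones are still unhealthy at the Raft level.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
	"time"
)

// Minimal view of the /v1/operator/autopilot/health response; only the fields
// needed for this gate are decoded.
type autopilotHealth struct {
	Healthy          bool
	FailureTolerance int
	Servers          []struct {
		Name    string
		Healthy bool
		Voter   bool
	}
}

func main() {
	// Address of any Nomad server's HTTP API; adjust for your environment.
	const url = "http://127.0.0.1:4646/v1/operator/autopilot/health"

	for {
		resp, err := http.Get(url)
		if err != nil {
			log.Fatal(err)
		}
		var health autopilotHealth
		if err := json.NewDecoder(resp.Body).Decode(&health); err != nil {
			log.Fatal(err)
		}
		resp.Body.Close()

		if health.Healthy {
			for _, s := range health.Servers {
				fmt.Printf("%s healthy=%v voter=%v\n", s.Name, s.Healthy, s.Voter)
			}
			break
		}
		log.Printf("cluster not yet healthy (failure tolerance %d), waiting...", health.FailureTolerance)
		time.Sleep(10 * time.Second)
	}
}
```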
Nomad version
v1.5.15+ent
Operating system and Environment details
Ubuntu 22.04 - AWS EC2 instances
Issue
It looks like we've hit a bug where a nomad client starts receiving 403s from Vault when we're in the middle of recycling the nomad servers (3 node cluster -> we spin up 3 new servers, and then slowly shut the old ones down one by one). This has happened twice already in our Production systems recently.
Reproduction steps
Recycle the Nomad servers as described above (spin up 3 new servers, then shut the old ones down one by one). While this is happening, the client logs show:
client.rpc: error performing RPC to server: error="rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" rpc=Node.UpdateStatus server=10.181.3.215:4647
client.rpc: error performing RPC to server: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647
client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: failed to get conn: rpc error: lead thread didn't get connection" rpc=Node.GetClientAllocs server=10.181.3.215:4647
client: error heartbeating. retrying: error="failed to update status: rpc error: failed to get conn: dial tcp 10.181.3.215:4647: i/o timeout" period=1.617448288s
client.consul: discovered following servers: servers=[10.181.3.134:4647, 10.181.3.215:4647, 10.181.2.84:4647, 10.181.1.241:4647, 10.181.1.177:4647, 10.181.2.12:4647]
client: missed heartbeat: req_latency=21.613428ms heartbeat_ttl=16.683772489s since_last_heartbeat=26.713400803s
agent: (view) vault.read(secret/data/service/xx/yy): vault.read(secret/data/service/xx/yy): Error making API request.
URL: GET https://vault.service.consul:8200/v1/secret/data/service/xx/yy
Code: 403. Errors:
Observations:
- I don't think the "Promoting server" message means a leader election, since the rest of the logs indicate that another node acquires leadership later in the recycling process (5 minutes later).
- After that, the client is rejected by Vault with 403s for all requests for 8+ minutes (so, even after the re-election has happened).
- The new servers finish registering in Consul.
- After the 3 old servers have left the cluster, the client no longer receives 403s from Vault.
Expected Result
Client should continue to operate normally when rolling nomad servers
Actual Result
Client is interrupted and receives 403s from Vault