hashicorp / nomad

Nomad is an easy-to-use, flexible, and performant workload orchestrator that can deploy a mix of microservice, batch, containerized, and non-containerized applications. Nomad is easy to operate and scale and has native Consul and Vault integrations.
https://www.nomadproject.io/

Problems with connection to Nomad master (non-leader) causes all allocations restart #17974

Open beninghton opened 1 year ago

beninghton commented 1 year ago

Nomad version

Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision https://github.com/hashicorp/nomad/commit/8af70885c02ab921dedbdf6bc406a1e886866f80

Cluster structure

3 master nodes:
  10.1.15.21 (leader)
  10.1.15.22
  10.1.15.23
2 client nodes:
  10.1.15.31
  10.1.15.32
3 consul cluster nodes:
  10.1.15.11
  10.1.15.12
  10.1.15.13

Operating system and Environment details

Fedora release 35 (Thirty Five)

Issue

This issue is related to https://github.com/hashicorp/nomad/issues/17973. In that issue, after our leader Node1 had CSI/CPU/memory problems, we initially rebooted it. The cluster then lost leadership and Node2 became the new leader; the cluster worked fine. When Node1 came back online and rejoined the cluster, its CSI state was corrupted and it hung right after joining. The cluster then lost leadership again and no new leader was elected. This time the client nodes were not down, yet all allocations across the whole cluster were restarted.

Reproduction steps

After we removed CSI, the only way to reproduce the issue quickly is to block port 4647 on a non-leader node: `iptables -A INPUT -p tcp --destination-port 4647 -j DROP`. We assume this imitates the issue we had with CSI because it blocks part of a node's functionality rather than all of it. In this case we block nomad-server2 (the non-leader).

Expected Result

We expected the cluster not to fail: the two remaining server nodes are healthy and one of them is the leader, so allocations on the client nodes should not be restarted.

Actual Result

No new leader was elected, and all allocations on the client nodes were restarted. `nomad server members` output on the leader node:

Name                   Address     Port  Status  Leader  Raft Version  Build  Datacenter  Region
nomad-server-1.global  10.1.15.21  4648  alive   true    3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

`nomad server members` output on the non-leader node nomad-server-2 (where port 4647 is blocked):

Name                   Address     Port  Status  Leader  Raft Version  Build  Datacenter  Region
nomad-server-1.global  10.1.15.21  4648  alive   false   3             1.5.6  dc1         global
nomad-server-2.global  10.1.15.22  4648  alive   false   3             1.5.6  dc1         global
nomad-server-3.global  10.1.15.23  4648  alive   false   3             1.5.6  dc1         global

Error determining leaders: 1 error occurred:
        * Region "global": Unexpected response code: 500 (rpc error: failed to get conn: rpc error: lead thread didn't get connection)

Nomad logs

client1.log client2.log server1-leader.log server2.log server3.log

beninghton commented 1 year ago

@tgross btw I tested this case with a 5-server Nomad cluster and the behaviour is exactly the same. I don't think that's acceptable at all: a 5-server cluster should stay alive even after 2 nodes fail. I expect we would see the same result with any number of server nodes, so adding more nodes does not improve high availability; that is not how an HA cluster should behave.

tgross commented 1 year ago

ref https://github.com/hashicorp/nomad/issues/18063#issuecomment-1653593278

beninghton commented 6 months ago

We still have this issue on Nomad 1.7.2. If one of the masters (even a non-leader) becomes unavailable, a few client nodes are lost from the cluster: they go to the DOWN state (visible in the Cluster/Clients UI section), their allocations go to the LOST status and are started on other client nodes. The clients then return to the cluster (become READY again) after a few seconds.

But I don't like that allocations are lost and restarted when one of the masters is down; that is not how HA should work. Losing one master out of 3 should be unnoticeable for allocations in the cluster, I would think. @tgross please advise: maybe this is not a bug but expected behavior? Or maybe we have a misconfiguration in our Nomad cluster that leads to this situation? As I see it, some clients are bound to a specific master, and if that master becomes unavailable they lose it and go down for a short time before reconnecting to another master. We don't point to the master servers' IPs anywhere; we use Consul, and the Consul cluster servers are listed in the "retry_join" section of the Consul clients.

I've attached the log file of one of the lost clients. The master server that was down is 10.1.13.203. client lost allocations logs.txt master leader logs.txt

Client config: nomad_client_hcl2.txt nomad_client_hcl.txt consul_client_hcl.txt nomad_master_hcl.txt
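For reference, the Consul-based server discovery described above boils down to a Nomad client agent configuration roughly like the sketch below. The paths and addresses are placeholders and are not taken from the attached files.

```hcl
# Minimal Nomad client agent sketch (placeholder values, not the attached configs).
data_dir = "/opt/nomad/data"

client {
  enabled = true
  # No static "servers" list: the client discovers the Nomad servers
  # through the local Consul agent instead of fixed master IPs.
}

consul {
  # Local Consul agent; Nomad uses the Consul catalog to find its servers.
  address = "127.0.0.1:8500"
}
```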

jrasell commented 6 months ago

Hi @beninghton and thanks for providing the updated information, which I have looked over. The configuration looks OK and nothing jumps out as something that could be changed or tweaked.

> That is not how HA should work. Losing one master out of 3 should be unnoticeable for allocations in the cluster, I would think. Please advise: maybe this is not a bug but expected behavior? Or maybe we have a misconfiguration in our Nomad cluster that leads to this situation?

I apologise that your current experience is not what you expect; however, Nomad is designed to run in HA and the situation you are experiencing is not expected behaviour. It is also not behaviour I have been able to reproduce in my lab, where I have 3 servers and 2 clients. The logs from my client whose server connection was broken can be seen below, showing a graceful failover to a new connection. The allocations are also listed, showing they were not interrupted.

{"@caller":"github.com/hashicorp/nomad/client/rpc.go:119","@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-05-10T08:23:54.287409+01:00","error":"rpc error: No cluster leader","rpc":"Node.UpdateStatus","server":{"IP":"192.168.1.110","Port":4647,"Zone":""}}
{"@caller":"github.com/hashicorp/nomad/client/rpc.go:136","@level":"error","@message":"error performing RPC to server, deadline exceeded, cannot retry","@module":"client.rpc","@timestamp":"2024-05-10T08:23:54.288209+01:00","error":"rpc error: No cluster leader","rpc":"Node.UpdateStatus"}
{"@caller":"github.com/hashicorp/nomad/client/client.go:1754","@level":"error","@message":"error heartbeating. retrying","@module":"client","@timestamp":"2024-05-10T08:23:54.288393+01:00","error":"failed to update status: rpc error: No cluster leader","period":1033135722}
{"@caller":"github.com/hashicorp/nomad/client/client.go:3031","@level":"error","@message":"error discovering nomad servers","@module":"client","@timestamp":"2024-05-10T08:23:54.289604+01:00","error":"client.consul: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": dial tcp 127.0.0.1:8500: connect: connection refused"}
{"@caller":"github.com/hashicorp/nomad/client/client.go:2345","@level":"debug","@message":"received stale allocation information; retrying","@module":"client","@timestamp":"2024-05-10T08:25:19.618068+01:00","index":53,"min_index":53}
{"@caller":"github.com/hashicorp/nomad/client/servers/manager.go:208","@level":"debug","@message":"new server list","@module":"client.server_mgr","@timestamp":"2024-05-10T08:26:05.651876+01:00","new_servers":[{"Addr":{"IP":"192.168.1.111","Port":4647,"Zone":""}},{"Addr":{"IP":"192.168.1.112","Port":4647,"Zone":""}}],"old_servers":[{"Addr":{"IP":"192.168.1.112","Port":4647,"Zone":""}},{"Addr":{"IP":"192.168.1.110","Port":4647,"Zone":""}},{"Addr":{"IP":"192.168.1.111","Port":4647,"Zone":""}}]}
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
04a271b9  79087648  cache       0        run      running  1h35m ago  1h35m ago
0a7d8b37  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
10e357be  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
3658aacb  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
386bd4b9  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
55902b4c  79087648  cache       0        run      running  1h35m ago  1h35m ago
7e49eedf  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
969ebf84  79087648  cache       0        run      running  1h35m ago  1h35m ago
9bc10a9d  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
b1e99875  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
c6b260c1  79087648  cache       0        run      running  1h35m ago  1h35m ago
cfc07965  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago

I have taken a look through the logs you provided; the server logs do not seem to contain anything I wouldn't expect, given the situation you have described. The client logs have a few interesting lines:

2024-05-10T10:26:37+07:00       2024-05-10T03:26:37.396Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateAlloc server=10.1.13.203:4647
2024-05-10T10:26:37+07:00       2024-05-10T03:26:37.396Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.1.13.203:4647
2024-05-10T10:26:37+07:00       2024-05-10T03:26:37.400Z [INFO]  client.consul: discovered following servers: servers=[10.1.13.203:4647, 10.1.13.202:4647, 10.1.13.201:4647]
2024-05-10T10:26:37+07:00       2024-05-10T03:26:37.404Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: node cb67e458-6c40-d4d3-1c55-582fa5680a0c is not allowed to update allocs while in status down" rpc=Node.UpdateAlloc

This shows that the client -> server RPC fails when the server goes down and the client performs a discovery of the servers so it can connect to a different one. Immediately on connecting to the next server, the RPC is rejected because the server-side state indicates the client is down.

Do you have the leader logs at debug/info level from when the issue occurred, so that I can see what happened leading up to the failure? Do you have network monitoring in place, so we can rule out potential network issues here?
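For context, the window in which a client must successfully heartbeat before the servers mark it "down" is governed by the heartbeat settings in the server block. The sketch below only shows where those knobs live; the values are illustrative assumptions, not a recommended or confirmed fix for this issue.

```hcl
server {
  enabled          = true
  bootstrap_expect = 3

  # Defaults are min_heartbeat_ttl = "10s" and heartbeat_grace = "10s".
  # Larger values give clients more time to ride out a brief server or
  # leadership outage before being marked "down" (illustrative values only).
  min_heartbeat_ttl = "20s"
  heartbeat_grace   = "1m"
}
```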

pavel-z1 commented 6 months ago

Hi @jrasell, I have the same issue as beninghton.

If RPC to the Nomad leader is unavailable for 15-30 seconds (a random delay within this interval), the Nomad clients stop and then restart all allocations in the Nomad cluster.

If I bring down the NIC interface of the VM that holds the Nomad leader role for 15-30 seconds, the issue does not occur; the leader role moves to another Nomad server within 1-4 seconds. If I stop the nomad service on the leader, the issue also does not occur; again the leader role moves to another server within 1-4 seconds.

The issue occurs only when the Nomad RPC port on the leader is temporarily unavailable for 15-30 seconds. At least, this scenario consistently reproduces the problem; it is possible that it can also be reproduced under other circumstances.

My staging Nomad cluster has 3 servers and 4 clients (2 clients with node_pool=stg and 2 clients with node_pool=dev). This debug uses node_pool=stg. Operating system: Rocky Linux release 8.7.

Steps to reproduce issue:

  1. Check members and allocations:
    
    # nomad server members
    Name                    Address      Port  Status  Leader  Raft Version  Build  Datacenter  Region
    stg-nomad-server-01.eu  10.61.0.110  4648  alive   false   3             1.6.3  ams02       eu
    stg-nomad-server-02.eu  10.61.0.108  4648  alive   true    3             1.6.3  ams02       eu
    stg-nomad-server-03.eu  10.61.0.109  4648  alive   false   3             1.6.3  ams02       eu

# nomad node status
ID        Node Pool  DC     Name                 Class  Drain  Eligibility  Status
e872ca8b  stg        ams02  stg-nomad-client-02  stg    false  eligible     ready
478a04d3  dev        ams02  dev-nomad-client-02  dev    false  eligible     ready
3690048f  dev        ams02  dev-nomad-client-01  dev    false  eligible     ready
d1c59226  stg        ams02  stg-nomad-client-01  stg    false  eligible     ready

# nomad node status -allocs d1c59226
ID = d1c59226-9727-6ef7-a05f-d8acd8f7293c

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created    Modified
3e749103  d1c59226  panel_api_ams02  0        run      running  3m33s ago  2m59s ago


2. Drop the network connection to port 4647 for 20 seconds (server `stg-nomad-server-02`):

iptables -A INPUT -p tcp --destination-port 4647 -j DROP; sleep 20; iptables -D INPUT -p tcp --destination-port 4647 -j DROP


3. After step 2, check the allocation status every second on nomad client `d1c59226`:

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created   Modified
5e936c40  d1c59226  panel_api_ams02  0        run      pending  1s ago    1s ago
3e749103  d1c59226  panel_api_ams02  0        stop     lost     4m2s ago  3s ago

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created   Modified
5e936c40  d1c59226  panel_api_ams02  0        run      pending  2s ago    1s ago
3e749103  d1c59226  panel_api_ams02  0        stop     running  4m3s ago  1s ago

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created   Modified
5e936c40  d1c59226  panel_api_ams02  0        run      pending  3s ago    2s ago
3e749103  d1c59226  panel_api_ams02  0        stop     running  4m4s ago  2s ago

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created   Modified
5e936c40  d1c59226  panel_api_ams02  0        run      running  5s ago    1s ago
3e749103  d1c59226  panel_api_ams02  0        stop     running  4m6s ago  4s ago



As we can see, the Nomad client triggered an allocation stop and then a start.
I hope this debug information will help you reproduce the issue.

Fixing this problem is critically important for us, since an unexpected drop of services in the cluster prevents us from using Nomad in our environment.
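A group-level setting that may soften this symptom (allocations flipping to "lost") while the root cause is investigated is `max_client_disconnect`, available since Nomad 1.3: allocations on a client that loses contact with the servers are marked "unknown" instead of "lost" and are reconnected if the client returns within the window. The sketch below is only illustrative; the group name is taken from the output above, while the job name, driver, and image are placeholders, and this does not address the server-side behaviour reported in this issue.

```hcl
job "panel_api" {
  datacenters = ["ams02"]
  node_pool   = "stg"

  group "panel_api_ams02" {
    # If the client cannot reach the servers for up to 10 minutes, keep the
    # allocation as "unknown" instead of rescheduling it as "lost", and
    # reconnect it when the client comes back (illustrative duration).
    max_client_disconnect = "10m"

    task "api" {
      driver = "docker"

      config {
        image = "example/panel-api:latest" # placeholder image
      }
    }
  }
}
```

On Nomad 1.8 and later, the newer group-level `disconnect` block (with `lost_after`) supersedes `max_client_disconnect`, so the same idea can be expressed there.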

Nomad configs:
[nomad-server_config.txt](https://github.com/hashicorp/nomad/files/15385933/nomad-server_config.txt)
[nomad-client_config.txt](https://github.com/hashicorp/nomad/files/15386002/nomad-client_config.txt)

Nomad Logs with debug level:
[nomad-server-3.log](https://github.com/hashicorp/nomad/files/15385946/nomad-server-3.log)
[nomad-server-2_leader.log](https://github.com/hashicorp/nomad/files/15385948/nomad-server-2_leader.log)
[nomad-server-1.log](https://github.com/hashicorp/nomad/files/15385949/nomad-server-1.log)
[nomad-client-1.log](https://github.com/hashicorp/nomad/files/15385951/nomad-client-1.log)
jrasell commented 6 months ago

Hi @pavel-z1 and thanks for adding your context and reproduction. I have read through the issue a couple more times this morning and think I have a better idea of what is happening. I will move this onto our backlog and raise it internally; once it's picked up, engineers will post comments regarding the progress.

beninghton commented 6 months ago

@jrasell I've reproduced it again for you. I just stopped one of our servers (10.1.13.201), which is not the leader. Here are DEBUG logs from the two other servers (10.1.13.202, non-leader, and 10.1.13.203, leader) and from one of the clients that lost all of its allocations at the moment 10.1.13.201 was stopped. client.log server2-10.1.13.202-non-leader.log server3-10.1.13.203-leader.log

Our network is fine, by the way; we don't see any issues with it. There is a 10G link between the nodes and we see no errors on the connection.

pavel-z1 commented 2 months ago

Hi @jrasell, @tgross. I've upgraded Nomad to version v1.8.3 and the issue still exists.

This issue leads to an unstable state of the whole Nomad cluster when network flapping occurs on only one Nomad master server. Could the Nomad team give this issue higher priority?