beninghton opened 1 year ago
@tgross BTW, I tested this case with a 5-server Nomad cluster, and the behaviour is exactly the same. I don't think that's acceptable at all: a 5-server cluster should stay healthy even after 2 servers fail (a 5-node Raft cluster has a quorum of 3). I suspect we would see the same result with any number of server nodes, so adding more nodes does not actually provide better availability, which is not how a proper HA cluster should behave.
We still have this issue on Nomad 1.7.2. If one of the masters (even a non-leader) becomes unavailable, a few client nodes are lost from the cluster. They go to the DOWN state (visible in the Cluster/Clients section of the UI), their allocations go to the LOST status and are started on other client nodes. The affected clients then return to the cluster (become READY again) after a few seconds.
But I don't like that allocations are lost and restarted when one of the masters is down; that is not how HA should work. Losing one master out of 3 should be unnoticeable to allocations in the cluster, I would expect. @tgross please advise: is this expected behaviour rather than a bug, or do we perhaps have a misconfiguration in our Nomad cluster that leads to this situation? As I see it, some clients are bound to a specific master; if that master becomes unavailable, they lose it and go down for a short time, then reconnect to another master. We don't point to the master servers' IPs anywhere; we use Consul, and the Consul server addresses are listed in the "retry_join" section of the Consul clients.
I've attached the log file of one of the lost clients. The master server that was down is 10.1.13.203. client lost allocations logs.txt master leader logs.txt
Client config: nomad_client_hcl2.txt nomad_client_hcl.txt consul_client_hcl.txt nomad_master_hcl.txt
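For reference, server discovery can be double-checked on an affected client with something like the following (a sketch, assuming default Consul ports and the default `nomad` server service name in Consul):

```shell
# List the Nomad servers registered in Consul (default server service name "nomad"):
consul catalog nodes -service=nomad

# Ask the client's local Nomad agent which servers it currently knows about:
nomad operator api /v1/agent/servers
```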
Hi @beninghton and thanks for providing the updated information, which I have looked over. The configuration looks OK and nothing jumps out as an item that could be changed or tweaked.
> That is not how HA should work. Losing one master out of 3 should be unnoticeable to allocations in the cluster, I would expect. Please advise: is this expected behaviour rather than a bug, or do we perhaps have a misconfiguration in our Nomad cluster that leads to this situation?
I'm sorry that your current experience is not what you expect; however, Nomad is designed to run in HA, and the situation you are experiencing is not expected behaviour. It is also not behaviour I have been able to reproduce in my lab, where I have 3 servers and 2 clients. The logs from my client whose server connection was broken can be seen below, showing a graceful failover to a new connection. The allocations can also be seen, showing they were not interrupted.
```
{"@caller":"github.com/hashicorp/nomad/client/rpc.go:119","@level":"error","@message":"error performing RPC to server","@module":"client.rpc","@timestamp":"2024-05-10T08:23:54.287409+01:00","error":"rpc error: No cluster leader","rpc":"Node.UpdateStatus","server":{"IP":"192.168.1.110","Port":4647,"Zone":""}}
{"@caller":"github.com/hashicorp/nomad/client/rpc.go:136","@level":"error","@message":"error performing RPC to server, deadline exceeded, cannot retry","@module":"client.rpc","@timestamp":"2024-05-10T08:23:54.288209+01:00","error":"rpc error: No cluster leader","rpc":"Node.UpdateStatus"}
{"@caller":"github.com/hashicorp/nomad/client/client.go:1754","@level":"error","@message":"error heartbeating. retrying","@module":"client","@timestamp":"2024-05-10T08:23:54.288393+01:00","error":"failed to update status: rpc error: No cluster leader","period":1033135722}
{"@caller":"github.com/hashicorp/nomad/client/client.go:3031","@level":"error","@message":"error discovering nomad servers","@module":"client","@timestamp":"2024-05-10T08:23:54.289604+01:00","error":"client.consul: unable to query Consul datacenters: Get \"http://127.0.0.1:8500/v1/catalog/datacenters\": dial tcp 127.0.0.1:8500: connect: connection refused"}
{"@caller":"github.com/hashicorp/nomad/client/client.go:2345","@level":"debug","@message":"received stale allocation information; retrying","@module":"client","@timestamp":"2024-05-10T08:25:19.618068+01:00","index":53,"min_index":53}
{"@caller":"github.com/hashicorp/nomad/client/servers/manager.go:208","@level":"debug","@message":"new server list","@module":"client.server_mgr","@timestamp":"2024-05-10T08:26:05.651876+01:00","new_servers":[{"Addr":{"IP":"192.168.1.111","Port":4647,"Zone":""}},{"Addr":{"IP":"192.168.1.112","Port":4647,"Zone":""}}],"old_servers":[{"Addr":{"IP":"192.168.1.112","Port":4647,"Zone":""}},{"Addr":{"IP":"192.168.1.110","Port":4647,"Zone":""}},{"Addr":{"IP":"192.168.1.111","Port":4647,"Zone":""}}]}
```
```
Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created    Modified
04a271b9  79087648  cache       0        run      running  1h35m ago  1h35m ago
0a7d8b37  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
10e357be  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
3658aacb  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
386bd4b9  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
55902b4c  79087648  cache       0        run      running  1h35m ago  1h35m ago
7e49eedf  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
969ebf84  79087648  cache       0        run      running  1h35m ago  1h35m ago
9bc10a9d  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
b1e99875  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
c6b260c1  79087648  cache       0        run      running  1h35m ago  1h35m ago
cfc07965  7c11d6d3  cache       0        run      running  1h35m ago  1h35m ago
```
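For reference, this check can be repeated after a failover; unchanged Created/Modified timestamps indicate the allocations were not replaced (a sketch, assuming the default `example` job here, whose task group is `cache`):

```shell
# Re-check the allocations after the failover; unchanged Created/Modified
# timestamps indicate the allocations were not rescheduled.
nomad job status example
```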
I have taken a look through the logs you provided; the server logs do not seem to contain anything I wouldn't expect, given the situation you have described. The client logs have a few interesting lines:
```
2024-05-10T10:26:37+07:00 2024-05-10T03:26:37.396Z [ERROR] client.rpc: error performing RPC to server which is not safe to automatically retry: error="rpc error: EOF" rpc=Node.UpdateAlloc server=10.1.13.203:4647
2024-05-10T10:26:37+07:00 2024-05-10T03:26:37.396Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: EOF" rpc=Node.GetClientAllocs server=10.1.13.203:4647
2024-05-10T10:26:37+07:00 2024-05-10T03:26:37.400Z [INFO] client.consul: discovered following servers: servers=[10.1.13.203:4647, 10.1.13.202:4647, 10.1.13.201:4647]
2024-05-10T10:26:37+07:00 2024-05-10T03:26:37.404Z [ERROR] client.rpc: error performing RPC to server: error="rpc error: node cb67e458-6c40-d4d3-1c55-582fa5680a0c is not allowed to update allocs while in status down" rpc=Node.UpdateAlloc
```
It shows that the client -> server RPC fails when the server goes down, and the client performs a discovery of the servers so it can connect to a different one. Immediately upon connecting to the next server, the RPC is rejected because the state held for the client indicates it is down.
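As a quick check on your side (assuming the node ID prefix from the log line above), the node's event history should show it being marked down around that time:

```shell
# Show the client node's details, including its Node Events section;
# the node ID prefix is taken from the "not allowed to update allocs" log line.
nomad node status -verbose cb67e458
```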
Do you have the leader logs at debug/info level from when the issue occurred, so that I can see what happened leading up to the failure? Do you have network monitoring in place, so we can rule out potential network issues?
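If restarting the servers with `log_level = "DEBUG"` is inconvenient, one way to capture this (a sketch; the server name is a placeholder) is to stream the logs while reproducing the issue:

```shell
# Stream DEBUG-level logs from a specific server into a file while the
# outage is reproduced; replace <server-name> with the leader's name.
nomad monitor -log-level=DEBUG -server-id=<server-name> > leader-debug.log
```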
Hi @jrasell, I have the same issue as beninghton.
If RPC on the Nomad leader is unavailable for 15-30 seconds (the exact delay within that interval appears random), the Nomad clients stop and then restart all allocations in the cluster.
If I bring down the NIC on the VM that holds the Nomad leader role for 15-30 seconds, the issue does not occur: the leader role moves to another Nomad server within 1-4 seconds. If I stop the nomad service on the leader, the issue also does not occur: again, the leader role moves to another server within 1-4 seconds.
The issue occurs only when the Nomad RPC port on the leader is temporarily unavailable for 15-30 seconds. At least, this scenario consistently reproduces the problem; it may also be reproducible under other circumstances.
My staging Nomad cluster has 3 servers and 4 clients (2 clients with node_pool=stg and 2 clients with node_pool=dev). This debug session uses node_pool=stg. Operating system: Rocky Linux release 8.7.
Steps to reproduce the issue:

1. Check the initial cluster state: server members, client nodes, and the allocation running on client `d1c59226` (ID = d1c59226-9727-6ef7-a05f-d8acd8f7293c).

```
# nomad server members
Name                    Address      Port  Status  Leader  Raft Version  Build  Datacenter  Region
stg-nomad-server-01.eu  10.61.0.110  4648  alive   false   3             1.6.3  ams02       eu
stg-nomad-server-02.eu  10.61.0.108  4648  alive   true    3             1.6.3  ams02       eu
stg-nomad-server-03.eu  10.61.0.109  4648  alive   false   3             1.6.3  ams02       eu

ID        Node Pool  DC     Name                 Class  Drain  Eligibility  Status
e872ca8b  stg        ams02  stg-nomad-client-02  stg    false  eligible     ready
478a04d3  dev        ams02  dev-nomad-client-02  dev    false  eligible     ready
3690048f  dev        ams02  dev-nomad-client-01  dev    false  eligible     ready
d1c59226  stg        ams02  stg-nomad-client-01  stg    false  eligible     ready

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created    Modified
3e749103  d1c59226  panel_api_ams02  0        run      running  3m33s ago  2m59s ago
```
2. Drop the network connection to port 4647 for 20 seconds on server `stg-nomad-server-02` (the current leader):

```
iptables -A INPUT -p tcp --destination-port 4647 -j DROP; sleep 20; iptables -D INPUT -p tcp --destination-port 4647 -j DROP
```
3. After step 2, check the allocation status every second on nomad client `d1c59226`:
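A minimal polling loop for this (assuming the job is named after its task group, `panel_api_ams02`; any equivalent status check works):

```shell
# Print the Allocations section of the job status once per second.
while true; do
  nomad job status panel_api_ams02 | grep -A 10 '^Allocations'
  sleep 1
done
```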
```
Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created   Modified
5e936c40  d1c59226  panel_api_ams02  0        run      pending  1s ago    1s ago
3e749103  d1c59226  panel_api_ams02  0        stop     lost     4m2s ago  3s ago

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created   Modified
5e936c40  d1c59226  panel_api_ams02  0        run      pending  2s ago    1s ago
3e749103  d1c59226  panel_api_ams02  0        stop     running  4m3s ago  1s ago

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created   Modified
5e936c40  d1c59226  panel_api_ams02  0        run      pending  3s ago    2s ago
3e749103  d1c59226  panel_api_ams02  0        stop     running  4m4s ago  2s ago

Allocations
ID        Node ID   Task Group       Version  Desired  Status   Created   Modified
5e936c40  d1c59226  panel_api_ams02  0        run      running  5s ago    1s ago
3e749103  d1c59226  panel_api_ams02  0        stop     running  4m6s ago  4s ago
```
As we can see, the allocation on the Nomad client was stopped (marked lost) and a replacement was then started.
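The reason for the replacement can also be confirmed from the allocation's event history (using the allocation ID from the output above):

```shell
# Show the stopped allocation's status and recent events; an event indicating
# the node was marked down/lost around the outage window confirms the
# server-side rescheduling.
nomad alloc status 3e749103
```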
I hope this debug information helps you reproduce the issue.
Fixing this problem is critically important for us, since an unexpected drop of services in the cluster prevents us from using Nomad in our environment.
Nomad configs:
[nomad-server_config.txt](https://github.com/hashicorp/nomad/files/15385933/nomad-server_config.txt)
[nomad-client_config.txt](https://github.com/hashicorp/nomad/files/15386002/nomad-client_config.txt)
Nomad Logs with debug level:
[nomad-server-3.log](https://github.com/hashicorp/nomad/files/15385946/nomad-server-3.log)
[nomad-server-2_leader.log](https://github.com/hashicorp/nomad/files/15385948/nomad-server-2_leader.log)
[nomad-server-1.log](https://github.com/hashicorp/nomad/files/15385949/nomad-server-1.log)
[nomad-client-1.log](https://github.com/hashicorp/nomad/files/15385951/nomad-client-1.log)
Hi @pavel-z1 and thanks for adding your context and reproduction. I have read through the issue a couple more times this morning and think I have a better idea of what is happening. I will move this onto our backlog and raise it internally; once it's picked up, engineers will post comments regarding the progress.
@jrasell I've reproduced it again for you. I just stopped one of our servers (10.1.13.201), which is not the leader. Here I'm providing DEBUG logs from the two other servers (10.1.13.202, non-leader, and 10.1.13.203, leader) and from one of the clients that lost all of its allocations at the moment 10.1.13.201 was stopped. client.log server2-10.1.13.202-non-leader.log server3-10.1.13.203-leader.log
Our network is fine, by the way; we don't see any issues with it. It's a 10G link between the nodes and we see no errors on the connections.
Hi @jrasell, @tgross,
I've upgraded Nomad to version v1.8.3 and the issue still exists.
This issue leads to an unstable state of the whole Nomad cluster whenever network flapping affects even a single Nomad master server. Could the Nomad team give this issue higher priority?
Nomad version
```
Nomad v1.5.6
BuildDate 2023-05-19T18:26:13Z
Revision https://github.com/hashicorp/nomad/commit/8af70885c02ab921dedbdf6bc406a1e886866f80
```
Cluster structure
3 master nodes: 10.1.15.21 (leader), 10.1.15.22, 10.1.15.23
2 client nodes: 10.1.15.31, 10.1.15.32
3 Consul cluster nodes: 10.1.15.11, 10.1.15.12, 10.1.15.13
Operating system and Environment details
Fedora release 35 (Thirty Five)
Issue
This issue is related to https://github.com/hashicorp/nomad/issues/17973. In that issue, after our leader Node1 had CSI/CPU/memory problems, we initially rebooted it. The cluster then lost leadership and Node2 became the new leader; the cluster worked fine. When Node1 came back online and joined the cluster again, it was still CSI-corrupted and hung right after joining, and the cluster lost leadership again with no new leader elected. But this time the client nodes were not down, and all allocations in the whole cluster were restarted.
Reproduction steps
After we removed CSI, the only way we found to reproduce the issue quickly is to block port 4647 on a non-leader node:

```
iptables -A INPUT -p tcp --destination-port 4647 -j DROP
```
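With the rule in place, it is easy to confirm that only the RPC port is affected while gossip stays reachable (a quick check, assuming bash and coreutils `timeout` are available, and that nomad-server2 is 10.1.15.22 per the cluster layout above):

```shell
# From another node: RPC (4647) should now time out, while the serf/gossip
# port (4648) should still accept connections, i.e. only part of the
# server's functionality is blocked.
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/10.1.15.22/4647' && echo "4647 open" || echo "4647 blocked"
timeout 3 bash -c 'cat < /dev/null > /dev/tcp/10.1.15.22/4648' && echo "4648 open" || echo "4648 blocked"
```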
We assume this imitates the issue we had with CSI, because it blocks not all, but only part of the functionality of the node. In this case we block nomad-server2 (non-leader).

Expected Result
We expected the cluster not to fail: the two remaining server nodes are fine and one of them is the leader, so allocations on the client nodes should not be restarted.
Actual Result
A new leader was not elected, and all allocations on the client nodes were restarted.
`nomad server members` output on leader node:

`nomad server members` output on non-leader node nomad-server-2 (where 4647 is blocked):

Nomad logs
client1.log client2.log server1-leader.log server2.log server3.log