write0nly opened this issue 3 years ago
The gap of 90s becomes much shorter when using a more comprehensive firewall rule set:
iptables -I INPUT -i eth0 -p tcp -m multiport --dport 8300,8301,8500,8501 -j DROP
iptables -I INPUT -i eth0 -p udp -m multiport --dport 8301 -j DROP
iptables -I INPUT -i eth0 -p udp -m multiport --sport 8301 -j DROP
iptables -I OUTPUT -o eth0 -p tcp -m multiport --dport 443,8300,8301,8500,8501 -j DROP
iptables -I OUTPUT -o eth0 -p udp -m multiport --dport 8301 -j DROP
iptables -I OUTPUT -o eth0 -p udp -m multiport --sport 8301 -j DROP
strace was used to confirm that no UDP or TCP traffic was being exchanged with the consul agent on eth0.
Using the rule set above, consul lock on the fenced node behaves better, but still not 100% predictably. It sometimes kills the child process in less than 30s, but sometimes takes more than 60s, causing node2 to enable the application before it is killed on node1.
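For reference, a minimal sketch of what the handler child presumably looks like (an assumed reconstruction; the actual node1.sh is not shown, and the loop is bounded here for illustration):

```shell
#!/usr/bin/env bash
# Hypothetical handler run as the child of `consul lock` (assumption; the real
# node1.sh is not shown). It prints a timestamp every second and exits cleanly
# when consul lock sends SIGTERM after deciding the lock is lost.
trap 'echo "Child terminated"; exit 0' TERM
iterations=0
for _ in 1 2 3; do   # the real handler loops forever; bounded for the sketch
  date
  iterations=$((iterations + 1))
  sleep 1
done
```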
$ ./node1.sh
Setting up lock at path: kv/myapp/.lock
Attempting lock acquisition
Starting handler
Fri 19 Feb 11:47:45 GMT 2021
Fri 19 Feb 11:47:46 GMT 2021
...
Fri 19 Feb 11:48:00 GMT 2021
Fri 19 Feb 11:48:01 GMT 2021 - fence triggered
Fri 19 Feb 11:48:02 GMT 2021 - 0
...
Fri 19 Feb 11:48:38 GMT 2021 - 36
Fri 19 Feb 11:48:39 GMT 2021 - 37
Fri 19 Feb 11:48:40 GMT 2021 - 38
Fri 19 Feb 11:48:41 GMT 2021 - 39
Fri 19 Feb 11:48:42 GMT 2021 - 40
Fri 19 Feb 11:48:43 GMT 2021 - 41
Fri 19 Feb 11:48:44 GMT 2021 - 42
Fri 19 Feb 11:48:45 GMT 2021 - 43
Fri 19 Feb 11:48:46 GMT 2021 - 44
Fri 19 Feb 11:48:47 GMT 2021 - 45
Fri 19 Feb 11:48:48 GMT 2021 - 46
Fri 19 Feb 11:48:49 GMT 2021 - 47
Fri 19 Feb 11:48:50 GMT 2021 - 48
Fri 19 Feb 11:48:51 GMT 2021 - 49
Fri 19 Feb 11:48:52 GMT 2021 - 50
Fri 19 Feb 11:48:53 GMT 2021 - 51
Fri 19 Feb 11:48:54 GMT 2021 - 52
Fri 19 Feb 11:48:55 GMT 2021 - 53
Fri 19 Feb 11:48:56 GMT 2021 - 54
Fri 19 Feb 11:48:57 GMT 2021 - 55
Fri 19 Feb 11:48:58 GMT 2021 - 56
Fri 19 Feb 11:48:59 GMT 2021 - 57
Fri 19 Feb 11:49:00 GMT 2021 - 58
Fri 19 Feb 11:49:01 GMT 2021 - 59
Fri 19 Feb 11:49:02 GMT 2021 - 60
Fri 19 Feb 11:49:03 GMT 2021 - 61
Error running handler: signal: terminated
Lock release failed: failed to release lock: Unexpected response code: 500 (rpc error making call: stream closed)
On node2 we can see that the app started at 11:48:43, so there was a window of about 20s during which the app was running on both node1 and node2.
$ ./node2.sh
Setting up lock at path: kv/myapp/.lock
Attempting lock acquisition
Starting handler
Fri 19 Feb 11:48:43 GMT 2021 - started!
As one more test, to clarify the behaviour and to avoid the EPERM errors caused by the iptables UDP DROP rules (which could affect the application's send behaviour), the OUTPUT rule was changed to NAT packets with destination port 8301/udp to a random non-existent host on port 8888. This emulates packets being sent and then dropped in the network.
iptables -I INPUT -i eth0 -p tcp -m multiport --dport 8300,8301,8500,8501 -j DROP
iptables -I INPUT -i eth0 -p udp -m multiport --dport 8301 -j DROP
iptables -I INPUT -i eth0 -p udp -m multiport --sport 8301 -j DROP
iptables -I OUTPUT -o eth0 -p tcp -m multiport --dport 443,8300,8301,8500,8501 -j DROP
iptables -t nat -I OUTPUT -p udp -m udp --dport 8301 -j DNAT --to 10.0.0.1:8888
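The distinction matters at the syscall level: an OUTPUT DROP rule makes sendto() fail immediately with EPERM, while a packet NATed towards a black hole is accepted by the kernel and silently lost, which is closer to a real network partition. A minimal illustration of the "accepted and lost" side (127.1.2.3:8888 is an arbitrary loopback address with no listener; bash's /dev/udp pseudo-device is assumed):

```shell
#!/usr/bin/env bash
# A UDP datagram sent towards a dead endpoint is still accepted by sendto() --
# UDP is fire-and-forget -- unlike an iptables OUTPUT DROP, which surfaces as
# EPERM at send time. 127.1.2.3:8888 is an arbitrary address with no listener.
if echo -n ping > /dev/udp/127.1.2.3/8888; then
  result="sendto succeeded (packet silently lost)"
else
  result="sendto failed"
fi
echo "$result"
```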
1- We start the app (a bash script; yes, it runs under bash, not /bin/sh) and trigger the fence at 14:05:27. As can be seen below, the app keeps printing output for 60s and is then killed with SIGTERM:
$ ./node1.sh
Setting up lock at path: kv/myapp/.lock
Attempting lock acquisition
Starting handler
Fri 19 Feb 14:04:57 GMT 2021 - starting myapp
Fri 19 Feb 14:04:57 GMT 2021
Fri 19 Feb 14:04:58 GMT 2021
Fri 19 Feb 14:04:59 GMT 2021
Fri 19 Feb 14:05:00 GMT 2021
Fri 19 Feb 14:05:01 GMT 2021
Fri 19 Feb 14:05:02 GMT 2021
...
Fri 19 Feb 14:05:25 GMT 2021
Fri 19 Feb 14:05:26 GMT 2021
Fri 19 Feb 14:05:27 GMT 2021 - fence triggered!
Fri 19 Feb 14:05:27 GMT 2021 - 0 s since fence triggered
Fri 19 Feb 14:05:28 GMT 2021 - 1 s since fence triggered
Fri 19 Feb 14:05:29 GMT 2021 - 2 s since fence triggered
Fri 19 Feb 14:05:30 GMT 2021 - 3 s since fence triggered
Fri 19 Feb 14:05:31 GMT 2021 - 4 s since fence triggered
Fri 19 Feb 14:05:32 GMT 2021 - 5 s since fence triggered
Fri 19 Feb 14:05:33 GMT 2021 - 6 s since fence triggered
...
Fri 19 Feb 14:06:22 GMT 2021 - 54 s since fence triggered
Fri 19 Feb 14:06:23 GMT 2021 - 55 s since fence triggered
Fri 19 Feb 14:06:24 GMT 2021 - 56 s since fence triggered
Fri 19 Feb 14:06:25 GMT 2021 - 57 s since fence triggered
Fri 19 Feb 14:06:26 GMT 2021 - 58 s since fence triggered
Fri 19 Feb 14:06:27 GMT 2021 - 59 s since fence triggered
Lock lost, killing child
Terminating child pid 14823
Error running handler: signal: terminated
Child terminated
Lock release failed: failed to release lock: Unexpected response code: 500 (rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout)
2- Consul logs on node1 (hostnames have been sanitised):
Feb 19 14:05:53 node1 consul[14724]: 2021-02-19T14:05:53.036Z [INFO] agent.client.memberlist.lan: memberlist: Suspect node-sanitised-xxx has failed, no acks received
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.139Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.142Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=10.4.0.124:8300 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.144Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/kv/myapp/.lock?consistent=&index=3137487334 from=127.0.0.1:64964 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.146Z [ERROR] agent.client: RPC failed to server: method=Session.Renew server=10.4.0.124:8300 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.148Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/renew/1e8876a1-9a0c-da6a-a8b0-c8a1d61a8d94 from=127.0.0.1:64970 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.150Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.4.0.124:8300 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.150Z [ERROR] agent: Coordinate update error: error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.153Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.198Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:05:58 node1 consul[14724]: 2021-02-19T14:05:58.524Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:05:58 node1 consul[14724]: 2021-02-19T14:05:58.540Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:06:05 node1 consul[14724]: 2021-02-19T14:06:05.148Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=10.20.0.234:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.20.0.234:8300: i/o timeout"
Feb 19 14:06:05 node1 consul[14724]: 2021-02-19T14:06:05.150Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/kv/myapp/.lock?consistent= from=127.0.0.1:64980 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.20.0.234:8300: i/o timeout"
Feb 19 14:06:16 node1 consul[14724]: 2021-02-19T14:06:16.154Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=10.2.0.188:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.188:8300: i/o timeout"
Feb 19 14:06:16 node1 consul[14724]: 2021-02-19T14:06:16.157Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/kv/myapp/.lock?consistent= from=127.0.0.1:64988 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.188:8300: i/o timeout"
Feb 19 14:06:16 node1 consul[14724]: 2021-02-19T14:06:16.159Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.2.0.188:8300 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Feb 19 14:06:16 node1 consul[14724]: 2021-02-19T14:06:16.161Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Feb 19 14:06:27 node1 consul[14724]: 2021-02-19T14:06:27.161Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=10.4.0.5:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.4.0.5:8300: i/o timeout"
Feb 19 14:06:27 node1 consul[14724]: 2021-02-19T14:06:27.164Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/kv/myapp/.lock?consistent= from=127.0.0.1:64990 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.4.0.5:8300: i/o timeout"
Feb 19 14:06:28 node1 consul[14724]: 2021-02-19T14:06:28.039Z [INFO] agent.client.memberlist.lan: memberlist: Suspect node-sanitised-xxx has failed, no acks received
Feb 19 14:06:37 node1 consul[14724]: 2021-02-19T14:06:37.169Z [ERROR] agent.client: RPC failed to server: method=KVS.Apply server=10.2.0.1:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout"
Feb 19 14:06:37 node1 consul[14724]: 2021-02-19T14:06:37.172Z [ERROR] agent.http: Request error: method=PUT url=/v1/kv/kv/myapp/.lock?flags=3304740253564472344&release=1e8876a1-9a0c-da6a-a8b0-c8a1d61a8d94 from=127.0.0.1:64996 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout"
Feb 19 14:06:51 node1 consul[14724]: 2021-02-19T14:06:51.639Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.4.0.124:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.4.0.124:8300: i/o timeout"
Feb 19 14:06:51 node1 consul[14724]: 2021-02-19T14:06:51.642Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.4.0.124:8300: i/o timeout"
Feb 19 14:07:08 node1 consul[14724]: 2021-02-19T14:07:08.042Z [INFO] agent.client.memberlist.lan: memberlist: Suspect node-sanitised-xxx has failed, no acks received
Feb 19 14:07:11 node1 consul[14724]: 2021-02-19T14:07:11.960Z [ERROR] agent.client: RPC failed to server: method=Catalog.NodeServiceList server=10.2.0.188:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.188:8300: i/o timeout"
Feb 19 14:07:11 node1 consul[14724]: 2021-02-19T14:07:11.962Z [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.188:8300: i/o timeout"
Feb 19 14:07:11 node1 consul[14724]: 2021-02-19T14:07:11.960Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.2.0.188:8300 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Feb 19 14:07:11 node1 consul[14724]: 2021-02-19T14:07:11.966Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Feb 19 14:07:39 node1 consul[14724]: 2021-02-19T14:07:39.515Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.20.0.234:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.20.0.234:8300: i/o timeout"
Feb 19 14:07:39 node1 consul[14724]: 2021-02-19T14:07:39.517Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.20.0.234:8300: i/o timeout"
Feb 19 14:07:48 node1 consul[14724]: 2021-02-19T14:07:48.048Z [INFO] agent.client.memberlist.lan: memberlist: Suspect node-sanitised-xxx has failed, no acks received
Feb 19 14:08:04 node1 consul[14724]: 2021-02-19T14:08:04.942Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.2.0.1:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout"
Feb 19 14:08:04 node1 consul[14724]: 2021-02-19T14:08:04.944Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout"
3- Meanwhile, on node2:
$ ./node2.sh
Setting up lock at path: kv/myapp/.lock
Attempting lock acquisition
Starting handler
Fri 19 Feb 14:06:08 GMT 2021 - started!
Therefore the app on node2 started at 14:06:08, while the app on node1 was not killed until 14:06:27-28, so the app ran in both places for about 20s while supposedly protected by a mutex.
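The overlap can be sanity-checked directly from the timestamps above (GNU date assumed, as on the CentOS 7 hosts in this report):

```shell
# Convert the two GMT timestamps from the logs above to epoch seconds and
# compute the window during which both nodes were running the app.
node2_start=$(date -u -d '2021-02-19 14:06:08' +%s)   # node2: "started!"
node1_kill=$(date -u -d '2021-02-19 14:06:28' +%s)    # node1: child terminated
overlap=$(( node1_kill - node2_start ))
echo "dual-active window: ${overlap}s"
```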
4- Any help on this would be appreciated. If the answer is that Consul locking cannot completely guarantee mutual exclusion, and/or depends on other operations taking place (such as heartbeats, or reads/writes to the KV store), then we would like to understand that before continuing to consider Consul for this purpose.
We would also like to know whether consul lock can be made more robust, e.g. with further checks or by tweaking the timeouts, which could perhaps work around this behaviour.
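One possible belt-and-braces workaround (our assumption, not a documented Consul pattern) is a self-fence inside the handler: besides relying on consul lock to kill the child, the handler verifies its own session every second and exits if the session cannot be confirmed within a deadline. CHECK_CMD and the deadline are placeholders; in practice CHECK_CMD could query the local agent's session endpoint.

```shell
#!/usr/bin/env bash
# Sketch of a self-fencing handler loop (assumption, not a documented pattern).
# CHECK_CMD is a placeholder that must exit 0 while the session is confirmed
# healthy; `true` here stands in for a real session check against the agent.
CHECK_CMD=${CHECK_CMD:-true}
deadline=10                    # assumed maximum tolerated uncertainty, seconds
last_ok=$(date +%s)
status=running
for _ in 1 2 3; do             # the real handler loops forever; bounded here
  if $CHECK_CMD; then last_ok=$(date +%s); fi
  if [ $(( $(date +%s) - last_ok )) -ge "$deadline" ]; then
    status=self-fenced
    echo "session unconfirmed for ${deadline}s, self-fencing" >&2
    break
  fi
  sleep 1
done
echo "handler status: $status"
```

With a tight deadline this bounds the dual-active window to roughly the deadline, at the cost of false-positive self-fences during transient agent hiccups.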
Overview of the Issue
consul lock does not kill the child process until ~90s after the locking timeout.
If node1 and node2 use a consul lock (n=1) to coordinate active/passive states for an application, and the active node gets fenced off from all communication with the consul leader, the consul lock command on the fenced node does not kill the app strictly after the 15s+15s timeout, but only after 15s+15s+90s. Because node2 is granted the lock 15s+15s after node1 disappears, both apps end up active at the same time.
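In back-of-envelope terms (the 15s values and the 90s figure are observations from this report, not documented Consul constants):

```shell
# Rough arithmetic for the dual-active window described above.
ttl=15                 # observed session timeout component (assumed)
lock_delay=15          # observed lock-delay component (assumed)
extra_kill_latency=90  # observed extra delay before the child is killed

node2_acquires_after=$(( ttl + lock_delay ))
node1_killed_after=$(( ttl + lock_delay + extra_kill_latency ))
dual_active_window=$(( node1_killed_after - node2_acquires_after ))

echo "node2 acquires after ${node2_acquires_after}s"
echo "node1 child killed after ${node1_killed_after}s"
echo "dual-active window ${dual_active_window}s"
```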
Reproduction Steps
Consul info for both Client and Server
Note that the nodes are running consul 1.9.2 and the consul masters themselves are running 1.6.1, all on Linux CentOS 7. All consul binaries were downloaded directly from HashiCorp, not built from source.