hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

consul lock does not enforce SIGTERM to child process within 30s #9788

Open write0nly opened 3 years ago

write0nly commented 3 years ago

Overview of the Issue

consul lock does not kill the child process until roughly 90s after the lock timeout.

If node1 and node2 use a Consul lock (n=1) to coordinate active/passive states of the application and the active node gets fenced off from all communication with the Consul leader, the consul lock command on the fenced-off node does not kill the app strictly after the 15s + 15s timeout, but only after 15s + 15s + 90s. Because node2 is granted the lock 15s + 15s after node1 disappears, both apps end up active at the same time.

Reproduction Steps

  1. On node1, launch a process protected by an n=1 lock (using a localhost connection to the consul agent running locally on localhost:8500). node1 is immediately granted the lock and myapp starts.
     consul lock --name lock-node1 --verbose kv/myapp/.lock myapp
  2. On node2, launch a process attempting to acquire the same lock as node1 (again via the local consul agent on localhost:8500). node2 does not get the lock and therefore keeps waiting.
     consul lock --name lock-node1 --verbose kv/myapp/.lock myapp
  3. Sessions look OK when we inspect them with curl (a sample command is sketched after these steps):
[
  {
    "ID": "0cb7dd51-c32a-ef90-cc44-79542792a5b2",
    "Name": "lock-node1",
    "Node": "node1",
    "Checks": [
      "serfHealth"
    ],
    "LockDelay": 15000000000,
    "Behavior": "release",
    "TTL": "15s",
    "CreateIndex": 3137459821,
    "ModifyIndex": 3137459821
  },
  4. Use iptables on node1 to firewall it off completely from all access to consul:
    iptables -I OUTPUT -p tcp -m multiport --dport 443,8300,8500 -j DROP
  5. Watch consul on both nodes and notice the standard 15s TTL timeout plus 15s of lock-delay grace time elapse; once that elapses, node2's consul lock process executes myapp on node2.
  6. On node1, myapp running under the protection of the consul lock continues to run, so myapp is now running in both places at the same time, which goes against the expected behaviour.
  7. Wait a further 90s and notice that node1 finally kills the app with SIGTERM:
    Lock lost, killing child
    Terminating child pid 5504
    Error running handler: signal: terminated
    Child terminated
    Lock release failed: failed to release lock: Put "http://127.0.0.1:8500/v1/kv/kv/myapp/.lock myapp?flags=3304740253564472344&release=9713c575-6b28-029a-b3ca-78eb4beb3519": dial tcp 127.0.0.1:8500: i/o timeout
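
For reference, the session dump in step 3 can be pulled from the local agent roughly like this (assuming the agent on localhost:8500 and no ACL token); the second call shows which session currently holds the lock key:

    curl -s http://localhost:8500/v1/session/list
    curl -s http://localhost:8500/v1/kv/kv/myapp/.lock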

Consul info for both Client and Server

Note that the nodes are running Consul 1.9.2 and the Consul servers themselves are running 1.6.1, all on Linux (CentOS 7). All consul binaries were downloaded directly from HashiCorp, not built from source.
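
For completeness, the exact agent build and runtime details on each node can be gathered with the standard subcommands (output omitted here):

    consul version
    consul info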

write0nly commented 3 years ago

The 90s gap becomes much shorter when using a more comprehensive firewall rule set:

    iptables -I INPUT -i eth0 -p tcp -m multiport --dport 8300,8301,8500,8501 -j DROP
    iptables -I INPUT -i eth0 -p udp -m multiport --dport 8301 -j DROP
    iptables -I INPUT -i eth0 -p udp -m multiport --sport 8301 -j DROP

    iptables -I OUTPUT -o eth0 -p tcp -m multiport --dport 443,8300,8301,8500,8501 -j DROP
    iptables -I OUTPUT -o eth0 -p udp -m multiport --dport 8301 -j DROP
    iptables -I OUTPUT -o eth0 -p udp -m multiport --sport 8301 -j DROP

strace was used to confirm that the consul agent was exchanging no UDP or TCP traffic over eth0.
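
For anyone reproducing this, the absence of traffic can be double-checked roughly as follows (the process name consul and the interface eth0 are specific to this setup):

    # attach to the local agent and watch its network syscalls
    strace -f -e trace=network -p "$(pidof consul)"

    # or confirm on the wire that nothing reaches the consul ports over eth0
    tcpdump -ni eth0 'port 8300 or port 8301 or port 8500 or port 8501'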

With the rule set above, consul lock on the fenced system behaves better, but still not 100% predictably: it sometimes kills the child process in less than 30s, but sometimes it takes more than 60s, causing node2 to enable the application before it has been killed on node1.

$ ./node1.sh
Setting up lock at path: kv/myapp/.lock
Attempting lock acquisition
Starting handler
Fri 19 Feb 11:47:45 GMT 2021
Fri 19 Feb 11:47:46 GMT 2021
...
Fri 19 Feb 11:48:00 GMT 2021
Fri 19 Feb 11:48:01 GMT 2021 - fence triggered
Fri 19 Feb 11:48:02 GMT 2021 - 0
...
Fri 19 Feb 11:48:38 GMT 2021 - 36
Fri 19 Feb 11:48:39 GMT 2021 - 37
Fri 19 Feb 11:48:40 GMT 2021 - 38
Fri 19 Feb 11:48:41 GMT 2021 - 39
Fri 19 Feb 11:48:42 GMT 2021 - 40
Fri 19 Feb 11:48:43 GMT 2021 - 41
Fri 19 Feb 11:48:44 GMT 2021 - 42
Fri 19 Feb 11:48:45 GMT 2021 - 43
Fri 19 Feb 11:48:46 GMT 2021 - 44
Fri 19 Feb 11:48:47 GMT 2021 - 45
Fri 19 Feb 11:48:48 GMT 2021 - 46
Fri 19 Feb 11:48:49 GMT 2021 - 47
Fri 19 Feb 11:48:50 GMT 2021 - 48
Fri 19 Feb 11:48:51 GMT 2021 - 49
Fri 19 Feb 11:48:52 GMT 2021 - 50
Fri 19 Feb 11:48:53 GMT 2021 - 51
Fri 19 Feb 11:48:54 GMT 2021 - 52
Fri 19 Feb 11:48:55 GMT 2021 - 53
Fri 19 Feb 11:48:56 GMT 2021 - 54
Fri 19 Feb 11:48:57 GMT 2021 - 55
Fri 19 Feb 11:48:58 GMT 2021 - 56
Fri 19 Feb 11:48:59 GMT 2021 - 57
Fri 19 Feb 11:49:00 GMT 2021 - 58
Fri 19 Feb 11:49:01 GMT 2021 - 59
Fri 19 Feb 11:49:02 GMT 2021 - 60
Fri 19 Feb 11:49:03 GMT 2021 - 61
Error running handler: signal: terminated
Lock release failed: failed to release lock: Unexpected response code: 500 (rpc error making call: stream closed)

Then on node2 we can see that the app started at 11:48:43, so we had about 20s where the app was running on both node1 and node2.

$ ./node2.sh
Setting up lock at path: kv/myapp/.lock
Attempting lock acquisition
Starting handler
Fri 19 Feb 11:48:43 GMT 2021 - started!
write0nly commented 3 years ago

To clarify the behaviour further, and to avoid the EPERM errors caused by the iptables UDP DROP rules (which could be affecting the send behaviour of the app), one more test was run in which the rule was changed to DNAT packets with destination port 8301/udp to a random non-existent host on port 8888. This emulates the case where the packets are sent but then dropped somewhere in the network.

    iptables -I INPUT -i eth0 -p tcp -m multiport --dport 8300,8301,8500,8501 -j DROP
    iptables -I INPUT -i eth0 -p udp -m multiport --dport 8301 -j DROP
    iptables -I INPUT -i eth0 -p udp -m multiport --sport 8301 -j DROP

    iptables -I OUTPUT -o eth0 -p tcp -m multiport --dport 443,8300,8301,8500,8501 -j DROP
    iptables -t nat -I OUTPUT -p udp -m udp --dport 8301 -j DNAT --to 10.0.0.1:8888
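
A quick way to check that the redirect is actually taking effect (10.0.0.1:8888 is just the sink from the rule above):

    # gossip packets should now appear on the wire addressed to the non-existent sink
    tcpdump -ni eth0 'udp and dst host 10.0.0.1 and dst port 8888'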

1- We start the app (a bash script, and yes, it runs under bash, not /bin/sh) and trigger the fence at 14:05:27, as can be seen below. The app still prints output for 60s and is only then killed with SIGTERM:

$ ./node1.sh
Setting up lock at path: kv/myapp/.lock
Attempting lock acquisition
Starting handler
Fri 19 Feb 14:04:57 GMT 2021 - starting myapp
Fri 19 Feb 14:04:57 GMT 2021
Fri 19 Feb 14:04:58 GMT 2021
Fri 19 Feb 14:04:59 GMT 2021
Fri 19 Feb 14:05:00 GMT 2021
Fri 19 Feb 14:05:01 GMT 2021
Fri 19 Feb 14:05:02 GMT 2021
...
Fri 19 Feb 14:05:25 GMT 2021
Fri 19 Feb 14:05:26 GMT 2021
Fri 19 Feb 14:05:27 GMT 2021 - fence triggered!
Fri 19 Feb 14:05:27 GMT 2021 - 0 s since fence triggered
Fri 19 Feb 14:05:28 GMT 2021 - 1 s since fence triggered
Fri 19 Feb 14:05:29 GMT 2021 - 2 s since fence triggered
Fri 19 Feb 14:05:30 GMT 2021 - 3 s since fence triggered
Fri 19 Feb 14:05:31 GMT 2021 - 4 s since fence triggered
Fri 19 Feb 14:05:32 GMT 2021 - 5 s since fence triggered
Fri 19 Feb 14:05:33 GMT 2021 - 6 s since fence triggered
...
Fri 19 Feb 14:06:22 GMT 2021 - 54 s since fence triggered
Fri 19 Feb 14:06:23 GMT 2021 - 55 s since fence triggered
Fri 19 Feb 14:06:24 GMT 2021 - 56 s since fence triggered
Fri 19 Feb 14:06:25 GMT 2021 - 57 s since fence triggered
Fri 19 Feb 14:06:26 GMT 2021 - 58 s since fence triggered
Fri 19 Feb 14:06:27 GMT 2021 - 59 s since fence triggered
Lock lost, killing child
Terminating child pid 14823
Error running handler: signal: terminated
Child terminated
Lock release failed: failed to release lock: Unexpected response code: 500 (rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout)

2- Consul logs on node1 (logs had to be sanitised):

Feb 19 14:05:53 node1 consul[14724]: 2021-02-19T14:05:53.036Z [INFO]  agent.client.memberlist.lan: memberlist: Suspect node-sanitised-xxx has failed, no acks received
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.139Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.142Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=10.4.0.124:8300 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.144Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/kv/myapp/.lock?consistent=&index=3137487334 from=127.0.0.1:64964 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.146Z [ERROR] agent.client: RPC failed to server: method=Session.Renew server=10.4.0.124:8300 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.148Z [ERROR] agent.http: Request error: method=PUT url=/v1/session/renew/1e8876a1-9a0c-da6a-a8b0-c8a1d61a8d94 from=127.0.0.1:64970 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.150Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.4.0.124:8300 error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.150Z [ERROR] agent: Coordinate update error: error="rpc error making call: EOF"
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.153Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:05:54 node1 consul[14724]: 2021-02-19T14:05:54.198Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:05:58 node1 consul[14724]: 2021-02-19T14:05:58.524Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:05:58 node1 consul[14724]: 2021-02-19T14:05:58.540Z [ERROR] agent: yamux: keepalive failed: i/o deadline reached
Feb 19 14:06:05 node1 consul[14724]: 2021-02-19T14:06:05.148Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=10.20.0.234:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.20.0.234:8300: i/o timeout"
Feb 19 14:06:05 node1 consul[14724]: 2021-02-19T14:06:05.150Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/kv/myapp/.lock?consistent= from=127.0.0.1:64980 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.20.0.234:8300: i/o timeout"
Feb 19 14:06:16 node1 consul[14724]: 2021-02-19T14:06:16.154Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=10.2.0.188:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.188:8300: i/o timeout"
Feb 19 14:06:16 node1 consul[14724]: 2021-02-19T14:06:16.157Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/kv/myapp/.lock?consistent= from=127.0.0.1:64988 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.188:8300: i/o timeout"
Feb 19 14:06:16 node1 consul[14724]: 2021-02-19T14:06:16.159Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.2.0.188:8300 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Feb 19 14:06:16 node1 consul[14724]: 2021-02-19T14:06:16.161Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Feb 19 14:06:27 node1 consul[14724]: 2021-02-19T14:06:27.161Z [ERROR] agent.client: RPC failed to server: method=KVS.Get server=10.4.0.5:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.4.0.5:8300: i/o timeout"
Feb 19 14:06:27 node1 consul[14724]: 2021-02-19T14:06:27.164Z [ERROR] agent.http: Request error: method=GET url=/v1/kv/kv/myapp/.lock?consistent= from=127.0.0.1:64990 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.4.0.5:8300: i/o timeout"
Feb 19 14:06:28 node1 consul[14724]: 2021-02-19T14:06:28.039Z [INFO]  agent.client.memberlist.lan: memberlist: Suspect node-sanitised-xxx has failed, no acks received
Feb 19 14:06:37 node1 consul[14724]: 2021-02-19T14:06:37.169Z [ERROR] agent.client: RPC failed to server: method=KVS.Apply server=10.2.0.1:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout"
Feb 19 14:06:37 node1 consul[14724]: 2021-02-19T14:06:37.172Z [ERROR] agent.http: Request error: method=PUT url=/v1/kv/kv/myapp/.lock?flags=3304740253564472344&release=1e8876a1-9a0c-da6a-a8b0-c8a1d61a8d94 from=127.0.0.1:64996 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout"
Feb 19 14:06:51 node1 consul[14724]: 2021-02-19T14:06:51.639Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.4.0.124:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.4.0.124:8300: i/o timeout"
Feb 19 14:06:51 node1 consul[14724]: 2021-02-19T14:06:51.642Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.4.0.124:8300: i/o timeout"
Feb 19 14:07:08 node1 consul[14724]: 2021-02-19T14:07:08.042Z [INFO]  agent.client.memberlist.lan: memberlist: Suspect node-sanitised-xxx has failed, no acks received
Feb 19 14:07:11 node1 consul[14724]: 2021-02-19T14:07:11.960Z [ERROR] agent.client: RPC failed to server: method=Catalog.NodeServiceList server=10.2.0.188:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.188:8300: i/o timeout"
Feb 19 14:07:11 node1 consul[14724]: 2021-02-19T14:07:11.962Z [ERROR] agent.anti_entropy: failed to sync remote state: error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.188:8300: i/o timeout"
Feb 19 14:07:11 node1 consul[14724]: 2021-02-19T14:07:11.960Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.2.0.188:8300 error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Feb 19 14:07:11 node1 consul[14724]: 2021-02-19T14:07:11.966Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: rpc error: lead thread didn't get connection"
Feb 19 14:07:39 node1 consul[14724]: 2021-02-19T14:07:39.515Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.20.0.234:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.20.0.234:8300: i/o timeout"
Feb 19 14:07:39 node1 consul[14724]: 2021-02-19T14:07:39.517Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.20.0.234:8300: i/o timeout"
Feb 19 14:07:48 node1 consul[14724]: 2021-02-19T14:07:48.048Z [INFO]  agent.client.memberlist.lan: memberlist: Suspect node-sanitised-xxx has failed, no acks received
Feb 19 14:08:04 node1 consul[14724]: 2021-02-19T14:08:04.942Z [ERROR] agent.client: RPC failed to server: method=Coordinate.Update server=10.2.0.1:8300 error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout"
Feb 19 14:08:04 node1 consul[14724]: 2021-02-19T14:08:04.944Z [ERROR] agent: Coordinate update error: error="rpc error getting client: failed to get conn: dial tcp 10.0.9.197:0->10.2.0.1:8300: i/o timeout"

3- Meanwhile, on node2:

$ ./node2.sh
Setting up lock at path: kv/myapp/.lock
Attempting lock acquisition
Starting handler
Fri 19 Feb 14:06:08 GMT 2021 - started!

Therefore we can see that the app on node2 started at 14:06:08, while the app on node1 was not killed until 14:06:27-28, so we had about 20s of the app running in both places while supposedly under the protection of a mutex.

4- Any help on this would be appreciated. If the answer is that Consul locking cannot fully guarantee mutual exclusion, and/or relies on other operations taking place (such as heartbeats or reads/writes to the KV store), then we would like to understand that before continuing to consider Consul for this purpose.

Also, if consul lock can be used with a more robust session, for example by adding further checks or tweaking the timeouts, that could perhaps provide a workaround for this behaviour.
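
As a starting point for that tweaking, the lock monitor's retry behaviour looks like the first knob to try; for example (a sketch only, untested here, and it is not clear whether it closes the whole gap):

    # Disable lock-monitor retries so an error while watching the lock key is
    # treated as "lock lost" immediately instead of being retried (default is 3 retries).
    consul lock -monitor-retry=0 --name lock-node1 --verbose kv/myapp/.lock myapp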