hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Define consul lock communication more explicitly... #985

Open sean- opened 9 years ago

sean- commented 9 years ago

From the docs (https://consul.io/docs/commands/lock.html)

If the lock is lost or communication is disrupted, the child process is terminated.

What exactly does this mean? Does it mean the TCP connection between the agent holding the lock and the server with leader status is interrupted? How does "communication is disrupted" interact with Serf for liveness?

If there is a partition between the agent holding the lock (AgentA) and the server leader, but AgentA is still on the network and able to be contacted via Serf by other agents in the data center, what happens? Said another way, if AgentA can't talk to the Server, but AgentB can reach both AgentA and the Server, how does the system handle this degraded state?

highlyunavailable commented 9 years ago

What it means exactly is that if the lock is lost, the child process will be terminated. The code takes the following steps:

  1. Acquire the lock, which means creating a Session, then using the KV.Acquire function to apply that sessionID to a key. For a lock, the session is just a session with a TTL that is then renewed with Session.RenewPeriodic, which does exactly what the name implies. A semaphore has a couple more steps, but they're mostly bookkeeping and don't really change how the lock can be lost. (A sketch of these calls follows this list.)
  2. Run the child process.
  3. If the lock channel closes before a shutdown is requested or the child process completes, terminate the child process with SIGTERM -> wait -> SIGKILL.
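
Here is a minimal sketch of step 1 using the Go API client (github.com/hashicorp/consul/api). This is an illustration only, not the actual command/lock.go code; the key name test/.lock and the 15s TTL mirror what consul lock test ... uses, and error handling is abbreviated:

    package main

    import (
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Create a session with a TTL; if the session is not renewed before
        // the TTL expires, Consul invalidates it and releases its locks.
        sessionID, _, err := client.Session().Create(&api.SessionEntry{
            Name: "lock-example",
            TTL:  "15s",
        }, nil)
        if err != nil {
            log.Fatal(err)
        }

        // Renew the session in the background until doneCh is closed.
        doneCh := make(chan struct{})
        go client.Session().RenewPeriodic("15s", sessionID, nil, doneCh)

        // Apply the session to the key; this is the actual lock acquisition.
        acquired, _, err := client.KV().Acquire(&api.KVPair{
            Key:     "test/.lock",
            Session: sessionID,
        }, nil)
        if err != nil || !acquired {
            log.Fatalf("failed to acquire lock: %v", err)
        }

        // ... run the child process here (step 2) and watch for lock loss (step 3) ...

        // Cleanup: release the key, stop renewing, destroy the session.
        client.KV().Release(&api.KVPair{Key: "test/.lock", Session: sessionID}, nil)
        close(doneCh)
        client.Session().Destroy(sessionID, nil)
    }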

The lock channel can be closed in the following 3 cases, all of which apply to a Semaphore as well (see the sketch after this list):

  1. The agent cannot update the session with a server (not necessarily the leader) within the TTL timeout (regardless of the Serf health check). Communication errors are actually not a problem as long as the agent can reconnect to a server within the session TTL, which is 15 seconds by default.
  2. An admin or process explicitly invalidates (using Session.Destroy) the session that is used to hold the key.
  3. The key that is being locked on is deleted by an admin or process (acquiring a key with a session does not prevent deletion of said key).
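
From the holder's side, the higher-level api.Lock helper wraps the Session/KV.Acquire steps above and exposes the lock channel directly. Below is a minimal sketch (the key name is an assumption) of waiting on that channel; any of the three cases above, for example an operator calling Session.Destroy or deleting the key, will close it:

    package main

    import (
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // The helper creates the session, renews it, and acquires the key.
        lock, err := client.LockOpts(&api.LockOptions{Key: "test/.lock"})
        if err != nil {
            log.Fatal(err)
        }

        // Lock blocks until acquired; lostCh is the "lock channel".
        lostCh, err := lock.Lock(nil)
        if err != nil {
            log.Fatal(err)
        }
        defer lock.Unlock()

        // Closed when the session cannot be renewed within its TTL, the
        // session is destroyed, or the lock key is deleted.
        <-lostCh
        log.Println("lock lost: stop the protected work")
    }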

So, to answer each of your questions one by one:

sean- commented 9 years ago

If there is a partition between the agent holding the lock (AgentA) and the server leader, but AgentA is still on the network and able to be contacted via Serf by other agents in the data center, what happens? Said another way, if AgentA can't talk to the Server, but AgentB can reach both AgentA and the Server, how does the system handle this degraded state?

The system doesn't handle this; the agent must be able to talk to a server to update the state (TTL), but it shouldn't need to talk to the leader explicitly, just a server.

Thank you for the above clarification. If AgentA is able to reach ServerA and ServerB, but ServerC is the leader and partitioned off only from AgentA (i.e. ServerC is still on the network, but there is some transient connectivity issue), will AgentA attempt to refresh its lock by connecting to either ServerA or ServerB automatically?

My understanding is that if all servers are online and able to communicate, a communication failure between AgentA and ServerC won't start a new Raft election or a new term. What is less clear to me is whether newClient() is called again to reset client.config.Address. I don't see any wrapper around the client that will assign a new Address in the event of a partition.

https://github.com/hashicorp/consul/blob/master/command/lock.go#L109

Does that imply that a partition between AgentA and ServerC will result in loss of the lock if the partition lasts longer than 15s? I'm not suggesting or requesting that the client make a thundering-herd attempt to talk to all servers, so much as looking for clarification on what failure modes exist.

highlyunavailable commented 9 years ago

I'm going to try this out and report back.

highlyunavailable commented 9 years ago

Okay, so this is very interesting and behaves super differently in a variety of scenarios:

  1. I set up 3 servers, got them into a cluster.
  2. I set up 1 agent, joined it to the cluster.
  3. I then did consul lock test "sleep 60"

Now, this is where it varies (TL;DR scroll to the end to watch the lock be violated):

Scenario 1: Cleanly exiting the leader with leave_on_terminate at its default value of true, meaning the server shuts itself down and sends a message to remove itself from Raft. This isn't the same as a partition, but it can easily happen.

I got the following result:

vagrant@agentA:~$ ./consul lock test "sleep 60"
Error running handler: signal: terminated
signal: terminated
Lock release failed: failed to release lock: Unexpected response code: 500 (rpc error: No cluster leader)

I immediately tried running the same command and got the following error until the election finished:

vagrant@agentA:~$ ./consul lock test "sleep 60"
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: connection is shut down)
vagrant@agentA:~$

I can see not being able to create a lock during an election, so I'm actually fine with this, except that the error message is a bit unclear.


Scenario 2:

I ran pkill -9 consul on the leader box after starting a new round of "sleep 60".

Results:

Serf immediately detected the failure of the server, but there were 0 errors from the agent and the 60 seconds finished cleanly. However, because of what you'll see in scenario 3, the agent was clearly connected to some other server, not the leader. The output was as follows:

    2015/06/25 03:02:36 [DEBUG] http: Request /v1/session/create (6.70589ms)
    2015/06/25 03:02:36 [DEBUG] http: Request /v1/kv/test/.lock?wait=15000ms (1.578075ms)
    2015/06/25 03:02:36 [DEBUG] http: Request /v1/kv/test/.lock?acquire=9ea91012-04a2-973e-2199-61530220adbd&flags=3304740253564472344 (3.492827ms)
    2015/06/25 03:02:36 [DEBUG] http: Request /v1/kv/test/.lock?consistent= (2.779486ms)
    2015/06/25 03:02:43 [DEBUG] http: Request /v1/session/renew/9ea91012-04a2-973e-2199-61530220adbd (1.77661ms)
    2015/06/25 03:02:45 [DEBUG] memberlist: TCP connection from: 192.168.33.12:54850
    2015/06/25 03:02:51 [DEBUG] http: Request /v1/session/renew/9ea91012-04a2-973e-2199-61530220adbd (1.587599ms)
    2015/06/25 03:02:53 [INFO] memberlist: Suspect serverA has failed, no acks received
    2015/06/25 03:02:56 [INFO] memberlist: Suspect serverA has failed, no acks received
    2015/06/25 03:02:58 [INFO] memberlist: Suspect serverA has failed, no acks received
    2015/06/25 03:02:58 [INFO] memberlist: Marking serverA as failed, suspect timeout reached
    2015/06/25 03:02:58 [INFO] serf: EventMemberFailed: serverA 192.168.33.11
    2015/06/25 03:02:58 [INFO] consul: removing server serverA (Addr: 192.168.33.11:8300) (DC: dc1)
    2015/06/25 03:02:58 [DEBUG] http: Request /v1/session/renew/9ea91012-04a2-973e-2199-61530220adbd (1.727909ms)
    2015/06/25 03:02:59 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.33.12:8301
    2015/06/25 03:03:01 [DEBUG] serf: forgoing reconnect for random throttling
    2015/06/25 03:03:06 [DEBUG] http: Request /v1/session/renew/9ea91012-04a2-973e-2199-61530220adbd (1.68642ms)
    2015/06/25 03:03:13 [DEBUG] http: Request /v1/session/renew/9ea91012-04a2-973e-2199-61530220adbd (1.436247ms)
    2015/06/25 03:03:15 [DEBUG] memberlist: TCP connection from: 192.168.33.12:54876
    2015/06/25 03:03:21 [DEBUG] http: Request /v1/session/renew/9ea91012-04a2-973e-2199-61530220adbd (2.342892ms)
    2015/06/25 03:03:28 [DEBUG] http: Request /v1/session/renew/9ea91012-04a2-973e-2199-61530220adbd (2.402064ms)
    2015/06/25 03:03:29 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.33.13:8301
    2015/06/25 03:03:31 [DEBUG] serf: forgoing reconnect for random throttling
    2015/06/25 03:03:36 [DEBUG] http: Request /v1/kv/test/.lock?flags=3304740253564472344&release=9ea91012-04a2-973e-2199-61530220adbd (4.957429ms)
    2015/06/25 03:03:36 [DEBUG] http: Request /v1/kv/test/.lock?consistent=&index=104 (1m0.005539804s)
    2015/06/25 03:03:36 [DEBUG] http: Request /v1/kv/test/.lock (2.310693ms)
    2015/06/25 03:03:36 [DEBUG] http: Request /v1/session/destroy/9ea91012-04a2-973e-2199-61530220adbd (6.032076ms)
    2015/06/25 03:03:36 [DEBUG] http: Request /v1/kv/test/.lock?cas=108 (4.490823ms)

Scenario 3 (The one you're interested in):

I started a new consul lock sleep test and started running sudo iptables -I INPUT -s 192.168.33.10 -j DROP on boxes until I figured out which one the agent was talking to.

This is some strange behavior that I think @armon or someone else that wrote this needs to comment on.

Here's what happened, with consul communicating with serverB:

    2015/06/25 03:15:05 [DEBUG] http: Request /v1/session/create (5.305826ms)
    2015/06/25 03:15:05 [DEBUG] http: Request /v1/kv/test/.lock?wait=15000ms (1.212142ms)
    2015/06/25 03:15:05 [DEBUG] http: Request /v1/kv/test/.lock?acquire=46e20bb7-f59d-8e4a-8484-a12d9b76d843&flags=3304740253564472344 (3.735569ms)
    2015/06/25 03:15:05 [DEBUG] http: Request /v1/kv/test/.lock?consistent= (1.380977ms)
    2015/06/25 03:15:09 [ERR] memberlist: Push/Pull with serverC failed: dial tcp 192.168.33.13:8301: i/o timeout
    2015/06/25 03:15:13 [DEBUG] http: Request /v1/session/renew/46e20bb7-f59d-8e4a-8484-a12d9b76d843 (1.408075ms)
    2015/06/25 03:15:18 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/06/25 03:15:20 [DEBUG] http: Request /v1/session/renew/46e20bb7-f59d-8e4a-8484-a12d9b76d843 (966.864µs)
    2015/06/25 03:15:23 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
    2015/06/25 03:15:23 [INFO] serf: EventMemberFailed: serverB 192.168.33.12
    2015/06/25 03:15:23 [INFO] consul: removing server serverB (Addr: 192.168.33.12:8300) (DC: dc1)
    2015/06/25 03:15:35 [INFO] memberlist: Suspect serverA has failed, no acks received
    2015/06/25 03:15:38 [WARN] memberlist: Refuting a suspect message (from: serverA)
    2015/06/25 03:15:39 [DEBUG] memberlist: Initiating push/pull sync with: 192.168.33.13:8301
    2015/06/25 03:15:41 [INFO] serf: EventMemberJoin: serverB 192.168.33.12
    2015/06/25 03:15:41 [INFO] consul: adding server serverB (Addr: 192.168.33.12:8300) (DC: dc1)
    2015/06/25 03:15:55 [INFO] memberlist: Suspect serverB has failed, no acks received
    2015/06/25 03:16:00 [INFO] memberlist: Marking serverB as failed, suspect timeout reached
    2015/06/25 03:16:00 [INFO] serf: EventMemberFailed: serverB 192.168.33.12
    2015/06/25 03:16:00 [INFO] consul: removing server serverB (Addr: 192.168.33.12:8300) (DC: dc1)

There are 2 problems here:

  1. Consul lock hangs, and depending on which server died there's a chance it won't even show the "Error running handler: signal: terminated" message. It doesn't respond to SIGTERM at all until the _child process exits_; it needs to be killed with kill -9. I suspect a deadlock waiting on a channel somewhere.
vagrant@agentA:~$ ./consul lock test "sleep 60"
Error running handler: signal: terminated
signal: terminated

^C^C^C^C
  2. The agent does not try to reconnect to another server and re-establish the session renewal. This wouldn't be a big deal except that if it can't renew the session, the session is destroyed when the TTL expires, and, because of problem 1, the child process is still left running. This seems problematic.

I also replicated it on a single server/single agent - same problem, but this time it seemed to detect the failure a bit better due to the complete lack of consul servers. It still left the child process running though!
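
For reference, here is a minimal sketch of the SIGTERM -> wait -> SIGKILL teardown that problem 1 shows going wrong. This is not the actual command/lock.go code; the 5-second grace period and the stand-in lock channel are assumptions for illustration:

    package main

    import (
        "os"
        "os/exec"
        "os/signal"
        "syscall"
        "time"
    )

    // terminate asks the child to exit, then force-kills it if it is still
    // running after the grace period.
    func terminate(child *exec.Cmd, childDone <-chan error, grace time.Duration) {
        _ = child.Process.Signal(syscall.SIGTERM)
        select {
        case <-childDone:
            // Child exited on its own after SIGTERM.
        case <-time.After(grace):
            _ = child.Process.Kill() // SIGKILL
            <-childDone
        }
    }

    func main() {
        child := exec.Command("sleep", "60")
        if err := child.Start(); err != nil {
            panic(err)
        }

        childDone := make(chan error, 1)
        go func() { childDone <- child.Wait() }()

        sigCh := make(chan os.Signal, 1)
        signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

        lockLostCh := make(chan struct{}) // stands in for the real lock channel

        select {
        case <-childDone:
            // Child finished normally; nothing to clean up.
        case <-lockLostCh:
            // Lock lost: terminate the child.
            terminate(child, childDone, 5*time.Second)
        case <-sigCh:
            // Operator hit Ctrl-C or sent SIGTERM to the parent.
            terminate(child, childDone, 5*time.Second)
        }
    }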

armon commented 9 years ago

@highlyunavailable Thanks for finding this! My guess is that there is a channel blocking somewhere in the teardown path as well. I've tagged this as a bug, since it should just kill the child process in scenario 3.

armon commented 9 years ago

@sean- If the client is partitioned off from one of the servers, which happens to be the leader, then things should still work. Clients pick a random server to talk to for RPCs for load balancing, and the servers do internal request forwarding if they are not the leader. Given some time, the client should detect that server as partitioned via Serf and remove it from the list of eligible servers. So when the client makes an RPC call, it will be to one of the non-partitioned servers, and that server should be able to forward to the leader. At least, that's how it should work :)
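
To make that concrete, here is a rough sketch of the pattern described above: pick a random live server for each RPC and drop servers that Serf reports as failed. This is an illustration only, not Consul's actual client code, and all names in it are hypothetical:

    package main

    import (
        "errors"
        "log"
        "math/rand"
    )

    type rpcFunc func(server string) error

    // clientRouter tracks the servers the agent currently believes are alive,
    // kept up to date from Serf membership events.
    type clientRouter struct {
        servers []string
    }

    // call picks a random live server; if that server is not the leader, the
    // server forwards the request internally, so the client does not care.
    func (c *clientRouter) call(rpc rpcFunc) error {
        if len(c.servers) == 0 {
            return errors.New("no known servers")
        }
        server := c.servers[rand.Intn(len(c.servers))]
        return rpc(server)
    }

    // removeFailed is what a Serf "member failed" event would trigger.
    func (c *clientRouter) removeFailed(failed string) {
        for i, s := range c.servers {
            if s == failed {
                c.servers = append(c.servers[:i], c.servers[i+1:]...)
                return
            }
        }
    }

    func main() {
        r := &clientRouter{servers: []string{"serverA", "serverB", "serverC"}}
        r.removeFailed("serverC") // e.g. Serf marked serverC as failed
        err := r.call(func(server string) error {
            log.Printf("would send RPC to %s", server)
            return nil
        })
        if err != nil {
            log.Fatal(err)
        }
    }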

jeinwag commented 9 years ago

@armon So is the behaviour that highlyunavailable described in scenario 1 the intended one? Meaning that loss of the leader and election of a new one implies loss of all locks?

slackpad commented 7 years ago

This is related to https://github.com/hashicorp/consul/issues/1843 where we are talking about using the ability to update the session as a possible signal to give up the lock.