hashicorp / consul-replicate

Consul cross-DC KV replication daemon.
https://www.hashicorp.com/
Mozilla Public License 2.0

Lock acquisition failing #14

Closed calvn closed 8 years ago

calvn commented 9 years ago

I am running into the following issue:

Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: No cluster leader)

The clusters are set up like so: 5 server nodes in dc1 and 5 in dc2. For high availability I am running multiple instances of consul-replicate under consul lock via upstart in dc2, replicating a set of KVs from dc1. I verified leadership in both datacenters via the /v1/status/leader API call, and confirmed manually through consul info that the agent is indeed the leader.
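For reference, this is roughly the check I used (a rough sketch, assuming the agent's HTTP API on the default 127.0.0.1:8500; run it against an agent in each datacenter):

# Sketch: confirm a Raft leader is elected and which node holds it
curl http://127.0.0.1:8500/v1/status/leader

# On the node that should be the leader, consul info reports leader = true
consul info | grep leader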

armon commented 9 years ago

@cleung2010 Could it just be a blip that happened during a leader election? Or is this a more permanent error you are getting?

calvn commented 9 years ago

Somehow the service on all the nodes in dc2 had stopped overnight, with the logs ending on that message. I started them back up and they seem to be running fine now. I'll go ahead and close this issue and re-open if I observe a recurrence. Thanks for the quick reply!

calvn commented 9 years ago

@armon I am still seeing this issue. Here is a more comprehensive log:

Lock lost, killing child
Terminating child pid 21841
Error running handler: signal: terminated
signal: terminated
Child terminated
Cleanup succeeded
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)

Here is the upstart script for consul-replicate service which calls consul lock: https://github.com/Cimpress-MCP/puppet-consul_replicate/blob/master/templates/upstart.erb

armon commented 9 years ago

@cleung2010 The error "Check 'serfHealth' is in critical state" means that the node has a check in the critical state. This will prevent lock acquisition, because the cluster already suspects that node of having failed. If the lock was granted to a failed node, it would potentially deadlock the cluster. You need to ensure the networking setup is healthy and that running agents are not marked as critical.
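For example (a rough sketch, assuming the agent's HTTP API on 127.0.0.1:8500 and substituting your node name), you can inspect the checks the cluster has registered for the node and confirm whether serfHealth is critical:

# Sketch: list the node's checks as seen by the cluster; serfHealth should be "passing"
curl http://127.0.0.1:8500/v1/health/node/<node-name>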

calvn commented 9 years ago

I just double-checked the ports on both the servers and the client, and they have the required TCP/UDP ports open. The consul agent on the client is still running, even though the service using consul lock (i.e. consul-replicate) died.

calvn commented 9 years ago

This is the log from another consul agent running a different service, also under consul lock, that is experiencing the same issue. This agent also has all the necessary ports open.

2015/07/27 18:32:36 [INFO] memberlist: Marking prdqasicecns001 as failed, suspect timeout reached 
2015/07/27 18:32:36 [INFO] serf: EventMemberFailed: prdqasicecns001 10.66.30.221    
2015/07/27 18:32:36 [INFO] consul: removing server prdqasicecns001 (Addr: 10.66.30.221:8300) (DC: qas)
2015/07/27 18:32:40 [WARN] memberlist: Refuting a dead message (from: prdqasicecns001)
2015/07/27 18:32:40 [ERR] http: Request /v1/session/create?token=<hidden>, error: rpc error: rpc error: Check 'serfHealth' is in critical state
2015/07/27 18:32:40 [ERR] http: Request /v1/session/create?token=<hidden>, error: rpc error: rpc error: Check 'serfHealth' is in critical state    
2015/07/27 18:32:40 [ERR] http: Request /v1/session/create?token=<hidden>, error: rpc error: Check 'serfHealth' is in critical state
2015/07/27 18:32:40 [ERR] http: Request /v1/session/create?token=<hidden>, error: rpc error: Check 'serfHealth' is in critical state
2015/07/27 18:32:40 [ERR] http: Request /v1/session/create?token=<hidden>, error: rpc error: rpc error: Check 'serfHealth' is in critical state
2015/07/27 18:32:40 [ERR] http: Request /v1/session/create?token=<hidden>, error: rpc error: rpc error: Check 'serfHealth' is in critical state
2015/07/27 18:32:40 [ERR] http: Request /v1/session/create?token=<hidden>, error: rpc error: Check 'serfHealth' is in critical state
2015/07/27 18:32:43 [INFO] memberlist: Marking devqasicerdt002 as failed, suspect timeout reached
2015/07/27 18:32:43 [INFO] serf: EventMemberFailed: devqasicerdt002 10.66.28.227
2015/07/27 18:32:44 [INFO] serf: EventMemberFailed: devqasicegrk001 10.66.28.35
2015/07/27 18:32:46 [INFO] serf: EventMemberFailed: prdqasicearf002 10.66.28.58
2015/07/27 18:32:55 [INFO] serf: EventMemberJoin: prdqasicecns001 10.66.30.221

calvn commented 9 years ago

@armon these are the iptables rules on the server nodes:

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     tcp  --  10.66.0.0/16         anywhere             multiport dports 8301 /* 010 allow consul LAN TCP for 10.66.0.0/16 */
ACCEPT     tcp  --  127.0.0.0/8          anywhere             multiport dports 8301 /* 010 allow consul LAN TCP for 127.0.0.1/8 */
ACCEPT     udp  --  10.66.0.0/16         anywhere             multiport dports 8301 /* 010 allow consul LAN UDP for 10.66.0.0/16 */
ACCEPT     udp  --  127.0.0.0/8          anywhere             multiport dports 8301 /* 010 allow consul LAN UDP for 127.0.0.1/8 */
REJECT     tcp  --  anywhere             anywhere             multiport dports 8301 /* 020 block all other consul LAN TCP */ reject-with icmp-port-unreachable
REJECT     udp  --  anywhere             anywhere             multiport dports 8301 /* 020 block all other consul LAN UDP */ reject-with icmp-port-unreachable

slackpad commented 9 years ago

It's good that you have inbound TCP and UDP open for 8301; are there any rules that would affect outbound traffic on these machines? Also, could you please provide some more logs from before a node is marked as failed so we can try to see what's going on? Thanks!
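For instance, something like this (just a sketch; chain names depend on your setup) would show whether any outbound rules are dropping or rejecting the gossip and RPC ports:

# Sketch: dump outbound rules with packet counters; look for REJECT/DROP touching 8300-8302
iptables -L OUTPUT -n -v --line-numbers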

calvn commented 9 years ago

This is what I am getting from the logs of the service running consul lock (https://github.com/Cimpress-MCP/puppet-consul_replicate/blob/master/templates/upstart.erb). I believe upstart attempts to respawn 10 times before stopping the service, which is why there are ten attempts. I am wondering if I can spread out the retries enough so that a network blip does not stop the service but simply puts it back into the "Attempting lock acquisition" state.
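Something along these lines in the upstart config might do it (hypothetical stanzas; the exact limits would need tuning):

# Hypothetical: allow 10 respawns per 300s instead of upstart's default 10 per 5s,
# and wait a bit before each respawn so a short network blip can clear
respawn
respawn limit 10 300
post-stop exec sleep 15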

@slackpad there shouldn't be anything blocking the outbound traffic, but let me double check and see if I can grab some more logs.

Lock lost, killing child
Terminating child pid 21841
Error running handler: signal: terminated
signal: terminated
Child terminated
Cleanup succeeded
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)
Setting up lock at path: locks/replicate/.lock
Attempting lock acquisition
Lock acquisition failed: failed to create session: Unexpected response code: 500 (rpc error: Check 'serfHealth' is in critical state)

slackpad commented 9 years ago

Hi @cleung2010 - yeah we should take a look at the Consul logs on the agent node and the server to see when and why the cluster thinks the agent node is unhealthy.

calvn commented 9 years ago

@slackpad actually the consul-replicate service is running on the server node (so in this case the agent node is also the server). I got some consul logs from around the time the service stop was reported. Unfortunately the consul-replicate logs don't have timestamps (as seen in the previous comment's log), since consul lock -verbose doesn't include timestamps in its output. There is a lot of flappiness after 23:14:11, but I am not sure what is causing it. I truncated the logs, but that level of flappiness still persists at this point.

Side note: it would be nice if consul lock -verbose output a timestamp and a [DEBUG] tag, the way consul and consul-replicate do with the -log-level flag.
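As a workaround I may pipe the output through something that prepends timestamps, e.g. (a rough sketch; strftime and fflush require gawk, and the lock invocation here is just illustrative):

# Sketch: prepend a timestamp to each line of consul lock's verbose output
consul lock -verbose locks/replicate <command> 2>&1 | awk '{ print strftime("%Y/%m/%d %H:%M:%S"), $0; fflush() }'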

2015/08/17 23:09:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36196
2015/08/17 23:10:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36197
2015/08/17 23:10:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36199
2015/08/17 23:10:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36206
2015/08/17 23:11:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36207
2015/08/17 23:11:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36209
2015/08/17 23:11:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36218
2015/08/17 23:12:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36220
2015/08/17 23:12:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36222
2015/08/17 23:12:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36229
2015/08/17 23:13:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36230
2015/08/17 23:13:36 [INFO] memberlist: Suspect prdirlicecns005.irl has failed, no acks received
2015/08/17 23:13:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36232
2015/08/17 23:13:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36234
2015/08/17 23:14:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36235
2015/08/17 23:14:11 [INFO] memberlist: Suspect prdirlicecns002.irl has failed, no acks received
2015/08/17 23:14:12 [INFO] memberlist: Marking prdirlicecns004.irl as failed, suspect timeout reached
2015/08/17 23:14:12 [INFO] serf: EventMemberFailed: prdirlicecns004.irl 10.52.64.177
2015/08/17 23:14:12 [INFO] consul: removing server prdirlicecns004.irl (Addr: 10.52.64.177:8300) (DC: irl)
2015/08/17 23:14:15 [INFO] serf: EventMemberFailed: prdirlicecns001.irl 10.52.32.103
2015/08/17 23:14:15 [INFO] consul: removing server prdirlicecns001.irl (Addr: 10.52.32.103:8300) (DC: irl)
2015/08/17 23:14:16 [INFO] memberlist: Marking prdirlicecns003.irl as failed, suspect timeout reached
2015/08/17 23:14:16 [INFO] serf: EventMemberFailed: prdirlicecns003.irl 10.52.64.178
2015/08/17 23:14:16 [INFO] consul: removing server prdirlicecns003.irl (Addr: 10.52.64.178:8300) (DC: irl)
2015/08/17 23:14:19 [INFO] serf: EventMemberFailed: prdirlicecns005.irl 10.52.98.182
2015/08/17 23:14:19 [INFO] consul: removing server prdirlicecns005.irl (Addr: 10.52.98.182:8300) (DC: irl)
2015/08/17 23:14:20 [INFO] memberlist: Marking prdirlicecns002.irl as failed, suspect timeout reached
2015/08/17 23:14:20 [INFO] serf: EventMemberFailed: prdirlicecns002.irl 10.52.33.177
2015/08/17 23:14:20 [INFO] consul: removing server prdirlicecns002.irl (Addr: 10.52.33.177:8300) (DC: irl)
2015/08/17 23:14:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36237
2015/08/17 23:14:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36240
2015/08/17 23:15:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36241
2015/08/17 23:15:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36243
2015/08/17 23:15:48 [WARN] consul.rpc: RPC request for DC 'irl', no path found
2015/08/17 23:15:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36249
2015/08/17 23:16:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36251
2015/08/17 23:16:24 [INFO] serf: EventMemberFailed: prdbhiicecns001.bhi 10.128.84.1
2015/08/17 23:16:24 [INFO] consul: removing server prdbhiicecns001.bhi (Addr: 10.128.84.1:8300) (DC: bhi)
2015/08/17 23:16:26 [INFO] serf: EventMemberFailed: prdwtricecns004.wtr 10.38.84.27
2015/08/17 23:16:26 [INFO] consul: removing server prdwtricecns004.wtr (Addr: 10.38.84.27:8300) (DC: wtr)
2015/08/17 23:16:27 [INFO] serf: EventMemberFailed: prddpkicecns004.dpk 10.61.84.4
2015/08/17 23:16:27 [INFO] consul: removing server prddpkicecns004.dpk (Addr: 10.61.84.4:8300) (DC: dpk)
2015/08/17 23:16:27 [INFO] serf: EventMemberJoin: prdirlicecns003.irl 10.52.64.178
2015/08/17 23:16:27 [INFO] consul: adding server prdirlicecns003.irl (Addr: 10.52.64.178:8300) (DC: irl)
2015/08/17 23:16:28 [INFO] serf: EventMemberJoin: prdirlicecns004.irl 10.52.64.177
2015/08/17 23:16:28 [INFO] consul: adding server prdirlicecns004.irl (Addr: 10.52.64.177:8300) (DC: irl)
2015/08/17 23:16:29 [INFO] serf: EventMemberJoin: prdirlicecns002.irl 10.52.33.177
2015/08/17 23:16:29 [INFO] consul: adding server prdirlicecns002.irl (Addr: 10.52.33.177:8300) (DC: irl)
2015/08/17 23:16:29 [INFO] serf: EventMemberJoin: prdirlicecns005.irl 10.52.98.182
2015/08/17 23:16:29 [INFO] consul: adding server prdirlicecns005.irl (Addr: 10.52.98.182:8300) (DC: irl)
2015/08/17 23:16:31 [INFO] serf: EventMemberFailed: prdkisicecns004.kis 10.131.84.32
2015/08/17 23:16:31 [INFO] consul: removing server prdkisicecns004.kis (Addr: 10.131.84.32:8300) (DC: kis)
2015/08/17 23:16:31 [INFO] serf: EventMemberJoin: prdirlicecns001.irl 10.52.32.103
2015/08/17 23:16:31 [INFO] consul: adding server prdirlicecns001.irl (Addr: 10.52.32.103:8300) (DC: irl)
2015/08/17 23:16:32 [INFO] serf: EventMemberFailed: prdwndicecns001.wnd 10.71.84.38
2015/08/17 23:16:32 [INFO] consul: removing server prdwndicecns001.wnd (Addr: 10.71.84.38:8300) (DC: wnd)
2015/08/17 23:16:34 [INFO] serf: EventMemberJoin: prddpkicecns004.dpk 10.61.84.4
2015/08/17 23:16:34 [INFO] consul: adding server prddpkicecns004.dpk (Addr: 10.61.84.4:8300) (DC: dpk)
2015/08/17 23:16:36 [INFO] serf: EventMemberFailed: prddpkicecns005.dpk 10.61.84.5
2015/08/17 23:16:36 [INFO] serf: EventMemberFailed: prddpkicecns002.dpk 10.61.84.2
2015/08/17 23:16:36 [INFO] consul: removing server prddpkicecns005.dpk (Addr: 10.61.84.5:8300) (DC: dpk)
2015/08/17 23:16:36 [INFO] consul: removing server prddpkicecns002.dpk (Addr: 10.61.84.2:8300) (DC: dpk)
2015/08/17 23:16:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36253
2015/08/17 23:16:38 [INFO] serf: EventMemberJoin: prddpkicecns002.dpk 10.61.84.2
2015/08/17 23:16:38 [INFO] consul: adding server prddpkicecns002.dpk (Addr: 10.61.84.2:8300) (DC: dpk)
2015/08/17 23:16:41 [INFO] serf: EventMemberFailed: prdlexicecns005.lex 10.89.184.210
2015/08/17 23:16:41 [INFO] consul: removing server prdlexicecns005.lex (Addr: 10.89.184.210:8300) (DC: lex)
2015/08/17 23:16:43 [INFO] serf: EventMemberJoin: prdwndicecns001.wnd 10.71.84.38
2015/08/17 23:16:43 [INFO] consul: adding server prdwndicecns001.wnd (Addr: 10.71.84.38:8300) (DC: wnd)
2015/08/17 23:16:46 [INFO] serf: EventMemberFailed: prdkisicecns005.kis 10.131.84.33
2015/08/17 23:16:46 [INFO] consul: removing server prdkisicecns005.kis (Addr: 10.131.84.33:8300) (DC: kis)
2015/08/17 23:16:49 [INFO] serf: EventMemberJoin: prdkisicecns005.kis 10.131.84.33
2015/08/17 23:16:49 [INFO] serf: EventMemberFailed: prdwtricecns005.wtr 10.38.84.31
2015/08/17 23:16:49 [INFO] consul: adding server prdkisicecns005.kis (Addr: 10.131.84.33:8300) (DC: kis)
2015/08/17 23:16:49 [INFO] consul: removing server prdwtricecns005.wtr (Addr: 10.38.84.31:8300) (DC: wtr)
2015/08/17 23:16:49 [WARN] memberlist: Refuting a suspect message (from: prddpkicecns004.dpk)
2015/08/17 23:16:50 [INFO] serf: EventMemberJoin: prdwtricecns004.wtr 10.38.84.27
2015/08/17 23:16:50 [INFO] consul: adding server prdwtricecns004.wtr (Addr: 10.38.84.27:8300) (DC: wtr)
2015/08/17 23:16:50 [INFO] serf: EventMemberJoin: prddpkicecns005.dpk 10.61.84.5
2015/08/17 23:16:50 [INFO] consul: adding server prddpkicecns005.dpk (Addr: 10.61.84.5:8300) (DC: dpk)
2015/08/17 23:16:51 [INFO] serf: EventMemberFailed: prdqasicecns003.qas 10.66.12.149
2015/08/17 23:16:51 [INFO] consul: removing server prdqasicecns003.qas (Addr: 10.66.12.149:8300) (DC: qas)
2015/08/17 23:16:52 [INFO] serf: EventMemberJoin: prdqasicecns003.qas 10.66.12.149
2015/08/17 23:16:52 [INFO] consul: adding server prdqasicecns003.qas (Addr: 10.66.12.149:8300) (DC: qas)
2015/08/17 23:16:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36257
2015/08/17 23:16:54 [INFO] serf: EventMemberFailed: prdlexicecns001.lex 10.89.184.71
2015/08/17 23:16:54 [INFO] consul: removing server prdlexicecns001.lex (Addr: 10.89.184.71:8300) (DC: lex)
2015/08/17 23:16:59 [INFO] serf: EventMemberJoin: prdbhiicecns001.bhi 10.128.84.1
2015/08/17 23:16:59 [INFO] consul: adding server prdbhiicecns001.bhi (Addr: 10.128.84.1:8300) (DC: bhi)
2015/08/17 23:17:00 [INFO] serf: EventMemberJoin: prdwtricecns005.wtr 10.38.84.31
2015/08/17 23:17:00 [INFO] consul: adding server prdwtricecns005.wtr (Addr: 10.38.84.31:8300) (DC: wtr)
2015/08/17 23:17:02 [INFO] serf: EventMemberFailed: prdqasicecns005.qas 10.66.47.141
2015/08/17 23:17:02 [INFO] serf: EventMemberFailed: prdqasicecns004.qas 10.66.47.159
2015/08/17 23:17:02 [INFO] serf: EventMemberFailed: prdkisicecns003.kis 10.131.84.31
2015/08/17 23:17:02 [INFO] serf: EventMemberFailed: prdbhiicecns003.bhi 10.128.84.3
2015/08/17 23:17:02 [INFO] serf: EventMemberFailed: prdirlicecns001.irl 10.52.32.103
2015/08/17 23:17:02 [INFO] consul: removing server prdqasicecns005.qas (Addr: 10.66.47.141:8300) (DC: qas)
2015/08/17 23:17:02 [INFO] consul: removing server prdqasicecns004.qas (Addr: 10.66.47.159:8300) (DC: qas)
2015/08/17 23:17:02 [INFO] consul: removing server prdkisicecns003.kis (Addr: 10.131.84.31:8300) (DC: kis)
2015/08/17 23:17:02 [INFO] consul: removing server prdbhiicecns003.bhi (Addr: 10.128.84.3:8300) (DC: bhi)
2015/08/17 23:17:02 [INFO] consul: removing server prdirlicecns001.irl (Addr: 10.52.32.103:8300) (DC: irl)
2015/08/17 23:17:03 [INFO] serf: EventMemberFailed: prdbdaicecns003.bda 10.10.84.170
2015/08/17 23:17:03 [INFO] serf: EventMemberFailed: prdwndicecns003.wnd 10.71.84.40
2015/08/17 23:17:03 [INFO] consul: removing server prdbdaicecns003.bda (Addr: 10.10.84.170:8300) (DC: bda)
2015/08/17 23:17:03 [INFO] consul: removing server prdwndicecns003.wnd (Addr: 10.71.84.40:8300) (DC: wnd)
2015/08/17 23:17:03 [INFO] serf: EventMemberJoin: prdbhiicecns003.bhi 10.128.84.3
2015/08/17 23:17:03 [INFO] consul: adding server prdbhiicecns003.bhi (Addr: 10.128.84.3:8300) (DC: bhi)
2015/08/17 23:17:04 [INFO] serf: EventMemberJoin: prdwndicecns003.wnd 10.71.84.40
2015/08/17 23:17:04 [INFO] consul: adding server prdwndicecns003.wnd (Addr: 10.71.84.40:8300) (DC: wnd)
2015/08/17 23:17:06 [INFO] serf: EventMemberJoin: prdqasicecns005.qas 10.66.47.141
2015/08/17 23:17:06 [INFO] serf: EventMemberJoin: prdkisicecns004.kis 10.131.84.32
2015/08/17 23:17:06 [INFO] consul: adding server prdqasicecns005.qas (Addr: 10.66.47.141:8300) (DC: qas)
2015/08/17 23:17:06 [INFO] consul: adding server prdkisicecns004.kis (Addr: 10.131.84.32:8300) (DC: kis)
2015/08/17 23:17:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36258
2015/08/17 23:17:16 [INFO] serf: EventMemberJoin: prdqasicecns004.qas 10.66.47.159
2015/08/17 23:17:16 [INFO] consul: adding server prdqasicecns004.qas (Addr: 10.66.47.159:8300) (DC: qas)
2015/08/17 23:17:17 [INFO] serf: EventMemberJoin: prdlexicecns001.lex 10.89.184.71
2015/08/17 23:17:17 [INFO] consul: adding server prdlexicecns001.lex (Addr: 10.89.184.71:8300) (DC: lex)
2015/08/17 23:17:19 [INFO] serf: EventMemberJoin: prdkisicecns003.kis 10.131.84.31
2015/08/17 23:17:19 [INFO] consul: adding server prdkisicecns003.kis (Addr: 10.131.84.31:8300) (DC: kis)
2015/08/17 23:17:21 [INFO] serf: EventMemberJoin: prdbdaicecns003.bda 10.10.84.170
2015/08/17 23:17:21 [INFO] consul: adding server prdbdaicecns003.bda (Addr: 10.10.84.170:8300) (DC: bda)
2015/08/17 23:17:22 [INFO] serf: EventMemberJoin: prdirlicecns001.irl 10.52.32.103
2015/08/17 23:17:22 [INFO] consul: adding server prdirlicecns001.irl (Addr: 10.52.32.103:8300) (DC: irl)
2015/08/17 23:17:31 [INFO] serf: EventMemberFailed: prdbdaicecns002.bda 10.10.84.169
2015/08/17 23:17:31 [INFO] consul: removing server prdbdaicecns002.bda (Addr: 10.10.84.169:8300) (DC: bda)
2015/08/17 23:17:33 [INFO] serf: EventMemberJoin: prdlexicecns005.lex 10.89.184.210
2015/08/17 23:17:33 [INFO] consul: adding server prdlexicecns005.lex (Addr: 10.89.184.210:8300) (DC: lex)
2015/08/17 23:17:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36261
2015/08/17 23:17:47 [INFO] serf: EventMemberJoin: prdbdaicecns002.bda 10.10.84.169
2015/08/17 23:17:47 [INFO] consul: adding server prdbdaicecns002.bda (Addr: 10.10.84.169:8300) (DC: bda)
2015/08/17 23:17:48 [INFO] serf: EventMemberFailed: prdwtricecns002.wtr 10.38.84.29
2015/08/17 23:17:48 [INFO] serf: EventMemberFailed: prdvenicecns002.ven 10.35.84.29
2015/08/17 23:17:48 [INFO] consul: removing server prdwtricecns002.wtr (Addr: 10.38.84.29:8300) (DC: wtr)
2015/08/17 23:17:48 [INFO] consul: removing server prdvenicecns002.ven (Addr: 10.35.84.29:8300) (DC: ven)
2015/08/17 23:17:52 [INFO] serf: EventMemberJoin: prdwtricecns002.wtr 10.38.84.29
2015/08/17 23:17:52 [INFO] consul: adding server prdwtricecns002.wtr (Addr: 10.38.84.29:8300) (DC: wtr)
2015/08/17 23:17:53 [INFO] serf: EventMemberFailed: prdbdaicecns004.bda 10.10.84.171
2015/08/17 23:17:53 [INFO] consul: removing server prdbdaicecns004.bda (Addr: 10.10.84.171:8300) (DC: bda)
2015/08/17 23:17:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36270
2015/08/17 23:17:54 [INFO] serf: EventMemberJoin: prdbdaicecns004.bda 10.10.84.171
2015/08/17 23:17:54 [INFO] consul: adding server prdbdaicecns004.bda (Addr: 10.10.84.171:8300) (DC: bda)
2015/08/17 23:18:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36272
2015/08/17 23:18:11 [INFO] memberlist: Marking prdqasicecns001.qas as failed, suspect timeout reached
2015/08/17 23:18:11 [INFO] serf: EventMemberFailed: prdqasicecns001.qas 10.66.30.221
2015/08/17 23:18:11 [INFO] memberlist: Marking prdwndicecns004.wnd as failed, suspect timeout reached
2015/08/17 23:18:11 [INFO] serf: EventMemberFailed: prdwndicecns004.wnd 10.71.84.41
2015/08/17 23:18:11 [INFO] consul: removing server prdqasicecns001.qas (Addr: 10.66.30.221:8300) (DC: qas)
2015/08/17 23:18:11 [INFO] consul: removing server prdwndicecns004.wnd (Addr: 10.71.84.41:8300) (DC: wnd)
2015/08/17 23:18:19 [INFO] serf: EventMemberJoin: prdvenicecns002.ven 10.35.84.29
2015/08/17 23:18:19 [INFO] consul: adding server prdvenicecns002.ven (Addr: 10.35.84.29:8300) (DC: ven)
2015/08/17 23:18:29 [INFO] memberlist: Marking prdirlicecns001.irl as failed, suspect timeout reached
2015/08/17 23:18:29 [INFO] serf: EventMemberFailed: prdirlicecns001.irl 10.52.32.103
2015/08/17 23:18:29 [INFO] consul: removing server prdirlicecns001.irl (Addr: 10.52.32.103:8300) (DC: irl)
2015/08/17 23:18:35 [INFO] serf: EventMemberJoin: prdqasicecns001.qas 10.66.30.221
2015/08/17 23:18:35 [INFO] consul: adding server prdqasicecns001.qas (Addr: 10.66.30.221:8300) (DC: qas)
2015/08/17 23:18:36 [INFO] serf: EventMemberFailed: prdwtricecns004.wtr 10.38.84.27
2015/08/17 23:18:36 [INFO] consul: removing server prdwtricecns004.wtr (Addr: 10.38.84.27:8300) (DC: wtr)
2015/08/17 23:18:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36274
2015/08/17 23:18:43 [INFO] memberlist: Marking prdvenicecns004.ven as failed, suspect timeout reached
2015/08/17 23:18:43 [INFO] serf: EventMemberFailed: prdvenicecns004.ven 10.35.84.31
2015/08/17 23:18:43 [INFO] consul: removing server prdvenicecns004.ven (Addr: 10.35.84.31:8300) (DC: ven)
2015/08/17 23:18:46 [INFO] serf: EventMemberFailed: prdirlicecns002.irl 10.52.33.177
2015/08/17 23:18:46 [INFO] consul: removing server prdirlicecns002.irl (Addr: 10.52.33.177:8300) (DC: irl)
2015/08/17 23:18:49 [INFO] serf: EventMemberJoin: prdirlicecns002.irl 10.52.33.177
2015/08/17 23:18:49 [INFO] consul: adding server prdirlicecns002.irl (Addr: 10.52.33.177:8300) (DC: irl)
2015/08/17 23:18:51 [INFO] serf: EventMemberFailed: prdqasicecns005.qas 10.66.47.141
2015/08/17 23:18:51 [INFO] consul: removing server prdqasicecns005.qas (Addr: 10.66.47.141:8300) (DC: qas)
2015/08/17 23:18:52 [INFO] serf: EventMemberFailed: prdqasicecns003.qas 10.66.12.149
2015/08/17 23:18:52 [INFO] consul: removing server prdqasicecns003.qas (Addr: 10.66.12.149:8300) (DC: qas)
2015/08/17 23:18:52 [INFO] serf: attempting reconnect to prdwtricecns004.wtr 10.38.84.27:8302
2015/08/17 23:18:53 [INFO] serf: EventMemberJoin: prdwtricecns004.wtr 10.38.84.27
2015/08/17 23:18:53 [INFO] consul: adding server prdwtricecns004.wtr (Addr: 10.38.84.27:8300) (DC: wtr)
2015/08/17 23:18:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36278
2015/08/17 23:18:54 [INFO] serf: EventMemberJoin: prdwndicecns004.wnd 10.71.84.41
2015/08/17 23:18:54 [INFO] consul: adding server prdwndicecns004.wnd (Addr: 10.71.84.41:8300) (DC: wnd)
2015/08/17 23:18:56 [INFO] serf: EventMemberJoin: prdqasicecns005.qas 10.66.47.141
2015/08/17 23:18:56 [INFO] serf: EventMemberFailed: prdkisicecns001.kis 10.131.84.29
2015/08/17 23:18:56 [INFO] consul: adding server prdqasicecns005.qas (Addr: 10.66.47.141:8300) (DC: qas)
2015/08/17 23:18:56 [INFO] consul: removing server prdkisicecns001.kis (Addr: 10.131.84.29:8300) (DC: kis)
2015/08/17 23:18:57 [INFO] serf: EventMemberJoin: prdkisicecns001.kis 10.131.84.29
2015/08/17 23:18:57 [INFO] serf: EventMemberJoin: prdvenicecns004.ven 10.35.84.31
2015/08/17 23:18:57 [INFO] consul: adding server prdkisicecns001.kis (Addr: 10.131.84.29:8300) (DC: kis)
2015/08/17 23:18:57 [INFO] consul: adding server prdvenicecns004.ven (Addr: 10.35.84.31:8300) (DC: ven)
2015/08/17 23:19:10 [INFO] agent.rpc: Accepted client: 127.0.0.1:36279
2015/08/17 23:19:13 [INFO] serf: EventMemberJoin: prdqasicecns003.qas 10.66.12.149
2015/08/17 23:19:13 [INFO] consul: adding server prdqasicecns003.qas (Addr: 10.66.12.149:8300) (DC: qas)
2015/08/17 23:19:22 [INFO] serf: EventMemberJoin: prdirlicecns001.irl 10.52.32.103
2015/08/17 23:19:22 [INFO] consul: adding server prdirlicecns001.irl (Addr: 10.52.32.103:8300) (DC: irl)
2015/08/17 23:19:37 [INFO] agent.rpc: Accepted client: 127.0.0.1:36281
2015/08/17 23:19:53 [INFO] agent.rpc: Accepted client: 127.0.0.1:36283

calvn commented 9 years ago

@slackpad I was able to pinpoint the precise consul logs from when the cluster lost its leader (thankful for email alerting!).

2015/08/15 10:17:50 [INFO] agent.rpc: Accepted client: 127.0.0.1:37036
2015/08/15 10:18:06 [INFO] agent.rpc: Accepted client: 127.0.0.1:37038
2015/08/15 10:18:34 [INFO] agent.rpc: Accepted client: 127.0.0.1:37039
2015/08/15 10:18:50 [INFO] agent.rpc: Accepted client: 127.0.0.1:37041
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.30.221:8300 in 500.143217ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.47.141:8300 in 503.438101ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.30.221:8300 in 510.708009ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.47.141:8300 in 540.850776ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.47.159:8300 in 500.608697ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.30.221:8300 in 548.120684ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact quorum of nodes, stepping down
2015/08/15 10:18:52 [INFO] raft: Node at 10.66.28.125:8300 [Follower] entering Follower state
2015/08/15 10:18:52 [INFO] consul: cluster leadership lost
2015/08/15 10:18:52 [INFO] raft: aborting pipeline replication to peer 10.66.47.159:8300
2015/08/15 10:18:52 [INFO] raft: aborting pipeline replication to peer 10.66.30.221:8300
2015/08/15 10:18:52 [INFO] raft: aborting pipeline replication to peer 10.66.12.149:8300
2015/08/15 10:18:52 [INFO] raft: aborting pipeline replication to peer 10.66.47.141:8300
2015/08/15 10:18:52 [ERR] http: Request /v1/session/renew/3a5f1a2e-b8b4-ad8d-c869-c2b2997d51d2?token=<hidden>, error: No cluster leader
2015/08/15 10:18:53 [ERR] http: Request /v1/session/renew/3a5f1a2e-b8b4-ad8d-c869-c2b2997d51d2?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [WARN] raft: Heartbeat timeout reached, starting election
2015/08/15 10:18:54 [INFO] raft: Node at 10.66.28.125:8300 [Candidate] entering Candidate state
2015/08/15 10:18:54 [ERR] http: Request /v1/kv/locks/replicate/.lock?index=328128&token=<hidden>&wait=15000ms, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:54 [ERR] http: Request /v1/session/create?token=<hidden>, error: No cluster leader
2015/08/15 10:18:55 [WARN] raft: Election timeout reached, restarting election
2015/08/15 10:18:55 [INFO] raft: Node at 10.66.28.125:8300 [Candidate] entering Candidate state
2015/08/15 10:18:55 [INFO] raft: Duplicate RequestVote for same term: 113
2015/08/15 10:18:55 [INFO] raft: Election won. Tally: 3
2015/08/15 10:18:55 [INFO] raft: Node at 10.66.28.125:8300 [Leader] entering Leader state
2015/08/15 10:18:55 [INFO] consul: cluster leadership acquired
2015/08/15 10:18:55 [INFO] consul: New leader elected: prdqasicecns002
2015/08/15 10:18:55 [INFO] raft: pipelining replication to peer 10.66.47.159:8300
2015/08/15 10:18:55 [INFO] raft: pipelining replication to peer 10.66.12.149:8300
2015/08/15 10:18:55 [INFO] raft: pipelining replication to peer 10.66.30.221:8300
2015/08/15 10:18:55 [INFO] raft: pipelining replication to peer 10.66.47.141:8300
2015/08/15 10:19:05 [ERR] raft: Failed to heartbeat to 10.66.12.149:8300: read tcp 10.66.12.149:8300: i/o timeout
2015/08/15 10:19:06 [INFO] agent.rpc: Accepted client: 127.0.0.1:37057
2015/08/15 10:19:15 [ERR] raft: Failed to heartbeat to 10.66.12.149:8300: read tcp 10.66.12.149:8300: i/o timeout
2015/08/15 10:19:34 [INFO] agent.rpc: Accepted client: 127.0.0.1:37059
2015/08/15 10:19:50 [INFO] agent.rpc: Accepted client: 127.0.0.1:37062
slackpad commented 9 years ago

Hmm, from that last log it looks like TCP connections are dying. As a sanity check, can you take a look at this issue and see if you have any "skb rides the rocket" errors in syslog around the same time?

https://github.com/hashicorp/consul/issues/1154#issuecomment-133117548

The log before that shows memberlist pings failing; those are UDP, which I don't think is subject to the Xen bug, so I can't explain that part yet.
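A quick scan along these lines (a sketch; adjust log paths for your distro) would surface those errors:

# Sketch: look for the TSO/GSO "skb rides the rocket" symptom around the incident window
dmesg | grep -i 'rides the rocket'
grep -i 'rides the rocket' /var/log/kern.log /var/log/syslog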

calvn commented 9 years ago

@slackpad I don't see any "skb rides the rocket" errors in dmesg or /var/log/kern.log, but I did see the following output related to consul-replicate and potentially to this error:

[46296492.477890] init: consul main process (11387) terminated with status 1
[47193121.638505] init: consul-replicate main process ended, respawning
[47573559.661632] init: consul-replicate main process ended, respawning
[47739683.989435] init: consul-replicate main process ended, respawning
[47739684.039383] init: consul-replicate main process (14653) terminated with status 1
[47739684.039413] init: consul-replicate main process ended, respawning
[47739684.089172] init: consul-replicate main process (14662) terminated with status 1
[47739684.089193] init: consul-replicate main process ended, respawning
[47739684.138931] init: consul-replicate main process (14668) terminated with status 1
[47739684.138963] init: consul-replicate main process ended, respawning
[47739684.188761] init: consul-replicate main process (14674) terminated with status 1
[47739684.188782] init: consul-replicate main process ended, respawning
[47739684.241374] init: consul-replicate main process (14680) terminated with status 1
[47739684.241398] init: consul-replicate main process ended, respawning
[47739684.292842] init: consul-replicate main process (14687) terminated with status 1
[47739684.292864] init: consul-replicate main process ended, respawning
[47739684.343908] init: consul-replicate main process (14694) terminated with status 1
[47739684.343932] init: consul-replicate main process ended, respawning
[47739684.395458] init: consul-replicate main process (14701) terminated with status 1
[47739684.395481] init: consul-replicate main process ended, respawning
[47739684.445590] init: consul-replicate main process (14707) terminated with status 1
[47739684.445621] init: consul-replicate main process ended, respawning
[47739684.496419] init: consul-replicate main process (14713) terminated with status 1
[47739684.496440] init: consul-replicate respawning too fast, stopped
[49556742.120958] init: consul main process (6197) terminated with status 1
[52585340.106021] type=1400 audit(1435163188.068:15): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/usr/sbin/named" pid=25382 comm="apparmor_parser"
[52585340.234605] type=1400 audit(1435163188.196:16): apparmor="DENIED" operation="open" profile="/usr/sbin/named" name="/etc/pbis/user-ignore" pid=25410 comm="named" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[52585345.021046] type=1400 audit(1435163192.984:17): apparmor="DENIED" operation="open" profile="/usr/sbin/named" name="/etc/pbis/user-ignore" pid=25504 comm="named" requested_mask="r" denied_mask="r" fsuid=0 ouid=0
[52585346.208842] type=1400 audit(1435163194.172:18): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/sbin/dhclient" pid=25546 comm="apparmor_parser"
[52585346.209069] type=1400 audit(1435163194.172:19): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/NetworkManager/nm-dhcp-client.action" pid=25546 comm="apparmor_parser"
[52585346.209241] type=1400 audit(1435163194.172:20): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/lib/connman/scripts/dhclient-script" pid=25546 comm="apparmor_parser"
[52585346.351189] type=1400 audit(1435163194.316:21): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/named" pid=25547 comm="apparmor_parser"
[52585346.529650] type=1400 audit(1435163194.492:22): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/usr/sbin/tcpdump" pid=25549 comm="apparmor_parser"
[54562028.130986] init: consul main process (2166) terminated with status 1
[57056086.320212] init: consul-replicate main process (29528) terminated with status 1
[57056086.320237] init: consul-replicate main process ended, respawning
[57056086.359579] init: consul-replicate main process (9508) terminated with status 1
[57056086.359600] init: consul-replicate main process ended, respawning
[57056086.397569] init: consul-replicate main process (9517) terminated with status 1
[57056086.397590] init: consul-replicate main process ended, respawning
[57056086.435411] init: consul-replicate main process (9524) terminated with status 1
[57056086.435432] init: consul-replicate main process ended, respawning
[57056086.472970] init: consul-replicate main process (9531) terminated with status 1
[57056086.472991] init: consul-replicate main process ended, respawning
[57056086.510526] init: consul-replicate main process (9537) terminated with status 1
[57056086.510547] init: consul-replicate main process ended, respawning
[57056086.573619] init: consul-replicate main process (9545) terminated with status 1
[57056086.573641] init: consul-replicate main process ended, respawning
[57056086.630783] init: consul-replicate main process (9555) terminated with status 1
[57056086.630805] init: consul-replicate main process ended, respawning
[57056086.688637] init: consul-replicate main process (9563) terminated with status 1
[57056086.688659] init: consul-replicate main process ended, respawning
[57056086.750700] init: consul-replicate main process (9572) terminated with status 1
[57056086.750723] init: consul-replicate main process ended, respawning
[57056086.810589] init: consul-replicate main process (9580) terminated with status 1
[57056086.810610] init: consul-replicate respawning too fast, stopped

slackpad commented 9 years ago

@cleung2010 It's hard to tell from those logs whether the high restart rate is part of the cause, or just a symptom of leadership being lost so it can't get the locks it needs. This part of the logs above makes me think there's still a networking problem at play:

2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.30.221:8300 in 500.143217ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.47.141:8300 in 503.438101ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.30.221:8300 in 510.708009ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.47.141:8300 in 540.850776ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.47.159:8300 in 500.608697ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact 10.66.30.221:8300 in 548.120684ms
2015/08/15 10:18:52 [WARN] raft: Failed to contact quorum of nodes, stepping down

The other possibility is that the CPU is starved, so Consul isn't meeting its timing requirements. Do you have any other data about those nodes at the time of these events that we might be able to correlate? Also, if you can reproduce it, perhaps we could get some tcpdump data to see what's going on with those network connections. If you want to send any data along privately you can contact support@hashicorp.com and CC james at hashicorp.com.
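Something like this on the affected servers would capture the relevant traffic for us to look at (a sketch; adjust the interface and stop the capture once the event reproduces):

# Sketch: capture Raft (8300/tcp) and Serf gossip (8301-8302, tcp+udp) to a pcap
tcpdump -i eth0 -s 0 -w consul-capture.pcap 'port 8300 or port 8301 or port 8302'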

sethvargo commented 8 years ago

Closing due to inactivity. Thanks!