hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

agent endpoint hanging #1534

Closed - shimzim closed this issue 4 years ago

shimzim commented 8 years ago

Following the discussion here: https://groups.google.com/d/topic/consul-tool/SJYfMMcE9xI/discussion

We are powering off all the nodes of a 5-node cluster and powering them back up (this is an instant power-off, not a graceful shutdown, so the nodes rejoin the cluster after we restart them). After the cluster crash, "consul members" looks fine, but in about 1 of every 3 cases the following HTTP requests hang:

http://localhost:8500/v1/agent/services
http://localhost:8500/v1/agent/checks

In comparison, the /v1/agent/self request works fine.
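
For reference, the symptom is easy to see with a small probe that applies a client-side timeout. This is just a sketch (the 2-second timeout is arbitrary), but in the bad state the hanging endpoints time out while /v1/agent/self answers immediately:

package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A client-side timeout makes a hung endpoint fail fast instead of blocking forever.
	client := &http.Client{Timeout: 2 * time.Second}
	for _, path := range []string{"/v1/agent/self", "/v1/agent/services", "/v1/agent/checks"} {
		resp, err := client.Get("http://localhost:8500" + path)
		if err != nil {
			fmt.Printf("%s: %v\n", path, err) // the hanging endpoints end up here
			continue
		}
		resp.Body.Close()
		fmt.Printf("%s: %s\n", path, resp.Status)
	}
}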

This happens with 0.6.0, but also with older versions (0.5.0).

Per @slackpad's recommendation, we collected debug info about the goroutines. If I run the debug/pprof/goroutine command a few times while the HTTP request is hanging, I see that the number of goroutines keeps growing.
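
The dumps were collected with something along these lines (a sketch; if I remember right the agent only serves /debug/pprof when debug endpoints are enabled via the enable_debug option, and debug=2 requests the verbose per-goroutine stack format):

package main

import (
	"io"
	"net/http"
	"os"
)

func main() {
	// Fetch a verbose goroutine dump from the agent's pprof handler and print it.
	resp, err := http.Get("http://localhost:8500/debug/pprof/goroutine?debug=2")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(os.Stdout, resp.Body)
}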

Attached is the latest output.

TIA. goroutines.txt

xakraz commented 8 years ago

Could it be related to #795?

slackpad commented 8 years ago

Hi @shimshon-stratoscale, sorry it took a while to get to this, but it looks like writes are not making any progress, and that's why things are stuck.

Here's one goroutine that has the local lock and is waiting on a catalog write to complete:

goroutine 161 [chan receive, 21 minutes]:
github.com/hashicorp/raft.(*deferError).Error(0xc820565ae0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/raft/future.go:56 +0xd5
github.com/hashicorp/consul/consul.(*Server).raftApply(0xc820001e00, 0x0, 0xc4e180, 0xc8202a64d0, 0x0, 0x0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/consul/rpc.go:293 +0x343
github.com/hashicorp/consul/consul.(*Catalog).Register(0xc82001e6d8, 0xc8202a64d0, 0x11457c8, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/consul/catalog_endpoint.go:67 +0x545
reflect.Value.call(0xb8f5e0, 0xc948b0, 0x13, 0xd073d0, 0x4, 0xc8207611c8, 0x3, 0x3, 0x0, 0x0, ...)
    /usr/local/go/src/reflect/value.go:432 +0x120a
reflect.Value.Call(0xb8f5e0, 0xc948b0, 0x13, 0xc8207611c8, 0x3, 0x3, 0x0, 0x0, 0x0)
    /usr/local/go/src/reflect/value.go:300 +0xb1
net/rpc.(*service).call(0xc8201e9380, 0xc8201e91c0, 0xc8204e3e80, 0xc820298700, 0xc820097220, 0xc4e180, 0xc8202a64d0, 0x16, 0xa81a60, 0x11457c8, ...)
    /opt/go/src/net/rpc/server.go:383 +0x1c1
net/rpc.(*Server).ServeRequest(0xc8201e91c0, 0x7f0215968358, 0xc8200b2700, 0x0, 0x0)
    /opt/go/src/net/rpc/server.go:498 +0x294
github.com/hashicorp/consul/consul.(*Server).RPC(0xc820001e00, 0xd8e5b0, 0x10, 0xc4e180, 0xc8202a6460, 0xa81a60, 0x11457c8, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/consul/server.go:686 +0x109
github.com/hashicorp/consul/command/agent.(*localState).syncService(0xc8200750b0, 0xd0b7e0, 0x6, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/local.go:552 +0x4d1
github.com/hashicorp/consul/command/agent.(*localState).syncChanges(0xc8200750b0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/local.go:449 +0x572
github.com/hashicorp/consul/command/agent.(*HTTPServer).syncChanges(0xc820343f90)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/agent_endpoint.go:342 +0x3b
github.com/hashicorp/consul/command/agent.(*HTTPServer).AgentRegisterService(0xc820343f90, 0x7f0215968770, 0xc820323130, 0xc8200a6d20, 0x0, 0x0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/agent_endpoint.go:236 +0x8db
github.com/hashicorp/consul/command/agent.(*HTTPServer).AgentRegisterService-fm(0x7f0215968770, 0xc820323130, 0xc8200a6d20, 0x0, 0x0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/http.go:236 +0x5a
github.com/hashicorp/consul/command/agent.(*HTTPServer).wrap.func1(0x7f0215968770, 0xc820323130, 0xc8200a6d20)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/http.go:327 +0x45e
net/http.HandlerFunc.ServeHTTP(0xc8203fcb40, 0x7f0215968770, 0xc820323130, 0xc8200a6d20)
    /opt/go/src/net/http/server.go:1422 +0x3a
net/http.(*ServeMux).ServeHTTP(0xc8203f86c0, 0x7f0215968770, 0xc820323130, 0xc8200a6d20)
    /opt/go/src/net/http/server.go:1699 +0x17d
net/http.serverHandler.ServeHTTP(0xc8202e7020, 0x7f0215968770, 0xc820323130, 0xc8200a6d20)
    /opt/go/src/net/http/server.go:1862 +0x19e
net/http.(*conn).serve(0xc820323080)
    /opt/go/src/net/http/server.go:1361 +0xbee
created by net/http.(*Server).Serve
    /opt/go/src/net/http/server.go:1910 +0x3f6

And here's your agent request waiting for that lock to free up:

goroutine 1440 [semacquire, 4 minutes]:
sync.runtime_Semacquire(0xc8200750c0)
    /usr/local/go/src/runtime/sema.go:43 +0x26
sync.(*RWMutex).RLock(0xc8200750b4)
    /usr/local/go/src/sync/rwmutex.go:36 +0x58
github.com/hashicorp/consul/command/agent.(*localState).Services(0xc8200750b0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/local.go:179 +0x81
github.com/hashicorp/consul/command/agent.(*HTTPServer).AgentServices(0xc820343f90, 0x7f0215968770, 0xc821ef4580, 0xc821f16460, 0x0, 0x0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/agent_endpoint.go:36 +0x48
github.com/hashicorp/consul/command/agent.(*HTTPServer).AgentServices-fm(0x7f0215968770, 0xc821ef4580, 0xc821f16460, 0x0, 0x0, 0x0, 0x0)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/http.go:224 +0x5a
github.com/hashicorp/consul/command/agent.(*HTTPServer).wrap.func1(0x7f0215968770, 0xc821ef4580, 0xc821f16460)
    /opt/gopath/src/github.com/hashicorp/consul/command/agent/http.go:327 +0x45e
net/http.HandlerFunc.ServeHTTP(0xc8203fc8a0, 0x7f0215968770, 0xc821ef4580, 0xc821f16460)
    /opt/go/src/net/http/server.go:1422 +0x3a
net/http.(*ServeMux).ServeHTTP(0xc8203f86c0, 0x7f0215968770, 0xc821ef4580, 0xc821f16460)
    /opt/go/src/net/http/server.go:1699 +0x17d
net/http.serverHandler.ServeHTTP(0xc8202e7020, 0x7f0215968770, 0xc821ef4580, 0xc821f16460)
    /opt/go/src/net/http/server.go:1862 +0x19e
net/http.(*conn).serve(0xc821ef4210)
    /opt/go/src/net/http/server.go:1361 +0xbee
created by net/http.(*Server).Serve
    /opt/go/src/net/http/server.go:1910 +0x3f6

Looking at the other goroutine stacks, there are also a bunch of KV writes that aren't making any progress. Since writes are stalled, these requests are piling up.
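
In other words, the shape of the problem is roughly the following (a stripped-down sketch with illustrative names, not the actual Consul code): the anti-entropy sync holds the local state lock while it blocks on a Raft apply that can never commit without a leader, so anything that reads the local state, like the agent endpoints, queues up behind it.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal model of the contention in the stacks above. The "Raft apply" never
// completes because there is no quorum, so the lock is never released.
type localState struct {
	sync.RWMutex
	services map[string]string
}

func (l *localState) syncChanges(applyDone <-chan struct{}) {
	l.Lock() // held for the whole sync, like localState.syncChanges
	defer l.Unlock()
	<-applyDone // blocks indefinitely, like waiting on raftApply / deferError.Error
}

func (l *localState) Services() map[string]string {
	l.RLock() // /v1/agent/services blocks here while the sync above holds the lock
	defer l.RUnlock()
	return l.services
}

func main() {
	ls := &localState{services: map[string]string{"web": "ok"}}
	stuckApply := make(chan struct{}) // never closed: no leader, the write can't commit
	go ls.syncChanges(stuckApply)

	time.Sleep(100 * time.Millisecond)
	done := make(chan struct{})
	go func() { ls.Services(); close(done) }()

	select {
	case <-done:
		fmt.Println("agent endpoint answered")
	case <-time.After(time.Second):
		fmt.Println("agent endpoint hung waiting on the local state lock")
	}
}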

Sorry I didn't notice this in the thread before, but rebooting all 5 servers at once will put Consul into an outage situation that requires manual recovery and can result in data loss (Consul can only handle the failure of a minority of nodes in a multi-node cluster). In this situation, you'd need to choose one of the servers to be the new leader and start it up in bootstrap mode (you may have to clear peers.json if there are entries in there for the other servers). Then you can re-join the other servers with their data directories cleaned out and they will take on the state of the single leader, and expand the quorum as they are added.

The fact that it works some of the time probably depends on the state of quorum at the time the cluster was killed, but in general Consul can't automatically recover from this. When you are in this situation where requests are hanging, running https://www.consul.io/docs/commands/info.html on each server should show num_peers as too low on some of the servers, and nobody will likely be the leader.
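
If running consul info everywhere is inconvenient, the same check can be scripted against the status endpoints, which answer locally even when writes are stuck. A rough sketch (localhost:8500 is a placeholder for each server's HTTP address):

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"
)

func main() {
	// /v1/status/leader returns an empty string when the server sees no leader, and
	// /v1/status/peers shows which Raft peers it currently knows about.
	client := &http.Client{Timeout: 2 * time.Second}
	for _, path := range []string{"/v1/status/leader", "/v1/status/peers"} {
		resp, err := client.Get("http://localhost:8500" + path)
		if err != nil {
			fmt.Printf("%s: %v\n", path, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: %s\n", path, body)
	}
}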

shimzim commented 8 years ago

Thanks @slackpad for the analysis. Indeed, in this state we found we cannot write to the KV store. I had some correspondence on the outage issue with @armon in the past (see here: https://groups.google.com/forum/#!topic/consul-tool/k1ZnMHVwobA ), in which he said that if we restart each of the consul agents with the same flags they had in their previous run, we should be OK - has that changed? BTW, we added post-restart code that restarts the consul process when it detects that writes are stuck. Sometimes it takes multiple restarts, but eventually it works. However, that's just a workaround. I'd appreciate your input on this.
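
For illustration, a stuck-write check can be as simple as a KV write with a short timeout (a sketch, not our exact code; the key name and 3-second timeout are arbitrary):

package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// probeWrite returns true if a small KV write completes quickly. With no cluster
// leader the PUT either fails or runs past the timeout, which can be used as the
// signal to restart the local consul process.
func probeWrite() bool {
	client := &http.Client{Timeout: 3 * time.Second}
	req, err := http.NewRequest("PUT",
		"http://localhost:8500/v1/kv/_probe/write-check", // hypothetical probe key
		bytes.NewBufferString("ping"))
	if err != nil {
		return false
	}
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	if !probeWrite() {
		fmt.Println("KV writes look stuck; restart the local consul process")
	}
}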

slackpad commented 8 years ago

Hmm, I may have misunderstood the behavior in the case where they are all powered off - I thought that would be an outage situation, but if the nodes haven't actually left, it does make sense that they should be able to pick up from the Raft log, which is durable. It seems like there's either some edge case in the Raft implementation that's preventing a proper leader election here, or perhaps the timing is just unlucky and it's taking a much longer time to converge on a proper leader.

Would you happen to have any logs from the servers while this is happening? That might yield some clues about what's going on.

shimzim commented 8 years ago

Sure - here are the logs from a 4-node setup; I've added the args with which we ran consul before each node's log. The logs are from after the crash:

ARGS: ['consul', 'agent', '-bind=1.77.238.94', '-data-dir=/var/run/consul', '-client=0.0.0.0', '-config-file=/etc/consul_config.json', '-rejoin', '-config-dir=/etc/consul.d', '-server', u'-retry-join=1.77.232.232', u'-retry-join=1.77.233.135', '-retry-interval=5s']

==> Starting raft data migration...
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
    Node name: 'stratonode0'
    Datacenter: 'stratocluster-9626390526591630310__1_2_da49521d'
    Server: true (bootstrap: false)
    Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 53, RPC: 8400)
    Cluster Addr: 1.77.238.94 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
    Atlas:

==> Log data will now stream in as it occurs:

2016/01/25 03:05:28 [INFO] raft: Node at 1.77.238.94:8300 [Follower] entering Follower state
2016/01/25 03:05:28 [WARN] memberlist: Binding to public address without encryption!
2016/01/25 03:05:28 [INFO] serf: EventMemberJoin: stratonode0 1.77.238.94
2016/01/25 03:05:28 [INFO] consul: adding server stratonode0 (Addr: 1.77.238.94:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [WARN] memberlist: Binding to public address without encryption!
2016/01/25 03:05:28 [INFO] serf: Attempting re-join to previously known node: stratonode10472: 1.77.232.232:8301
2016/01/25 03:05:28 [INFO] serf: EventMemberJoin: stratonode0.stratocluster-9626390526591630310__1_2_da49521d 1.77.238.94
2016/01/25 03:05:28 [WARN] serf: Failed to re-join any previously known node
2016/01/25 03:05:28 [INFO] consul: adding server stratonode0.stratocluster-9626390526591630310__1_2_da49521d (Addr: 1.77.238.94:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [ERR] agent: failed to sync remote state: No cluster leader
2016/01/25 03:05:28 [INFO] serf: Attempting re-join to previously known node: stratonode10631: 1.77.233.135:8301
2016/01/25 03:05:28 [INFO] agent: Joining cluster...
2016/01/25 03:05:28 [INFO] agent: (LAN) joining: [1.77.232.232 1.77.233.135]
2016/01/25 03:05:28 [INFO] serf: Attempting re-join to previously known node: stratonode6577: 1.77.217.177:8301
2016/01/25 03:05:28 [INFO] agent: (LAN) joined: 0 Err: dial tcp 1.77.233.135:8301: connection refused
2016/01/25 03:05:28 [WARN] agent: Join failed: dial tcp 1.77.233.135:8301: connection refused, retrying in 5s
2016/01/25 03:05:28 [INFO] serf: EventMemberJoin: stratonode10631 1.77.233.135
2016/01/25 03:05:28 [INFO] consul: adding server stratonode10631 (Addr: 1.77.233.135:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [INFO] serf: EventMemberJoin: stratonode10472 1.77.232.232
2016/01/25 03:05:28 [INFO] consul: adding server stratonode10472 (Addr: 1.77.232.232:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [ERR] http: Request /v1/kv/dirtyprivates/dontuseneverever, error: No cluster leader
2016/01/25 03:05:28 [ERR] http: Request /v1/kv/stratonode0.node.strato/readyNics, error: No cluster leader
2016/01/25 03:05:28 [ERR] http: Request /v1/kv/dirtyprivates/dontuseneverever, error: No cluster leader
2016/01/25 03:05:29 [WARN] raft: Heartbeat timeout reached, starting election
2016/01/25 03:05:29 [INFO] raft: Node at 1.77.238.94:8300 [Candidate] entering Candidate state
2016/01/25 03:05:29 [INFO] raft: Election won. Tally: 2
2016/01/25 03:05:29 [INFO] raft: Node at 1.77.238.94:8300 [Leader] entering Leader state
2016/01/25 03:05:29 [INFO] consul: cluster leadership acquired
2016/01/25 03:05:29 [INFO] consul: New leader elected: stratonode0
2016/01/25 03:05:29 [INFO] raft: pipelining replication to peer 1.77.232.232:8300
2016/01/25 03:05:29 [INFO] raft: pipelining replication to peer 1.77.233.135:8300
2016/01/25 03:05:29 [INFO] raft: Removed peer 1.77.233.135:8300, stopping replication (Index: 1)
2016/01/25 03:05:29 [INFO] raft: Removed peer 1.77.232.232:8300, stopping replication (Index: 1)
2016/01/25 03:05:29 [INFO] raft: aborting pipeline replication to peer 1.77.232.232:8300
2016/01/25 03:05:29 [INFO] raft: aborting pipeline replication to peer 1.77.233.135:8300
2016/01/25 03:05:29 [INFO] raft: Added peer 1.77.233.135:8300, starting replication
2016/01/25 03:05:29 [INFO] raft: Added peer 1.77.232.232:8300, starting replication
2016/01/25 03:05:29 [INFO] raft: pipelining replication to peer 1.77.233.135:8300
2016/01/25 03:05:29 [INFO] raft: pipelining replication to peer 1.77.232.232:8300
2016/01/25 03:05:31 [WARN] serf: Failed to re-join any previously known node
2016/01/25 03:05:33 [WARN] agent: Check 'service:rabbitmq-server' is now critical
2016/01/25 03:05:33 [INFO] agent: (LAN) joining: [1.77.232.232 1.77.233.135]
2016/01/25 03:05:33 [INFO] agent: (LAN) joined: 2 Err: <nil>
2016/01/25 03:05:33 [INFO] agent: Join completed. Synced with 2 initial agents
2016/01/25 03:07:15 [INFO] serf: EventMemberJoin: stratonode6577 1.77.217.177
2016/01/25 03:22:57 [INFO] agent.rpc: Accepted client: 127.0.0.1:65150

ARGS: ['consul', 'agent', '-bind=1.77.217.177', '-data-dir=/var/run/consul', '-client=0.0.0.0', '-config-file=/etc/consul_config.json', '-rejoin', '-config-dir=/etc/consul.d', '-retry-interval=5s']

==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
    Node name: 'stratonode6577'
    Datacenter: 'stratocluster-9626390526591630310__1_2_da49521d'
    Server: false (bootstrap: false)
    Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 53, RPC: 8400)
    Cluster Addr: 1.77.217.177 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
    Atlas:

==> Log data will now stream in as it occurs:

2016/01/25 03:07:14 [WARN] memberlist: Binding to public address without encryption!
2016/01/25 03:07:14 [INFO] serf: EventMemberJoin: stratonode6577 1.77.217.177
2016/01/25 03:07:14 [INFO] serf: Attempting re-join to previously known node: stratonode10631: 1.77.233.135:8301
2016/01/25 03:07:14 [ERR] agent: failed to sync remote state: No known Consul servers
2016/01/25 03:07:14 [INFO] serf: EventMemberJoin: stratonode10472 1.77.232.232
2016/01/25 03:07:14 [INFO] serf: EventMemberJoin: stratonode0 1.77.238.94
2016/01/25 03:07:14 [INFO] consul: adding server stratonode10472 (Addr: 1.77.232.232:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:07:14 [INFO] serf: EventMemberJoin: stratonode10631 1.77.233.135
2016/01/25 03:07:14 [INFO] consul: adding server stratonode0 (Addr: 1.77.238.94:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:07:14 [INFO] consul: adding server stratonode10631 (Addr: 1.77.233.135:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:07:14 [INFO] serf: Re-joined to previously known node: stratonode10631: 1.77.233.135:8301
2016/01/25 03:07:14 [INFO] consul: New leader elected: stratonode0
2016/01/25 03:23:03 [INFO] agent.rpc: Accepted client: 127.0.0.1:58726

ARGS: ['consul', 'agent', '-bind=1.77.232.232', '-data-dir=/var/run/consul', '-client=0.0.0.0', '-config-file=/etc/consul_config.json', '-rejoin', '-config-dir=/etc/consul.d', '-server', u'-retry-join=1.77.238.94', u'-retry-join=1.77.233.135', '-retry-interval=5s']

==> Starting raft data migration...
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
    Node name: 'stratonode10472'
    Datacenter: 'stratocluster-9626390526591630310__1_2_da49521d'
    Server: true (bootstrap: false)
    Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 53, RPC: 8400)
    Cluster Addr: 1.77.232.232 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
    Atlas:

==> Log data will now stream in as it occurs:

2016/01/25 03:05:28 [INFO] raft: Node at 1.77.232.232:8300 [Follower] entering Follower state
2016/01/25 03:05:28 [INFO] serf: Ignoring previous leave in snapshot
2016/01/25 03:05:28 [WARN] memberlist: Binding to public address without encryption!
2016/01/25 03:05:28 [INFO] serf: EventMemberJoin: stratonode10472 1.77.232.232
2016/01/25 03:05:28 [INFO] serf: Attempting re-join to previously known node: stratonode6577: 1.77.217.177:8301
2016/01/25 03:05:28 [INFO] consul: adding server stratonode10472 (Addr: 1.77.232.232:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [WARN] memberlist: Binding to public address without encryption!
2016/01/25 03:05:28 [INFO] serf: EventMemberJoin: stratonode10472.stratocluster-9626390526591630310__1_2_da49521d 1.77.232.232
2016/01/25 03:05:28 [WARN] serf: Failed to re-join any previously known node
2016/01/25 03:05:28 [INFO] consul: adding server stratonode10472.stratocluster-9626390526591630310__1_2_da49521d (Addr: 1.77.232.232:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [ERR] agent: failed to sync remote state: No cluster leader
2016/01/25 03:05:28 [INFO] agent: Joining cluster...
2016/01/25 03:05:28 [INFO] agent: (LAN) joining: [1.77.238.94 1.77.233.135]
2016/01/25 03:05:28 [INFO] serf: EventMemberJoin: stratonode10631 1.77.233.135
2016/01/25 03:05:28 [INFO] serf: EventMemberJoin: stratonode0 1.77.238.94
2016/01/25 03:05:28 [INFO] consul: adding server stratonode10631 (Addr: 1.77.233.135:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [INFO] consul: adding server stratonode0 (Addr: 1.77.238.94:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [INFO] agent: (LAN) joined: 2 Err: <nil>
2016/01/25 03:05:28 [INFO] agent: Join completed. Synced with 2 initial agents
2016/01/25 03:05:28 [ERR] http: Request /v1/kv/dirtyprivates/dontuseneverever, error: No cluster leader
2016/01/25 03:05:29 [INFO] consul: New leader elected: stratonode0
2016/01/25 03:05:31 [INFO] serf: Attempting re-join to previously known node: stratonode0: 1.77.238.94:8301
2016/01/25 03:05:31 [INFO] serf: Re-joined to previously known node: stratonode0: 1.77.238.94:8301
2016/01/25 03:06:02 [WARN] agent: Check 'service:nrad' is now critical
2016/01/25 03:07:15 [INFO] serf: EventMemberJoin: stratonode6577 1.77.217.177
2016/01/25 03:22:57 [INFO] agent.rpc: Accepted client: 127.0.0.1:51723

ARGS: ['consul', 'agent', '-bind=1.77.233.135', '-data-dir=/var/run/consul', '-client=0.0.0.0', '-config-file=/etc/consul_config.json', '-rejoin', '-config-dir=/etc/consul.d', '-server', u'-retry-join=1.77.232.232', u'-retry-join=1.77.238.94', '-retry-interval=5s']

==> Starting raft data migration...
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
    Node name: 'stratonode10631'
    Datacenter: 'stratocluster-9626390526591630310__1_2_da49521d'
    Server: true (bootstrap: false)
    Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: -1, DNS: 53, RPC: 8400)
    Cluster Addr: 1.77.233.135 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
    Atlas:

==> Log data will now stream in as it occurs:

2016/01/25 03:05:27 [INFO] raft: Node at 1.77.233.135:8300 [Follower] entering Follower state
2016/01/25 03:05:27 [INFO] serf: Ignoring previous leave in snapshot
2016/01/25 03:05:27 [WARN] memberlist: Binding to public address without encryption!
2016/01/25 03:05:27 [INFO] serf: EventMemberJoin: stratonode10631 1.77.233.135
2016/01/25 03:05:27 [INFO] serf: Attempting re-join to previously known node: stratonode10472: 1.77.232.232:8301
2016/01/25 03:05:27 [INFO] consul: adding server stratonode10631 (Addr: 1.77.233.135:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:27 [INFO] serf: Attempting re-join to previously known node: stratonode0: 1.77.238.94:8301
2016/01/25 03:05:27 [WARN] memberlist: Binding to public address without encryption!
2016/01/25 03:05:27 [INFO] serf: EventMemberJoin: stratonode10631.stratocluster-9626390526591630310__1_2_da49521d 1.77.233.135
2016/01/25 03:05:27 [WARN] serf: Failed to re-join any previously known node
2016/01/25 03:05:27 [INFO] consul: adding server stratonode10631.stratocluster-9626390526591630310__1_2_da49521d (Addr: 1.77.233.135:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:27 [INFO] agent: Joining cluster...
2016/01/25 03:05:27 [INFO] agent: (LAN) joining: [1.77.232.232 1.77.238.94]
2016/01/25 03:05:27 [ERR] agent: failed to sync remote state: No cluster leader
2016/01/25 03:05:27 [INFO] serf: EventMemberJoin: stratonode0 1.77.238.94
2016/01/25 03:05:27 [INFO] consul: adding server stratonode0 (Addr: 1.77.238.94:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:27 [INFO] serf: Re-joined to previously known node: stratonode0: 1.77.238.94:8301
2016/01/25 03:05:27 [INFO] agent: (LAN) joined: 1 Err: <nil>
2016/01/25 03:05:27 [INFO] agent: Join completed. Synced with 1 initial agents
2016/01/25 03:05:27 [INFO] serf: EventMemberJoin: stratonode10472 1.77.232.232
2016/01/25 03:05:27 [INFO] consul: adding server stratonode10472 (Addr: 1.77.232.232:8300) (DC: stratocluster-9626390526591630310__1_2_da49521d)
2016/01/25 03:05:28 [ERR] http: Request /v1/kv/dirtyprivates/dontuseneverever, error: No cluster leader
2016/01/25 03:05:28 [ERR] http: Request /v1/kv/dirtyprivates/dontuseneverever, error: No cluster leader
2016/01/25 03:05:28 [INFO] consul: New leader elected: stratonode0
2016/01/25 03:07:14 [INFO] serf: EventMemberJoin: stratonode6577 1.77.217.177
2016/01/25 03:23:09 [INFO] agent.rpc: Accepted client: 127.0.0.1:52391

slackpad commented 8 years ago

@shimshon-stratoscale interesting - these lines are strange, and they look like they might be walking the cluster down into an outage state:

2016/01/25 03:05:29 [INFO] raft: Removed peer 1.77.233.135:8300, stopping replication (Index: 1)
2016/01/25 03:05:29 [INFO] raft: Removed peer 1.77.232.232:8300, stopping replication (Index: 1)

It adds them back right away, but it looks like it thinks they have done a leave. Is it possible during power down that your servers get a few seconds to terminate processes? I'm wondering if setting https://www.consul.io/docs/agent/options.html#skip_leave_on_interrupt to true would help here?
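
For completeness, that would just be an extra entry in the agent's JSON config (e.g. the /etc/consul_config.json you are already passing), along these lines:

{
  "skip_leave_on_interrupt": true
}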

shimzim commented 8 years ago

@slackpad - we caused the crash in our test using an "ipmi power off" command, which as far as I understand is not graceful. In addition, our system is configured so that even on a graceful shutdown we stop consul with SIGKILL, so we never actually leave the cluster (unless we specifically want that node to leave the cluster). So I don't think that's the case here.

slackpad commented 7 years ago

Possibly related - https://github.com/hashicorp/consul/issues/2107.

hanshasselberg commented 4 years ago

Thank you for reporting! This issue is very old, which is why I am closing it. If it is still relevant or if there are updates, feel free to open a new issue.