hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul write api hangs for half hour before return rpc error #3738

Open hehailong5 opened 6 years ago

hehailong5 commented 6 years ago

version: 0.8.4

When this happens, the leader repeatedly prints:

2017/12/05 11:10:21 [WARN] raft: AppendEntries to {Voter 10.0.0.9:8300 10.0.0.9:8300} rejected, sending older logs (next: 14336)
2017/12/05 11:10:21 [ERR] raft: Failed to get log at index 14335: log not found

and the follower prints:

2017/12/05 11:10:11 [ERR] consul: failed to reconcile member: {server-1 10.0.0.10 8301 map[build:0.8.4:f436077 vsn_max:3 raft_vsn:2 wan_join_port:8302 dc:dc1 id:c29aeab8-3731-17d0-8796-b7af5d3953ea port:8300 role:consul vsn:2 vsn_min:2] alive 1 5 2 2 5 4}: leadership lost while committing log
2017/12/05 11:10:11 [ERR] consul: failed to reconcile: leadership lost while committing log
2017/12/05 11:10:11 [INFO] consul: cluster leadership lost
2017/12/05 11:10:18 [WARN] raft: Previous log term mis-match: ours: 4 remote: 5
2017/12/05 11:10:18 [INFO] consul: New leader elected: server-1
2017/12/05 11:10:18 [INFO] snapshot: Creating new snapshot at /opt/application/consul-works/data-dir/raft/snapshots/5-14336-1512472218803.tmp
2017/12/05 11:10:18 [INFO] snapshot: reaping snapshot /opt/application/consul-works/data-dir/raft/snapshots/3-12783-1512459076524
2017/12/05 11:10:18 [INFO] raft: Copied 625724 bytes to local snapshot
2017/12/05 11:10:18 [INFO] raft: Compacting logs from 14333 to 4113
2017/12/05 11:10:18 [INFO] raft: Installed remote snapshot
2017/12/05 11:10:18 [WARN] raft: Previous log term mis-match: ours: 4 remote: 5
2017/12/05 11:10:18 [INFO] snapshot: Creating new snapshot at /opt/application/consul-works/data-dir/raft/snapshots/5-14336-1512472218916.tmp

slackpad commented 6 years ago

Hi @hehailong5 can you reproduce this or was this a one-time event? Also, there have been several Raft-related fixes since 0.8.4 so I'd definitely recommend running a newer version of Consul.

hehailong5 commented 6 years ago

We have only seen this once since we upgraded to 0.8.4. I have two questions, though: 1) Is there a timeout option for the Consul APIs? 2) We only detected this after half an hour, because we monitor the health of the Consul cluster via /v1/status/leader, and in this case that URL kept working as expected. How can we measure the cluster's health reliably, then?

slackpad commented 6 years ago

Hi @hehailong5 we've fixed two issues in 1.0.0 and later that are probably related to this - #3545 and #3700. There's normally a timeout related to Raft itself where if a leader loses contact with its followers it will step down. With those issues Raft itself was working ok but there was a problem on the leader preventing it from taking writes, so things could get stuck.

What's weird though is that you are seeing "log not found" errors (similar to https://github.com/hashicorp/consul/issues/2837) that aren't consistent with that, so I think this needs a deeper look.
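
One way to act on the point above, that a leader can look healthy while being unable to take writes, is to pair the read-only /v1/status/leader check with a small write probe that has a client-side timeout. The following is a minimal sketch, not an official Consul recommendation. It assumes an agent reachable over plain HTTP at base_url and uses only the /v1/status/leader and PUT /v1/kv/<key> endpoints; the function names, the probe key health/write-probe, and the 2-second default timeout are hypothetical choices for illustration.

```python
import json
import urllib.request


def leader_known(base_url, timeout=2.0):
    """Return True if /v1/status/leader reports a non-empty leader address."""
    try:
        with urllib.request.urlopen(
            base_url + "/v1/status/leader", timeout=timeout
        ) as resp:
            # The endpoint returns a JSON string such as "10.0.0.10:8300",
            # or an empty string when no leader is known.
            return bool(json.loads(resp.read().decode("utf-8")))
    except (OSError, ValueError):
        # OSError covers URLError, HTTPError, and socket timeouts.
        return False


def write_ok(base_url, key="health/write-probe", timeout=2.0):
    """Return True if a small KV write completes within `timeout` seconds.

    The key name is a hypothetical placeholder; pick one that cannot
    collide with real application data.
    """
    req = urllib.request.Request(
        base_url + "/v1/kv/" + key, data=b"probe", method="PUT"
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            # A successful KV PUT responds with the literal body `true`.
            return resp.read().strip() == b"true"
    except OSError:
        # A hung write surfaces here as a timeout instead of blocking.
        return False
```

A monitor could, for example, alert when leader_known(...) is True but write_ok(...) stays False for several consecutive runs. The client-side timeout only stops the caller from waiting; it does not cancel a write that is stuck on the server.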

chymy commented 6 years ago

Hi @slackpad, I also encountered the problem in #3852. When will it be solved?

alitvak69 commented 6 years ago

Sorry to piggyback on this issue, but as recently as 10/01 we experienced a system-wide outage caused by similar lockups, which are also described in #3852. With the TTL set to 3 seconds for a number of services, we had a wave of timeouts where TTL updates were taking longer than 3 to 5 seconds. Since this is a fairly small cluster, updates normally take milliseconds or even microseconds for us. Can you please review the issue and let us know what can be done to prevent it from happening again?

Any help is greatly appreciated.