Open hehailong5 opened 6 years ago
Hi @hehailong5 can you reproduce this or was this a one-time event? Also, there have been several Raft-related fixes since 0.8.4 so I'd definitely recommend running a newer version of Consul.
I have only seen this once since we upgraded to 0.8.4. I have two questions, though: 1) Is there a timeout option for the Consul APIs? 2) We only detected this after half an hour, because we monitor the health of the Consul cluster via /v1/status/leader, and in this case that URL kept working as expected. How can we measure cluster health reliably, then?
Hi @hehailong5 we've fixed two issues in 1.0.0 and later that are probably related to this - #3545 and #3700. There's normally a timeout related to Raft itself where if a leader loses contact with its followers it will step down. With those issues Raft itself was working ok but there was a problem on the leader preventing it from taking writes, so things could get stuck.
What's weird though is that you are seeing "log not found" errors (similar to https://github.com/hashicorp/consul/issues/2837) that aren't consistent with that, so I think this needs a deeper look.
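Since /v1/status/leader kept answering while the leader could not take writes, one possible complement to that check is a probe that performs a real write with a client-side timeout. The following is only a minimal sketch, not an official recommendation: it assumes the agent's HTTP API is reachable at the given address, that the caller is allowed to write to the KV store, and the key name `health/write-probe` and the function `probe_write` are made up for illustration. It relies on `PUT /v1/kv/<key>` returning `true` in the response body on success.

```python
import json
import time
import urllib.error
import urllib.request


def probe_write(base_url, key="health/write-probe", timeout=2.0,
                opener=urllib.request.urlopen):
    """Attempt one KV write and return (ok, elapsed_seconds).

    The timeout is enforced on the client side, so a leader that is
    stuck and never answers is reported as unhealthy instead of hanging.
    `key` is a hypothetical probe key; pick one that suits your setup.
    """
    url = f"{base_url.rstrip('/')}/v1/kv/{key}"
    body = str(time.time()).encode()  # any small payload will do
    req = urllib.request.Request(url, data=body, method="PUT")
    start = time.monotonic()
    try:
        with opener(req, timeout=timeout) as resp:
            # A successful KV write answers HTTP 200 with the JSON body `true`.
            ok = resp.status == 200 and json.loads(resp.read()) is True
    except (urllib.error.URLError, OSError, ValueError):
        # Covers connection errors, timeouts, HTTP errors, and bad JSON.
        ok = False
    return ok, time.monotonic() - start
```

A monitoring job could call `probe_write("http://127.0.0.1:8500")` periodically, next to the existing /v1/status/leader check, and alert when `ok` is false or the elapsed time climbs well above its usual value. Because a write has to go through the leader and Raft, this exercises the path that was stuck here, at the cost of one small write per probe.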
Hi @slackpad, I also encountered the problem in #3852. When will it be solved?
Sorry to piggyback on the issue, but as recently as 10/01 we experienced a system-wide outage because of lockups similar to those described in #3852. With the TTL set to 3 seconds for a number of services, we had a wave of timeouts in which TTL updates exceeded 3 - 5 seconds. Since this is a fairly small cluster, updates normally take milliseconds or even microseconds for us. Can you please review the issue and let us know what can be done to prevent it from happening again?
Any help is greatly appreciated.
version: 0.8.4
When this happens, the leader repeatedly prints:
and the follower prints: