Closed BRMatt closed 7 years ago
This is interesting, in the logs I noticed that the event buffer is never being drained and just keeps accumulating. That code is all the way down in memberlist, so something strange is going on here. Thanks for the detailed report, we'll dive into this!
Thanks! If you want I can provide the full consul log file? It goes back a few months and would have the lines leading up to this event (disclaimer: it's 1.24GB).
Can I vote for this bug too? We have a use case where we also run consul on a single node, and we are running into this. We are having to introduce retries for all client calls to mitigate it.
Yes. Please fix it. :pray: This is a bone-breaking issue for us. :scream:
Root cause and some discussion here - https://groups.google.com/forum/#!msg/consul-tool/WsDZ1vFEuu0/Rt5qV-t4MgAJ.
@slackpad I'm not so sure they're related. The issue I had was the node acquiring leadership, then losing it straight away (within a second, when running with a quorum of 1). The buffer thing was separate from that.
Either way, I haven't seen this recently, so could be worth closing, unless others are also experiencing it?
Closing this out. The event concern is addressed via https://github.com/hashicorp/consul/issues/1387#issuecomment-238988733 and we've added a -dev mode since 0.5.2 which makes it easy to run a single server cluster for development. You will need to have any applications retry while leadership is being established, but things should be stable after that.
Also note that -dev runs totally without any data directory so it's much easier to not have old state affect future clusters (looks like the original poster might have had a stale Raft peer).
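For anyone adding the retries mentioned above, here is a minimal sketch of waiting for a leader before making client calls. It assumes the agent's HTTP API is reachable at 127.0.0.1:8500 and that the /v1/status/leader endpoint returns an empty string while no leader is elected; the helper names fetch_leader and wait_for_leader and all timing values are made up for illustration, so adjust them to your setup.

```python
import json
import time
import urllib.request


def fetch_leader(base_url="http://127.0.0.1:8500"):
    """Ask the local agent for the current leader address.

    Assumption: base_url points at the agent's HTTP API. The endpoint
    returns a JSON string, which is empty while there is no leader.
    """
    with urllib.request.urlopen(base_url + "/v1/status/leader", timeout=2) as resp:
        return json.loads(resp.read().decode("utf-8"))


def wait_for_leader(fetch=fetch_leader, attempts=10, delay=0.5,
                    backoff=2.0, max_delay=5.0, sleep=time.sleep):
    """Poll until a leader is reported, backing off between attempts."""
    for attempt in range(attempts):
        try:
            leader = fetch()
        except (OSError, ValueError):
            # Agent not reachable yet, or it returned something unparseable.
            leader = ""
        if leader:
            return leader
        if attempt < attempts - 1:
            sleep(delay)
            delay = min(delay * backoff, max_delay)
    raise TimeoutError("no leader reported after %d attempts" % attempts)
```

Calling wait_for_leader() once at application startup, before the first real request, is usually enough on a single server setup; individual client calls may still want their own retry if leadership is lost later.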
I have consul 0.5.2 running in a single long-running, local Ubuntu 12.04 VM. Occasionally consul enters a "flappy" state where it seems to be both a leader and a follower...?
Here's a sample of the logs. This section was repeating itself over and over every few seconds.
Here's a copy of the config:
And nearly all of the services consul monitors are configured like so:
I noticed that there are two IPs referenced in the logs, but there's only one agent running on the machine. Here's a copy of ifconfig in this VM (eth0 is the NAT, eth1 is the interface for the private VBox network):
Restarting the service didn't fix the issue, I had to sudo rm -rf /var/consul then restart the service for it to calm down.
It may be worth noting that consul was consuming around 1GB of RES during this, and after the restart/nuking the raft dir it was using just 13MB.
Here is what the log file contained after nuking and restarting:
It's worth noting that we've never experienced this problem in production, only on local VMs. This may be related to #721, but I created a separate issue as the log messages seem to differ.