hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul occasionally enters a flappy state when running in a single server cluster #1387

Closed BRMatt closed 7 years ago

BRMatt commented 8 years ago

I have consul 0.5.2 running in a single long-running, local Ubuntu 12.04 VM. Occasionally consul enters a "flappy" state where it seems to be both a leader and a follower at the same time.

Here's a sample of the logs. This section was repeating itself over and over every few seconds.

Here's a copy of the config:

{
  "server": true,
  "datacenter": "us-east-1",
  "data_dir": "/var/consul",
  "bootstrap_expect": 1,
  "rejoin_after_leave": true,
  "retry_join": [

  ]
}

And nearly all of the services consul monitors are configured like so:

{
  "service": {
    "id": "omnibus",
    "name": "omnibus",
    "port": 8081,
    "tags": [
      "server"
    ]
  }
}

I noticed that there are two IPs referenced in the logs, but there's only one agent running on the machine. Here's the output of ifconfig in this VM (eth0 is the NAT, eth1 is the interface for the private VBox network):

eth0      Link encap:Ethernet  HWaddr 08:00:27:88:0c:a6
          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe88:ca6/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:239491 errors:0 dropped:0 overruns:0 frame:0
          TX packets:169008 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:84093600 (84.0 MB)  TX bytes:21916505 (21.9 MB)

eth1      Link encap:Ethernet  HWaddr 08:00:27:96:24:be
          inet addr:192.168.33.10  Bcast:192.168.33.255  Mask:255.255.255.0
          inet6 addr: fe80::a00:27ff:fe96:24be/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:35370 errors:0 dropped:0 overruns:0 frame:0
          TX packets:59450 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:8122628 (8.1 MB)  TX bytes:79961335 (79.9 MB)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:1644618233 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1644618233 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:1629255731101 (1.6 TB)  TX bytes:1629255731101 (1.6 TB)
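
For reference, one way to make the agent bind to the private interface rather than the NAT one is to pin the address explicitly. This is only a sketch based on the interfaces above, assuming 192.168.33.10 is the address the agent should advertise; the keys would be merged into the config shown earlier:

{
  "bind_addr": "192.168.33.10",
  "advertise_addr": "192.168.33.10"
}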

Restarting the service didn't fix the issue; I had to sudo rm -rf /var/consul and then restart the service for it to calm down.

It may be worth noting that consul was consuming around 1 GB of resident memory (RES) during this; after nuking the raft dir and restarting, it was using just 13 MB.

Here is what the log file contained after nuking and restarting:

==> WARNING: BootstrapExpect Mode is specified as 1; this is the same as Bootstrap mode.
==> WARNING: Bootstrap mode enabled! Do not enable unless necessary
==> Starting raft data migration...
==> Starting Consul agent...
==> Starting Consul agent RPC...
==> Consul agent running!
         Node name: 'precise64'
        Datacenter: 'us-east-1'
            Server: true (bootstrap: true)
       Client Addr: 127.0.0.1 (HTTP: 8500, HTTPS: -1, DNS: 8600, RPC: 8400)
      Cluster Addr: 10.0.2.15 (LAN: 8301, WAN: 8302)
    Gossip encrypt: false, RPC-TLS: false, TLS-Incoming: false
             Atlas: <disabled>

==> Log data will now stream in as it occurs:

    2015/11/06 11:45:54 [INFO] raft: Restored from snapshot 0-0-1446810320315
    2015/11/06 11:45:54 [INFO] serf: Ignoring previous leave in snapshot
    2015/11/06 11:45:54 [INFO] serf: EventMemberJoin: precise64 10.0.2.15
    2015/11/06 11:45:54 [INFO] serf: Ignoring previous leave in snapshot
    2015/11/06 11:45:54 [INFO] serf: EventMemberJoin: precise64.us-east-1 10.0.2.15
    2015/11/06 11:45:54 [INFO] raft: Node at 10.0.2.15:8300 [Follower] entering Follower state
    2015/11/06 11:45:54 [WARN] serf: Failed to re-join any previously known node
    2015/11/06 11:45:54 [WARN] serf: Failed to re-join any previously known node
    2015/11/06 11:45:54 [INFO] consul: adding server precise64.us-east-1 (Addr: 10.0.2.15:8300) (DC: us-east-1)
    2015/11/06 11:45:54 [INFO] consul: adding server precise64 (Addr: 10.0.2.15:8300) (DC: us-east-1)
    2015/11/06 11:45:54 [ERR] agent: failed to sync remote state: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/omnibus?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/gecko-data?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/management-api?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/atlas?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/gecko-auth?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/app?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/bliss?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/billing?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/expo?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/spreadsheets?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/hippo?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/tabloid?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/track?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/pegasus?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [ERR] http: Request /v1/health/service/fake-auth?dc=us-east-1&index=104466&passing=1&tag=server&wait=60000ms, error: No cluster leader
    2015/11/06 11:45:55 [WARN] raft: Heartbeat timeout reached, starting election
    2015/11/06 11:45:55 [INFO] raft: Node at 10.0.2.15:8300 [Candidate] entering Candidate state
    2015/11/06 11:45:55 [INFO] raft: Election won. Tally: 1
    2015/11/06 11:45:55 [INFO] raft: Node at 10.0.2.15:8300 [Leader] entering Leader state
    2015/11/06 11:45:55 [INFO] consul: cluster leadership acquired
    2015/11/06 11:45:55 [INFO] consul: New leader elected: precise64
    2015/11/06 11:45:55 [INFO] raft: Disabling EnableSingleNode (bootstrap)
    2015/11/06 11:45:57 [INFO] agent: Synced check 'service:haproxy-clerk'
    2015/11/06 11:45:59 [INFO] agent: Synced check 'service:haproxy-clerk'

It's worth noting that we've never experienced this problem in production, only on local VMs. This may be related to #721, but I created a separate issue as the log messages seem to differ.

ryanuber commented 8 years ago

This is interesting; in the logs I noticed that the event buffer is never being drained and just keeps accumulating. That code is all the way down in memberlist, so something strange is going on here. Thanks for the detailed report, we'll dive into this!

BRMatt commented 8 years ago

Thanks! If you want, I can provide the full consul log file; it goes back a few months and has the lines leading up to this event (disclaimer: it's 1.24 GB).

saimonmoore commented 8 years ago

Can I vote for this bug too? We have a use case where we also run consul as a single node and are running into this. We're having to introduce retries for all client calls to mitigate it.
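
For illustration, here is a minimal sketch of such a retry against the health endpoint that appears in the logs above, using the agent's default local HTTP address; the attempt count and sleep interval are made-up values:

# Retry a health query until a leader is available.
# Consul returns HTTP 500 "No cluster leader" while leadership is unsettled,
# which curl -f treats as a failure.
for attempt in 1 2 3 4 5; do
  if curl -sf "http://127.0.0.1:8500/v1/health/service/omnibus?passing=1&tag=server"; then
    break
  fi
  sleep 2
done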

ivanilves commented 8 years ago

Yes. Please fix it. :pray: This is a bone-breaking issue for us. :scream:

slackpad commented 8 years ago

Root cause and some discussion here - https://groups.google.com/forum/#!msg/consul-tool/WsDZ1vFEuu0/Rt5qV-t4MgAJ.

BRMatt commented 8 years ago

@slackpad I'm not so sure they're related; the issue I had was the node acquiring leadership and then losing it straight away (within a second, when running with a quorum of 1). The buffer thing was separate from that.

Either way, I haven't seen this recently, so it could be worth closing unless others are also experiencing it?

slackpad commented 7 years ago

Closing this out. The event concern is addressed via https://github.com/hashicorp/consul/issues/1387#issuecomment-238988733, and since 0.5.2 we've added a -dev mode which makes it easy to run a single-server cluster for development. You will need to have any applications retry while leadership is being established, but things should be stable after that.
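
For development that just means running the agent with the flag, e.g.:

consul agent -dev

The agent should elect itself leader almost immediately, but clients can still see "No cluster leader" for a brief moment right after startup, so a retry like the sketch earlier in this thread still applies.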

slackpad commented 7 years ago

Also note that -dev runs entirely without a data directory, so it's much easier to keep old state from affecting future clusters (it looks like the original poster might have had a stale Raft peer).
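
For anyone hitting the stale-peer variant with an on-disk data directory, a quick check (assuming the data_dir from the config above, and a Consul version of that era that still keeps its peer set in raft/peers.json) would be something like:

cat /var/consul/raft/peers.json

If that lists an address other than the agent's own advertise address, the single server may keep trying to reach a peer that no longer exists and fail to hold stable leadership.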