hashicorp / docker-consul

Official Docker images for Consul.

Consul Swarm mode network configuration #66

Open seafoodbuffet opened 7 years ago

seafoodbuffet commented 7 years ago

I'm using the official consul docker image version 0.9.3 with the following compose file:

version: '3'

services:
  consul:
    image: consul:0.9.3
    environment:
      CONSUL_LOCAL_CONFIG: '{"skip_leave_on_interrupt": true}'
      CONSUL_BIND_INTERFACE: eth0
    command: agent -ui -data-dir /consul/data -server -bootstrap-expect 3 -client 0.0.0.0 -log-level debug -retry-join 10.10.0.3 -retry-join 10.10.0.4 -retry-join 10.10.0.5
    deploy:
      mode: replicated
      replicas: 3
    networks:
      - consul
    ports:
      - "8500:8500"
    volumes:
      - consul_data:/consul/data

networks:
  consul:
    driver: overlay
    ipam:
      driver: default
      config:
        - subnet: 10.10.0.0/24

volumes:
  consul_data:

This appears to work okay for me in a Docker Swarm with 3 nodes when deploying this as a stack using docker stack deploy

My question is this: without the -retry-joins the cluster can't bootstrap. Per the bootstrap documentation I believe this is expected to prevent split-brain, etc. So what's the best way to bootstrap a consul cluster running as a Docker Swarm Service?

I am only able to make the cluster bootstrap after having added the -retry-join statements to the container command. It seems non-ideal to have to specify these hard-coded IPs. For example, what if another container started up in the consul network first? Presumably it would interfere with the resulting IPs of the consul server containers.

Is there a recommendation on how to deal with this? The only other thing I can think of would be the following (a rough sketch is included after the list):

  1. Deploy the consul servers via docker stack deploy
  2. At this point, all the nodes of the service are running but the cluster isn't bootstrapped
  3. Now figure out the IPs of each of the containers running consul server
  4. Manually issue consul join commands to one of the nodes, supplying the IPs of the other containers
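
For illustration, a rough sketch of what that manual approach might look like (the stack name, filters, and placeholder container IDs/IPs are assumptions, not tested commands):

# 1-2. deploy the stack; the tasks come up but the cluster is not yet bootstrapped
docker stack deploy -c docker-compose.yml consul
# 3. on each swarm node, find the local consul task and its overlay-network IP
docker ps --filter name=consul_consul --format '{{.ID}} {{.Names}}'
docker inspect --format '{{range .NetworkSettings.Networks}}{{.IPAddress}} {{end}}' <container-id>
# 4. from any one task, join it to the others by IP
docker exec -it <container-id> consul join <ip-of-second-server> <ip-of-third-server>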
isuftin commented 7 years ago

@seafoodbuffet Here's what I am doing:

Server config:

{
  "advertise_addr" : "{{ GetInterfaceIP \"eth2\" }}",
  "addresses" : {
    "https" : "0.0.0.0"
  },
  "bind_addr": "{{ GetInterfaceIP \"eth2\" }}",
  "check_update_interval": "1m",
  "client_addr": "0.0.0.0",
  "data_dir": "/tmp/consul",
  "datacenter": "docker_dc",
  "disable_host_node_id" : true,
  "disable_remote_exec": true,
  "disable_update_check": true,
  "ca_file": "/run/secrets/consul_ca_file.cer",
  "cert_file": "/run/secrets/consul_cert_file.cer",
  "key_file": "/run/secrets/consul_key_file.key",
  "verify_outgoing" : true,
  "verify_incoming_https" : false,
  "verify_incoming_rpc" : true,
  "verify_server_hostname" : true,
  "encrypt_verify_incoming" : true,
  "encrypt_verify_outgoing" : true,
  "http_config": {
    "response_headers": {
      "Access-Control-Allow-Origin": "*"
    }
  },
  "leave_on_terminate" : true,
  "retry_interval" : "10s",
  "retry_join" : [
    "server.consul.swarm.container:8301",
    "server.consul.swarm.container:8301",
    "server.consul.swarm.container:8301"
  ],
  "server_name" : "server.docker_dc.consul",
  "skip_leave_on_interrupt" : true,
  "bootstrap_expect": 3,
  "node_meta": {
      "instance_type": "Docker container"
  },
  "ports" : {
    "https" : 8700
  },
  "server" : true,
  "ui" : true
}

Compose config (this is part of a HashiCorp Vault setup, so you'll notice some of that here, as well as references to a consul agent whose config I didn't include):

---
version: '3.3'

configs:
  consul_server_config:
    file: ./consul/data/server_config.json
  consul_agent_config:
    file: ./consul/data/agent_config.json
  common_config:
    file: ./consul/data/common.json

secrets:
  consul_ca_file.cer:
    file: ./consul/data/certificates/consul-root.cer
  consul_cert_file.cer:
    file: ./consul/data/certificates/consul-server.cer
  consul_key_file.key:
    file: ./consul/data/certificates/consul-server.key
  consul_common_secrets_config.json:
    file: ./consul/data/common_secrets_config.json
  consul_server_secrets_config.json:
    file: ./consul/data/server_secrets_config.json
  consul_agent_secrets_config.json:
    file: ./consul/data/agent_secrets_config.json

networks:
  vault-network:

services:
  consul_server:
    image: consul:0.9.3
    networks:
      vault-network:
        aliases:
          - server.consul.swarm.container
    command: "consul agent -config-dir=/data/config -config-file=/run/secrets/consul_server_secrets_config.json -config-file=/run/secrets/consul_common_secrets_config.json"
    ports:
      - "8700:8700"
    deploy:
      mode: replicated
      replicas: 3
      update_config:
        parallelism: 1
        failure_action: pause
        delay: 10s
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      placement:
        constraints:
          - node.role == worker
    configs:
      - source: common_config
        target: /data/config/node_swarm_config.json
      - source: consul_server_config
        target: /data/config/config.json
    secrets:
      - consul_ca_file.cer
      - consul_cert_file.cer
      - consul_key_file.key
      - consul_common_secrets_config.json
      - consul_server_secrets_config.json

Then I can simply run docker stack deploy -c docker-compose.yml consul.

I think the trick here is to have the same well-known address in retry_join. It will launch 3 nodes and join them up as expected. Afterwards, I can also scale via docker service scale consul_consul_server=x, x being however many I want, but I usually increment by 1 or 2 so that new nodes don't just find each other and form their own separate cluster.
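
For example, scaling up and then checking membership might look like this (the local docker ps lookup of a server task is an assumption):

# scale the service up by one or two at a time
docker service scale consul_consul_server=5
# then check membership from any locally running server task
docker exec -it $(docker ps -qf name=consul_consul_server | head -n 1) consul members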

Does this make sense to you? I was also running into the same split-brain issue and I didn't want to spend the energy creating a seeding swarm service as that also requires maintenance in terms of bringing it up, bringing it down when a quorum is reached and then re-joining it to the swarm.

Output:

vault_consul_server.3.vuuuvv2m330f@worker1    | ==> WARNING: Expect Mode enabled, expecting 3 servers
vault_consul_server.3.vuuuvv2m330f@worker1    | ==> Starting Consul agent...
vault_consul_server.3.vuuuvv2m330f@worker1    | ==> Consul agent running!
vault_consul_server.1.83wl0xdzi4jz@worker3    | ==> WARNING: Expect Mode enabled, expecting 3 servers
vault_consul_server.1.83wl0xdzi4jz@worker3    | ==> Starting Consul agent...
vault_consul_server.1.83wl0xdzi4jz@worker3    | ==> Consul agent running!
vault_consul_server.1.83wl0xdzi4jz@worker3    |            Version: 'v0.9.3'
vault_consul_server.1.83wl0xdzi4jz@worker3    |            Node ID: '631e735d-2053-1b7d-f7ce-09da8e24ff84'
vault_consul_server.1.83wl0xdzi4jz@worker3    |          Node name: '103b1c84548b'
vault_consul_server.1.83wl0xdzi4jz@worker3    |         Datacenter: 'docker_dc' (Segment: '<all>')
vault_consul_server.1.83wl0xdzi4jz@worker3    |             Server: true (Bootstrap: false)
vault_consul_server.1.83wl0xdzi4jz@worker3    |        Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: 8700, DNS: 8600)
vault_consul_server.1.83wl0xdzi4jz@worker3    |       Cluster Addr: 10.0.0.7 (LAN: 8301, WAN: 8302)
vault_consul_server.3.vuuuvv2m330f@worker1    |            Version: 'v0.9.3'
vault_consul_server.3.vuuuvv2m330f@worker1    |            Node ID: '20428004-715f-e93d-8dc2-3a9437d521c4'
vault_consul_server.3.vuuuvv2m330f@worker1    |          Node name: 'da1511907468'
vault_consul_server.3.vuuuvv2m330f@worker1    |         Datacenter: 'docker_dc' (Segment: '<all>')
vault_consul_server.3.vuuuvv2m330f@worker1    |             Server: true (Bootstrap: false)
vault_consul_server.3.vuuuvv2m330f@worker1    |        Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: 8700, DNS: 8600)
vault_consul_server.3.vuuuvv2m330f@worker1    |       Cluster Addr: 10.0.0.9 (LAN: 8301, WAN: 8302)
vault_consul_server.3.vuuuvv2m330f@worker1    |            Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false
vault_consul_server.3.vuuuvv2m330f@worker1    |
vault_consul_server.3.vuuuvv2m330f@worker1    | ==> Log data will now stream in as it occurs:
vault_consul_server.3.vuuuvv2m330f@worker1    |
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] raft: Initial configuration (index=0): []
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: da1511907468.docker_dc 10.0.0.9
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: da1511907468 10.0.0.9
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] raft: Node at 10.0.0.9:8300 [Follower] entering Follower state (Leader: "")
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server da1511907468 (Addr: tcp/10.0.0.9:8300) (DC: docker_dc)
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "da1511907468.docker_dc" in area "wan"
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] agent: Started HTTP server on [::]:8500
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] agent: Started HTTPS server on [::]:8700
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] agent: Joining LAN cluster...
vault_consul_server.1.83wl0xdzi4jz@worker3    |            Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false
vault_consul_server.1.83wl0xdzi4jz@worker3    |
vault_consul_server.1.83wl0xdzi4jz@worker3    | ==> Log data will now stream in as it occurs:
vault_consul_server.1.83wl0xdzi4jz@worker3    |
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] raft: Initial configuration (index=0): []
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: 103b1c84548b.docker_dc 10.0.0.7
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: 103b1c84548b 10.0.0.7
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] raft: Node at 10.0.0.7:8300 [Follower] entering Follower state (Leader: "")
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server 103b1c84548b (Addr: tcp/10.0.0.7:8300) (DC: docker_dc)
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "103b1c84548b.docker_dc" in area "wan"
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] agent: Started HTTP server on [::]:8500
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] agent: Started HTTPS server on [::]:8700
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] agent: Joining LAN cluster...
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] agent: (LAN) joining: [server.consul.swarm.container:8301 server.consul.swarm.container:8301 server.consul.swarm.container:8301]
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: da1511907468 10.0.0.9
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server da1511907468 (Addr: tcp/10.0.0.9:8300) (DC: docker_dc)
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: da1511907468.docker_dc 10.0.0.9
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "da1511907468.docker_dc" in area "wan"
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: df206dff9a80 10.0.0.8
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server df206dff9a80 (Addr: tcp/10.0.0.8:8300) (DC: docker_dc)
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] consul: Found expected number of peers, attempting bootstrap: 10.0.0.7:8300,10.0.0.9:8300,10.0.0.8:8300
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: df206dff9a80.docker_dc 10.0.0.8
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "df206dff9a80.docker_dc" in area "wan"
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:04 [INFO] agent: (LAN) joined: 3 Err: <nil>
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:04 [INFO] agent: Join LAN completed. Synced with 3 initial agents
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] agent: (LAN) joining: [server.consul.swarm.container:8301 server.consul.swarm.container:8301 server.consul.swarm.container:8301]
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: 103b1c84548b 10.0.0.7
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server 103b1c84548b (Addr: tcp/10.0.0.7:8300) (DC: docker_dc)
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: 103b1c84548b.docker_dc 10.0.0.7
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "103b1c84548b.docker_dc" in area "wan"
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: df206dff9a80 10.0.0.8
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server df206dff9a80 (Addr: tcp/10.0.0.8:8300) (DC: docker_dc)
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] consul: Existing Raft peers reported by 103b1c84548b, disabling bootstrap mode
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: df206dff9a80.docker_dc 10.0.0.8
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "df206dff9a80.docker_dc" in area "wan"
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:04 [INFO] agent: (LAN) joined: 3 Err: <nil>
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:04 [INFO] agent: Join LAN completed. Synced with 3 initial agents
vault_consul_server.2.e2xxgzxjyyt6@worker2    | ==> WARNING: Expect Mode enabled, expecting 3 servers
vault_consul_server.2.e2xxgzxjyyt6@worker2    | ==> Starting Consul agent...
vault_consul_server.2.e2xxgzxjyyt6@worker2    | ==> Consul agent running!
vault_consul_server.2.e2xxgzxjyyt6@worker2    |            Version: 'v0.9.3'
vault_consul_server.2.e2xxgzxjyyt6@worker2    |            Node ID: '35b4a24d-8c94-55b5-df64-a3e38aca4d4b'
vault_consul_server.2.e2xxgzxjyyt6@worker2    |          Node name: 'df206dff9a80'
vault_consul_server.2.e2xxgzxjyyt6@worker2    |         Datacenter: 'docker_dc' (Segment: '<all>')
vault_consul_server.2.e2xxgzxjyyt6@worker2    |             Server: true (Bootstrap: false)
vault_consul_server.2.e2xxgzxjyyt6@worker2    |        Client Addr: 0.0.0.0 (HTTP: 8500, HTTPS: 8700, DNS: 8600)
vault_consul_server.2.e2xxgzxjyyt6@worker2    |       Cluster Addr: 10.0.0.8 (LAN: 8301, WAN: 8302)
vault_consul_server.2.e2xxgzxjyyt6@worker2    |            Encrypt: Gossip: true, TLS-Outgoing: true, TLS-Incoming: false
vault_consul_server.2.e2xxgzxjyyt6@worker2    |
vault_consul_server.2.e2xxgzxjyyt6@worker2    | ==> Log data will now stream in as it occurs:
vault_consul_server.2.e2xxgzxjyyt6@worker2    |
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] raft: Initial configuration (index=0): []
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: df206dff9a80.docker_dc 10.0.0.8
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: df206dff9a80 10.0.0.8
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] raft: Node at 10.0.0.8:8300 [Follower] entering Follower state (Leader: "")
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server df206dff9a80 (Addr: tcp/10.0.0.8:8300) (DC: docker_dc)
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "df206dff9a80.docker_dc" in area "wan"
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] agent: Started HTTP server on [::]:8500
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] agent: Started HTTPS server on [::]:8700
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] agent: Joining LAN cluster...
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] agent: (LAN) joining: [server.consul.swarm.container:8301 server.consul.swarm.container:8301 server.consul.swarm.container:8301]
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: da1511907468 10.0.0.9
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: 103b1c84548b 10.0.0.7
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server da1511907468 (Addr: tcp/10.0.0.9:8300) (DC: docker_dc)
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: 103b1c84548b.docker_dc 10.0.0.7
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] serf: EventMemberJoin: da1511907468.docker_dc 10.0.0.9
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "103b1c84548b.docker_dc" in area "wan"
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] consul: Handled member-join event for server "da1511907468.docker_dc" in area "wan"
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] consul: Existing Raft peers reported by 103b1c84548b, disabling bootstrap mode
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:03 [INFO] consul: Adding LAN server 103b1c84548b (Addr: tcp/10.0.0.7:8300) (DC: docker_dc)
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:04 [INFO] agent: (LAN) joined: 3 Err: <nil>
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:04 [INFO] agent: Join LAN completed. Synced with 3 initial agents
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:10 [ERR] agent: failed to sync remote state: No cluster leader
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:10 [ERR] agent: failed to sync remote state: No cluster leader
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:10 [ERR] agent: failed to sync remote state: No cluster leader
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:11 [WARN] raft: no known peers, aborting election
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [WARN] raft: Heartbeat timeout from "" reached, starting election
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] raft: Node at 10.0.0.7:8300 [Candidate] entering Candidate state in term 2
vault_consul_server.2.e2xxgzxjyyt6@worker2    | 2017/10/13 16:09:11 [DEBUG] raft-net: 10.0.0.8:8300 accepted connection from: 10.0.0.7:43725
vault_consul_server.3.vuuuvv2m330f@worker1    | 2017/10/13 16:09:11 [DEBUG] raft-net: 10.0.0.9:8300 accepted connection from: 10.0.0.7:45257
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] raft: Election won. Tally: 2
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] raft: Node at 10.0.0.7:8300 [Leader] entering Leader state
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] raft: Added peer 10.0.0.9:8300, starting replication
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] raft: Added peer 10.0.0.8:8300, starting replication
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:11 [WARN] raft: Failed to get previous log: 1 log not found (last: 0)
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] consul: cluster leadership acquired
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] consul: New leader elected: 103b1c84548b
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [WARN] raft: AppendEntries to {Voter 10.0.0.8:8300 10.0.0.8:8300} rejected, sending older logs (next: 1)
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] raft: pipelining replication to peer {Voter 10.0.0.8:8300 10.0.0.8:8300}
vault_consul_server.3.vuuuvv2m330f@worker1    | 2017/10/13 16:09:11 [DEBUG] raft-net: 10.0.0.9:8300 accepted connection from: 10.0.0.7:57611
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [WARN] raft: AppendEntries to {Voter 10.0.0.9:8300 10.0.0.9:8300} rejected, sending older logs (next: 1)
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:11 [WARN] raft: Failed to get previous log: 1 log not found (last: 0)
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] consul: member '103b1c84548b' joined, marking health alive
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] raft: pipelining replication to peer {Voter 10.0.0.9:8300 10.0.0.9:8300}
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] consul: member 'da1511907468' joined, marking health alive
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:11 [INFO] consul: member 'df206dff9a80' joined, marking health alive
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:11 [INFO] consul: New leader elected: 103b1c84548b
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:11 [INFO] consul: New leader elected: 103b1c84548b
vault_consul_server.2.e2xxgzxjyyt6@worker2    | 2017/10/13 16:09:12 [DEBUG] raft-net: 10.0.0.8:8300 accepted connection from: 10.0.0.7:55077
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:12 [INFO] serf: EventMemberJoin: 281bbc1ecffc 10.0.0.5
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:12 [INFO] serf: EventMemberJoin: 281bbc1ecffc 10.0.0.5
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:12 [INFO] consul: member '281bbc1ecffc' joined, marking health alive
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:12 [INFO] serf: EventMemberJoin: 281bbc1ecffc 10.0.0.5
vault_consul_server.2.e2xxgzxjyyt6@worker2    |     2017/10/13 16:09:12 [INFO] agent: Synced node info
vault_consul_server.3.vuuuvv2m330f@worker1    |     2017/10/13 16:09:12 [INFO] agent: Synced node info
vault_consul_server.1.83wl0xdzi4jz@worker3    |     2017/10/13 16:09:14 [INFO] agent: Synced node info
askulkarni2 commented 6 years ago

I am running into this issue as well. My service configuration is identical to @seafoodbuffet's compose above.

@isuftin what is the reason for specifying the same retry-join value three times?

...
"retry_join" : [
    "server.consul.swarm.container:8301",
    "server.consul.swarm.container:8301",
    "server.consul.swarm.container:8301"
  ],
...
alexrun commented 6 years ago

@seafoodbuffet I tried your solution with the latest version and the Consul cluster wasn't able to elect a leader.

==> Found address '10.10.0.5' for interface 'eth0', setting bind option...
bootstrap_expect > 0: expecting 3 servers
==> Starting Consul agent...
==> Consul agent running!
           Version: 'v1.0.0'
           Node ID: 'fee384d6-3359-bef1-f82d-b108807d58bf'
         Node name: '77d163627a13'
        Datacenter: 'dc1' (Segment: '<all>')
            Server: true (Bootstrap: false)
       Client Addr: [0.0.0.0] (HTTP: 8500, HTTPS: -1, DNS: 8600)
      Cluster Addr: 10.10.0.5 (LAN: 8301, WAN: 8302)
           Encrypt: Gossip: false, TLS-Outgoing: false, TLS-Incoming: false

==> Log data will now stream in as it occurs:

    2017/11/16 01:11:51 [INFO] raft: Initial configuration (index=1): [{Suffrage:Voter ID:10.10.0.4:8300 Address:10.10.0.4:8300} {Suffrage:Voter ID:10.10.0.3:8300 Address:10.10.0.3:8300} {Suffrage:Voter ID:10.10.0.5:8300 Address:10.10.0.5:8300}]
    2017/11/16 01:11:51 [INFO] raft: Node at 10.10.0.5:8300 [Follower] entering Follower state (Leader: "")
    2017/11/16 01:11:51 [INFO] serf: EventMemberJoin: 77d163627a13.dc1 10.10.0.5
    2017/11/16 01:11:51 [INFO] serf: EventMemberJoin: 77d163627a13 10.10.0.5
    2017/11/16 01:11:51 [INFO] serf: Attempting re-join to previously known node: a0b6f8818f0f.dc1: 10.10.0.3:8302
    2017/11/16 01:11:51 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.0.3:8302
    2017/11/16 01:11:51 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2017/11/16 01:11:51 [INFO] consul: Handled member-join event for server "77d163627a13.dc1" in area "wan"
    2017/11/16 01:11:51 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2017/11/16 01:11:51 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
    2017/11/16 01:11:51 [INFO] serf: Attempting re-join to previously known node: bd60e4b48df1: 10.10.0.5:8301
    2017/11/16 01:11:51 [INFO] consul: Adding LAN server 77d163627a13 (Addr: tcp/10.10.0.5:8300) (DC: dc1)
    2017/11/16 01:11:51 [INFO] consul: Raft data found, disabling bootstrap mode
    2017/11/16 01:11:51 [INFO] agent: Retry join LAN is supported for: aws azure gce softlayer
    2017/11/16 01:11:51 [INFO] agent: Joining LAN cluster...
    2017/11/16 01:11:51 [INFO] agent: (LAN) joining: [10.10.0.3 10.10.0.4 10.10.0.5]
    2017/11/16 01:11:51 [DEBUG] memberlist: Stream connection from=10.10.0.5:36142
    2017/11/16 01:11:51 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.0.5:8301
    2017/11/16 01:11:51 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.0.3:8301
    2017/11/16 01:11:51 [INFO] serf: Re-joined to previously known node: bd60e4b48df1: 10.10.0.5:8301
    2017/11/16 01:11:51 [INFO] serf: EventMemberJoin: 39f0a8e68863.dc1 10.10.0.3
    2017/11/16 01:11:51 [INFO] serf: Re-joined to previously known node: a0b6f8818f0f.dc1: 10.10.0.3:8302
    2017/11/16 01:11:51 [INFO] consul: Handled member-join event for server "39f0a8e68863.dc1" in area "wan"
    2017/11/16 01:11:51 [INFO] serf: EventMemberJoin: 39f0a8e68863 10.10.0.3
    2017/11/16 01:11:51 [INFO] consul: Adding LAN server 39f0a8e68863 (Addr: tcp/10.10.0.3:8300) (DC: dc1)
    2017/11/16 01:11:51 [DEBUG] memberlist: Failed to join 10.10.0.4: dial tcp 10.10.0.4:8301: getsockopt: connection refused
    2017/11/16 01:11:51 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.0.5:8301
    2017/11/16 01:11:51 [DEBUG] memberlist: Stream connection from=10.10.0.5:36148
    2017/11/16 01:11:51 [INFO] agent: (LAN) joined: 2 Err: <nil>
    2017/11/16 01:11:51 [DEBUG] agent: systemd notify failed: No socket
    2017/11/16 01:11:51 [INFO] agent: Join LAN completed. Synced with 2 initial agents
    2017/11/16 01:11:51 [DEBUG] memberlist: Stream connection from=10.10.0.4:55066
    2017/11/16 01:11:51 [INFO] serf: EventMemberJoin: 9bb130fdd405.dc1 10.10.0.4
    2017/11/16 01:11:51 [INFO] consul: Handled member-join event for server "9bb130fdd405.dc1" in area "wan"
    2017/11/16 01:11:51 [DEBUG] memberlist: Stream connection from=10.10.0.4:47810
    2017/11/16 01:11:51 [INFO] serf: EventMemberJoin: 9bb130fdd405 10.10.0.4
    2017/11/16 01:11:51 [INFO] consul: Adding LAN server 9bb130fdd405 (Addr: tcp/10.10.0.4:8300) (DC: dc1)
    2017/11/16 01:11:51 [DEBUG] serf: messageJoinType: 39f0a8e68863
    2017/11/16 01:11:51 [DEBUG] serf: messageJoinType: 9bb130fdd405
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 9bb130fdd405
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 77d163627a13
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 39f0a8e68863
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 9bb130fdd405
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 39f0a8e68863.dc1
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 39f0a8e68863.dc1
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 77d163627a13
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 9bb130fdd405
    2017/11/16 01:11:52 [DEBUG] serf: messageJoinType: 39f0a8e68863.dc1
2017/11/16 01:11:57 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.3:59918
    2017/11/16 01:11:57 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:11:57 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:11:57 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:11:57 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:11:57 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:11:57 [DEBUG] serf: messageUserEventType: consul:new-leader
2017/11/16 01:12:02 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.3:49868
    2017/11/16 01:12:02 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:02 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:12:02 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:02 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:02 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:02 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:02 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:04 [ERR] agent: failed to sync remote state: rpc error making call: No cluster leader
2017/11/16 01:12:12 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.3:54073
2017/11/16 01:12:12 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.3:51419
    2017/11/16 01:12:12 [WARN] raft: Rejecting vote request from 10.10.0.3:8300 since our last term is greater (178, 177)
2017/11/16 01:12:12 [ERR] raft-net: Failed to flush response: write tcp 10.10.0.5:8300->10.10.0.3:51419: write: connection reset by peer
    2017/11/16 01:12:12 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:12 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:12:12 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:12 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:12 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:18 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:18 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:18 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:12:18 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:18 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:18 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:20 [ERR] agent: failed to sync remote state: rpc error making call: No cluster leader
2017/11/16 01:12:24 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.4:57354
    2017/11/16 01:12:24 [WARN] raft: Rejecting vote request from 10.10.0.4:8300 since we have a leader: 10.10.0.3:8300
    2017/11/16 01:12:24 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:24 [INFO] consul: New leader elected: 9bb130fdd405
    2017/11/16 01:12:24 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:24 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:30 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.0.3:8301
2017/11/16 01:12:30 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.4:51019
    2017/11/16 01:12:30 [WARN] raft: Failed to get previous log: 275 log not found (last: 273)
    2017/11/16 01:12:30 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:30 [INFO] consul: New leader elected: 9bb130fdd405
    2017/11/16 01:12:30 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:30 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:30 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:31 [ERR] agent: Coordinate update error: rpc error making call: rpc error making call: No cluster leader
    2017/11/16 01:12:37 [WARN] raft: Rejecting vote request from 10.10.0.3:8300 since we have a leader: 10.10.0.4:8300
2017/11/16 01:12:37 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.3:40088
    2017/11/16 01:12:37 [WARN] raft: Failed to get previous log: 276 log not found (last: 275)
    2017/11/16 01:12:37 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:37 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:12:37 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:37 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:37 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:37 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:44 [ERR] agent: failed to sync remote state: rpc error making call: rpc error making call: No cluster leader
    2017/11/16 01:12:44 [WARN] raft: Failed to get previous log: 277 log not found (last: 276)
    2017/11/16 01:12:44 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:44 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:12:44 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:45 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:45 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:45 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:54 [WARN] raft: Failed to get previous log: 278 log not found (last: 277)
    2017/11/16 01:12:54 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:54 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:12:54 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:54 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:12:54 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:00 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.0.3:8301
    2017/11/16 01:13:01 [ERR] agent: failed to sync remote state: rpc error making call: No cluster leader
    2017/11/16 01:13:01 [WARN] raft: Failed to get previous log: 280 log not found (last: 278)
2017/11/16 01:13:01 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.3:43950
    2017/11/16 01:13:01 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:01 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:13:01 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:01 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:01 [ERR] agent: Coordinate update error: rpc error making call: No cluster leader
    2017/11/16 01:13:06 [DEBUG] memberlist: Initiating push/pull sync with: 10.10.0.3:8302
    2017/11/16 01:13:07 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:07 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:13:07 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:07 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:07 [DEBUG] serf: messageUserEventType: consul:new-leader
2017/11/16 01:13:12 [DEBUG] raft-net: 10.10.0.5:8300 accepted connection from: 10.10.0.3:42619
    2017/11/16 01:13:12 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:12 [INFO] consul: New leader elected: 39f0a8e68863
    2017/11/16 01:13:12 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:12 [DEBUG] serf: messageUserEventType: consul:new-leader
    2017/11/16 01:13:14 [ERR] http: Request GET /v1/coordinate/nodes?dc=dc1&token=<hidden>, error: rpc error making call: No cluster leader from=10.255.0.3:27515
    2017/11/16 01:13:14 [DEBUG] http: Request GET /v1/coordinate/nodes?dc=dc1&token=<hidden> (14.799510623s) from=10.255.0.3:27515


So I tried it myself and got a working, testable compose file inside Play with Docker.

version: "3.4"

networks:
  consul:
    driver: overlay
    ipam:
      config:
      - subnet: 172.20.0.0/24 

services:
  consul:
    image: consul:${CONSUL_VERSION:-latest}
    networks:
      - consul
    ports:
      - "8500:8500"
    volumes:
      - /var/lib/consul/data:/consul/data
    environment:
        - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, "acl_datacenter":"cpg-dc", "acl_default_policy":"deny", "acl_down_policy":"extend-cache", "datacenter":"cpg-dc", "data_dir":"/consul/data", "server":true }'
        - CONSUL_BIND_INTERFACE=eth0
    command: agent -ui -data-dir /consul/data -server -bootstrap-expect 5 -client 0.0.0.0 -log-level debug -retry-join 172.20.0.3 -retry-join 172.20.0.4 -retry-join 172.20.0.5 -retry-join 172.20.0.6 -retry-join 172.20.0.7
    deploy:
        replicas: 5
        placement:
            constraints: [node.role == manager]
        resources:
          limits:
            cpus: '0.50'
            memory: 1024M
          reservations:
            cpus: '0.50'
            memory: 128M
        restart_policy:
            condition: on-failure
            delay: 5s
            max_attempts: 3
            window: 120s
        update_config:
            parallelism: 1
            delay: 10s
            failure_action: continue

But it doesn't solve the fixed-IP problem you mentioned before.

gaba-xyz commented 6 years ago

So does someone have a robust Consul compose file that can handle containers potentially being rescheduled to a different host? Because we're also running into this problem of fixed IPs.

bhavikkumar commented 6 years ago

This configuration seems to work for me to spin up consul servers.

---
version: '3.3'

networks:
  consul-network:

services:
  server:
    image: consul:latest
    networks:
      consul-network:
        aliases:
          - consul.server
    command: "consul agent -config-file /consul/config/config.json"
    ports:
      - target: 8500
        published: 8500
        mode: host
    volumes:
      - /opt/consul:/consul/config
    deploy:
      mode: replicated
      replicas: 3
      endpoint_mode: dnsrr
      update_config:
        parallelism: 1
        failure_action: rollback
        delay: 30s
      restart_policy:
        condition: any
        delay: 5s
        window: 120s
      placement:
        constraints:
          - node.role == manager

Consul Configuration

{
  "advertise_addr" : "{{ GetInterfaceIP \"eth0\" }}",
  "bind_addr": "{{ GetInterfaceIP \"eth0\" }}",
  "client_addr": "0.0.0.0",
  "data_dir": "/consul/data",
  "datacenter": "us-west-2",
  "leave_on_terminate" : true,
  "retry_join" : [
    "consul.server"
  ],
  "server_name" : "server.us-west-2.consul",
  "skip_leave_on_interrupt" : true,
  "bootstrap_expect": 3,
  "server" : true,
  "ui" : true,
  "autopilot": {
    "cleanup_dead_servers": true
  },
  "disable_update_check": true
}
gaba-xyz commented 6 years ago

@bhavikkumar does it handle re-elections if the leader node is lost? That has been the major issue I have had with almost all of these configurations

bhavikkumar commented 6 years ago

@Gabology it seems to work fine during my testing. I terminated the leader EC2 instance, which caused a graceful exit. When the ASG brought up another instance and it joined the swarm, the consul server container joined without any issues.

I also then ran docker rm -f <container id> on the leader to cause an unexpected termination, and the logs output the following; unfortunately I did not have the log level set high enough. So I called the /v1/status/leader API, which showed the leader changing from 10.0.0.4 to 10.0.0.8.
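
For reference, that leader check is just a GET against the status endpoint (the address is a placeholder for wherever the HTTP API is reachable):

curl http://127.0.0.1:8500/v1/status/leader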

    2017/11/26 09:47:14 [WARN] raft: Heartbeat timeout from "10.0.0.4:8300" reached, starting election
    2017/11/26 09:47:14 [WARN] raft: AppendEntries to {Voter 70b402a2-f59f-fb65-af38-9371f7a67c7a 10.0.0.4:8300} rejected, sending older logs (next: 1)
    2017/11/26 09:47:16 [WARN] consul: error getting server health from "7d321cd52b1f": last request still outstanding
    2017/11/26 09:47:16 [WARN] raft: AppendEntries to {Nonvoter ae5ea9fa-bc12-0ce4-6e99-b0714996e0d4 10.0.0.4:8300} rejected, sending older logs (next: 1689)
    2017/11/26 09:47:16 [ERR] memberlist: Failed fallback ping: write tcp 10.0.0.8:49736->10.0.0.4:8301: i/o timeout
    2017/11/26 09:47:17 [ERR] memberlist: Failed fallback ping: EOF
gaba-xyz commented 6 years ago

@bhavikkumar Still losing quorum when using your setup. It seems it's not cleaning up the dead servers after they've been terminated because I'm seeing a lot of log entries like this after the leader is killed:

consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    |     2017/12/20 09:54:40 [WARN] raft: Election timeout reached, restarting election
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    |     2017/12/20 09:54:40 [INFO] raft: Node at 10.0.8.29:8300 [Candidate] entering Candidate state in term 344
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    | 2017/12/20 09:54:40 [WARN] Unable to get address for server id 31e75198-70a1-1907-daad-8938506f8ad5, using fallback address 10.0.8.17:8300: Could not find address for server id 31e75198-70a1-1907-daad-8938506f8ad5
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    | 2017/12/20 09:54:40 [WARN] Unable to get address for server id 3a24c82d-6c48-2aed-1bcc-3f74d9d19f30, using fallback address 10.0.8.7:8300: Could not find address for server id 3a24c82d-6c48-2aed-1bcc-3f74d9d19f30
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    |     2017/12/20 09:54:40 [ERR] raft: Failed to make RequestVote RPC to {Voter 3a24c82d-6c48-2aed-1bcc-3f74d9d19f30 10.0.8.7:8300}: dial tcp 10.0.8.29:0->10.0.8.7:8300: getsockopt: no route to host
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    |     2017/12/20 09:54:40 [WARN] consul: error getting server health from "bfcefdee6fd1": rpc error getting client: failed to get conn: dial tcp 10.0.8.29:0->10.0.8.7:8300: getsockopt: no route to host
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    |     2017/12/20 09:54:41 [WARN] consul: error getting server health from "bfcefdee6fd1": context deadline exceeded
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    |     2017/12/20 09:54:42 [WARN] consul: error getting server health from "bfcefdee6fd1": last request still outstanding
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    |     2017/12/20 09:54:43 [ERR] raft: Failed to make RequestVote RPC to {Voter 31e75198-70a1-1907-daad-8938506f8ad5 10.0.8.17:8300}: dial tcp 10.0.8.29:0->10.0.8.17:8300: getsockopt: no route to host
consul_cluster.3.13zb0cxj8cqv@swarm-worker-ppx2    |     2017/12/20 09:54:43 [WARN] consul: error getting server health from "bfcefdee6fd1": rpc error getting client: failed to get conn: dial tcp 10.0.8.29:0->10.0.8.7:8300: getsockopt: no route to host
soakes commented 6 years ago

I am also looking for a good configuration that handles recovery when quorum is lost. I have a sort-of working solution, but it involves hard-coding the hostnames, which isn't ideal. I would like to just set a number of replicas and have it sort itself out. Does anyone have any suggestions other than what's listed here?

Here are my configs below:

---
version: "3.4"

networks:
  consul:
    external: true

services:
  manage01:
    image: consul:1.0.2
    networks:
      - consul
    volumes:
      - /var/lib/consul/data:/consul/data
    environment:
        - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, "acl_datacenter":"mydc", "acl_default_policy":"allow", "acl_down_policy":"extend-cache", "datacenter":"mydc", "encrypt":"GEph2hPCk9GFM39iE2MiLA==", "data_dir":"/consul/data", "server":true }'
        - CONSUL_BIND_INTERFACE=eth0
    command: agent -ui -data-dir /consul/data -server -bootstrap -client 0.0.0.0 -retry-join manage02 -retry-join manage03
    deploy:
        placement:
            constraints: [node.hostname == manage01]
        resources:
          limits:
            cpus: '0.50'
            memory: 1024M
          reservations:
            cpus: '0.50'
            memory: 128M
        restart_policy:
            condition: on-failure
            delay: 5s
            max_attempts: 3
            window: 120s
        update_config:
            parallelism: 1
            delay: 10s
            failure_action: continue

  manage02:
    image: consul:1.0.2
    networks:
      - consul
    volumes:
      - /var/lib/consul/data:/consul/data
    environment:
        - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, "acl_datacenter":"mydc", "acl_default_policy":"allow", "acl_down_policy":"extend-cache", "datacenter":"mydc", "encrypt":"GEph2hPCk9GFM39iE2MiLA==", "data_dir":"/consul/data", "server":true }'
        - CONSUL_BIND_INTERFACE=eth0
    command: agent -ui -data-dir /consul/data -server -client 0.0.0.0 -retry-join manage01 -retry-join manage03
    deploy:
        placement:
            constraints: [node.hostname == manage02]
        resources:
          limits:
            cpus: '0.50'
            memory: 1024M
          reservations:
            cpus: '0.50'
            memory: 128M
        restart_policy:
            condition: on-failure
            delay: 5s
            max_attempts: 3
            window: 120s
        update_config:
            parallelism: 1
            delay: 10s
            failure_action: continue

  manage03:
    image: consul:1.0.2
    networks:
      - consul
    volumes:
      - /var/lib/consul/data:/consul/data
    environment:
        - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, "acl_datacenter":"mydc", "acl_default_policy":"allow", "acl_down_policy":"extend-cache", "datacenter":"mydc", "encrypt":"GEph2hPCk9GFM39iE2MiLA==", "data_dir":"/consul/data", "server":true }'
        - CONSUL_BIND_INTERFACE=eth0
    command: agent -ui -data-dir /consul/data -server -client 0.0.0.0 -retry-join manage01 -retry-join manage02
    deploy:
        placement:
            constraints: [node.hostname == manage03]
        resources:
          limits:
            cpus: '0.50'
            memory: 1024M
          reservations:
            cpus: '0.50'
            memory: 128M
        restart_policy:
            condition: on-failure
            delay: 5s
            max_attempts: 3
            window: 120s
        update_config:
            parallelism: 1
            delay: 10s
            failure_action: continue

This is really not ideal, but it does work. I would prefer to set replicas: X and control it that way. Can anyone think of a workaround?

TIA

bhavikkumar commented 6 years ago

@Gabology What is your setup? And how are you terminating nodes? I will try to replicate the problem and see if I can resolve it.

soakes commented 6 years ago

To make this complete: I have finally got a properly working config which I hope will assist others and save them the nasty headache I have had.

This will set up Consul on all your manager nodes but will restrict it to exactly one per manager node (which is what I wanted, as I have three managers). You can replace this with replicas without causing a problem.

This also uses local volumes for persistent storage, but you can replace them with Portworx or whatever volumes you prefer, which I will do now that this is working. The key really is the alias on the network together with dnsrr (DNS round-robin) endpoint mode, which took a while to find in the Docker docs. With this combination each node finds another node to connect to, which fixes the initial connection via DNS. I have tested this thoroughly by rebooting each node and they recover perfectly. This also has ACL support, so enjoy.
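
As a quick sanity check of the alias/dnsrr combination, the alias should resolve to every task's IP from inside any container attached to the consul network (a sketch; the container ID is a placeholder and nslookup is assumed to be available in the image):

docker exec -it <consul-container-id> nslookup consul.cluster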

HTH anyone fighting to get a reliable config.

docker network create -d overlay --opt com.docker.network.swarm.name=consul consul --subnet 172.30.22.0/28
---
version: "3.4"

networks:
  consul:
    external: true

volumes:
  consul:

services:
  server:
    image: consul:1.0.2
    volumes:
      - consul:/consul
    ports:
      - target: 8500
        published: 8500
        mode: host
    networks:
      consul:
        aliases:
          - consul.cluster
    environment:
      - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, "acl_down_policy":"allow", "acl_master_token":"********-****-****-****-********", "acl_agent_token":"********-******-****-*****-*****", "acl_datacenter":"dc1", "acl_default_policy":"deny", "datacenter":"dc1", "encrypt":"*****************==", "data_dir":"/consul/data", "server":true }'
      - CONSUL_BIND_INTERFACE=eth0
      - CONSUL_HTTP_TOKEN=*****-****-***-****-****
    command: agent -ui -data-dir /consul/data -server -client 0.0.0.0 -bootstrap-expect=3 -retry-join consul.cluster
    deploy:
      endpoint_mode: dnsrr
      mode: global
      placement:
        constraints: [node.role ==  manager]
gaba-xyz commented 6 years ago

@soakes Any particular reason that you are creating the network externally?

Anyhow, happy to say that this config works well for us as well. Just surprised about the dnsrr endpoint mode, because I thought that only mattered for clients connecting from outside the Docker network.

soakes commented 6 years ago

@Gabology Yes, there is a reason: I have several VPN connections which carry a fair few routes, and sometimes when Docker Swarm creates a network on its own it collides with a range that is used elsewhere. Sadly I can't change the other networks, as they are not under my control, so the solution is to give Docker specific ranges so it doesn't happen. Apart from that, there's no reason.

With regard to the dnsrr mode, it surprised me a bit too; it was only after testing inside a container that I figured it out.

The only thing I don't currently have added is the SSL bits, which I plan to do soon.

The config below is my full current configuration, including auto-discovery and several worker nodes configured as Consul clients. It runs Consul in client mode on anything other than manager nodes.

If anyone can think of some other useful tweaks or improvements, please let me know. Thank you.

---
version: "3.4"

networks:
  consul:
    external: true

volumes:
  consul:

services:
  server:
    image: consul:1.0.2
    volumes:
      - consul:/consul
    ports:
      - target: 8500
        published: 8500
        mode: host
    networks:
      consul:
        aliases:
          - consul.cluster
    environment:
      - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, 
      "acl_down_policy":"allow", 
      "acl_master_token":"*****-****-****-*****-*****", 
      "acl_agent_token":"****-****-****-****-***", 
      "acl_datacenter":"dc1", 
      "acl_default_policy":"deny", 
      "datacenter":"dc1", 
      "encrypt":"*****==", 
      "data_dir":"/consul/data", 
      "server":true }'
      - CONSUL_BIND_INTERFACE=eth0
      - CONSUL_HTTP_TOKEN=*****-****-****-*****-*****
    command: agent -ui -data-dir /consul/data -server -client 0.0.0.0 -bootstrap-expect=3 -retry-join consul.cluster
    deploy:
      endpoint_mode: dnsrr
      mode: global
      placement:
        constraints: [node.role ==  manager]

  client:
    image: consul:1.0.2
    volumes:
      - consul:/consul
    networks:
      consul:
        aliases:
          - consul.client.cluster
    environment:
      - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true,
      "acl_down_policy":"allow",
      "acl_master_token":"*****-*****-****-*****-*****",
      "acl_agent_token":"*****-****-****-****-******",
      "acl_datacenter":"dc1",
      "acl_default_policy":"deny",
      "datacenter":"dc1",
      "encrypt":"****==",
      "data_dir":"/consul/data" }'
      - CONSUL_BIND_INTERFACE=eth0
      - CONSUL_HTTP_TOKEN=****-****-****-***-*****
    command: agent -ui -data-dir /consul/data -client 0.0.0.0 -retry-join consul.cluster
    deploy:
      endpoint_mode: dnsrr
      mode: global
      placement:
        constraints: [node.role !=  manager]

  registrator:
    image: gliderlabs/registrator:master
    command: -internal consul://consul.cluster:8500
    volumes:
      - /var/run/docker.sock:/tmp/docker.sock
    networks:
      - consul
    environment:
      - CONSUL_HTTP_TOKEN=*******-****-*****-*****-*****    
    deploy:
      mode: global
bhavikkumar commented 6 years ago

@soakes Does the /consul/data have to be mounted? I posted a config earlier which looks extremely similar, but @Gabology could not get it to work and this is the only difference I can see.

gaba-xyz commented 6 years ago

@bhavikkumar I think the issue for me was that I hadn't set the acl_agent_token when running with acl_default_policy: deny, so autopilot failed to remove dead nodes.
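
For anyone else hitting that: the agent token can be set in the config (acl_agent_token, as in the configs above) or pushed to a running agent over the HTTP API; a sketch of the latter, with placeholder address and tokens:

curl --request PUT \
  --header "X-Consul-Token: <acl_master_token>" \
  --data '{"Token": "<acl_agent_token>"}' \
  http://127.0.0.1:8500/v1/agent/token/acl_agent_token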

soakes commented 6 years ago

@bhavikkumar I haven't tried without mounting because IMO you want persistent data; however, I can't think of a reason why it won't work without it, as long as the config with the ACL keys etc. is present. It's also very similar to others posted here because I was testing all the configs in this thread trying to find a good solution, so I kept the good parts I found and added/removed bits to get it right.

soakes commented 6 years ago

Sorry to bother everyone, but while doing some testing I seem to have either hit a bug or missed something.

I am having a problem where the DNS results are blank, so after looking further I found the API commands to look things up, but those are also coming back blank and I have no idea why. What's interesting is that it can see the services as a list, but that's all; you can't get any further info out. Does anyone have a clue what I've done wrong? My config is above.

It can't be an API key issue right now because I'm using the master key for testing. This is also set as an env variable, and I'm running these commands from within the consul server container.

I have also confirmed that I can pull info out of the KV store fine, but the DNS/port info seems to be missing, which would explain why I can't get Traefik to play ball correctly. It works with certs using the KV store, but I'm having to assign labels :(

Must be some config I'm missing... anyone got any ideas? I would be really grateful. Thank you.

I used the docs here for finding how to test: https://gliderlabs.com/registrator/latest/user/quickstart/

/ # curl http://172.30.1.200:8500/v1/catalog/services
/ # curl http://172.30.1.200:8500/v1/catalog/services
{"consul":[],"consul-8300":[],"consul-8301":["udp"],"consul-8302":["udp"],"consul-8500":[],"consul-8600":["udp"],"portainer":[],"traefik-443":[],"traefik-80":[],"traefik-8089":[],"whoami":[]}/ 
/ # curl http://172.30.1.200:8500/v1/catalog/services/whoami
/ # curl http://172.30.1.200:8500/v1/catalog/service/whoami
/ # curl http://172.30.1.200:8500/v1/catalog/service/whoami
/ # curl http://172.30.1.200:8500/v1/catalog/service/traefik-8089

TIA

isuftin commented 6 years ago

@askulkarni2 Sorry about the very delayed response in regards to https://github.com/hashicorp/docker-consul/issues/66#issuecomment-342292331

I supply multiple retry-joins because this way, each node will attempt to retry the same address a few times before considering the current join attempt a failure and moving on to the next attempt or failing out completely. This seems to work for me.

night-crawler commented 6 years ago

There is also another problem with the interface name if you have multiple networks. First container:

/ # ip route list
default via 172.24.0.1 dev eth3
10.0.0.0/24 dev eth1 scope link  src 10.0.0.64
---> 10.111.111.0/24 dev eth2 scope link  src 10.111.111.77
10.255.0.0/16 dev eth0 scope link  src 10.255.0.207
172.24.0.0/16 dev eth3 scope link  src 172.24.0.8

Other 2 containers:

/ # ip route list
default via 172.24.0.1 dev eth3
10.0.0.0/24 dev eth2 scope link  src 10.0.0.63
---> 10.111.111.0/24 dev eth1 scope link  src 10.111.111.76
10.255.0.0/16 dev eth0 scope link  src 10.255.0.206
172.24.0.0/16 dev eth3 scope link  src 172.24.0.10

I get different interface names for the same network, so it may be useful to add a CONSUL_BIND_SUBNET env variable (in docker-entrypoint.sh):

# Proposed addition: derive the bind interface from the route that matches the
# requested subnet, e.g. "10.111.111.0/24 dev eth2 scope link src ..." -> "eth2"
if [ -n "$CONSUL_BIND_SUBNET" ]; then
    CONSUL_BIND_INTERFACE=$(ip route list | grep "$CONSUL_BIND_SUBNET" | cut -d' ' -f3)
fi

and provide an extra env entry:

environment:
  - CONSUL_BIND_SUBNET=10.111.111.0/24
seffyroff commented 6 years ago

Hoo boy I've been scratching my head on this one all day. I'm trying to deploy a Consul KV store for Traefik on my Swarm, and it's being difficult. My compose file:

version: '3.4'

networks:
  proxy:
    external:
      name: proxy

services:
  consul:
    image: consul
    command: agent -server -bootstrap-expect=1 -log-level debug -ui-dir /ui
    environment:
      - CONSUL_CLIENT_INTERFACE=eth0
      - CONSUL_BIND_INTERFACE=eth0
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - /etc/localtime:/etc/localtime:ro
    networks:
      - proxy
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.role == manager
          - node.hostname == rombus

The service logs appear to indicate that it's flapping: https://gist.github.com/seffyroff/505315aeacdb362a4311214ade0c0b39

Looking at stderr I see this spam: ip: can't find device 'eth0' and flag provided but not defined: -client_addr, followed by the usage help for consul.

I've tried declaring ports or not, host/ingress combinations, adding config via json, standing on one leg. It's a persistent beast!

seffyroff commented 6 years ago

Following @soakes' config, I finally got something that works for me. As he said, the DNS round-robin endpoint mode and the network alias were key; for me, removing the CONSUL_CLIENT_INTERFACE variable and adding -client 0.0.0.0 to the command was the final puzzle piece that sorted it out.
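
For anyone following along, a sketch of how the service definition above changes under that fix (the other flags left as they were):

    command: agent -server -bootstrap-expect=1 -log-level debug -ui-dir /ui -client 0.0.0.0
    environment:
    - CONSUL_BIND_INTERFACE=eth0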

aerohit commented 6 years ago

@soakes I am using the stack file that you posted on 23rd Dec, with Consul server, client and registrator. It works perfectly. Now suppose I am running a service which needs to communicate with Consul, but I want to make sure that it talks to the Consul agent running on the same Docker host as the service (regardless of whether Consul is running in server or client mode on that host). How can I achieve that?

Thanks a lot.

bhavikkumar commented 6 years ago

@aerohit The way I managed to get that working is by using the gateway IP of a bridge network. This is generally 172.17.0.1 for the default bridge, but you can check by running docker network inspect bridge. However, it is recommended that you create your own bridge network.

User-defined bridge networks are superior to the default bridge network.

The documentation for this can be found at https://docs.docker.com/network/bridge/
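
As a sketch of that approach (the network name is illustrative, and this assumes the default IPAM driver has populated a gateway):

docker network create my-bridge
docker network inspect -f '{{ (index .IPAM.Config 0).Gateway }}' my-bridge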

Sispheor commented 6 years ago

@soakes How did you bootstrap acl_master_token, acl_agent_token and CONSUL_HTTP_TOKEN?

Sispheor commented 6 years ago

OK, I found out how to generate the tokens.
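
For anyone else wondering: with the legacy ACL system used in these configs the tokens are just UUIDs, so something like the following works (a sketch, not the only way); consul keygen is what produces the gossip "encrypt" value:

uuidgen          # e.g. acl_master_token / acl_agent_token / CONSUL_HTTP_TOKEN
consul keygen    # gossip encryption key for "encrypt"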

Now my issue is:

2018/02/28 14:11:28 [INFO] serf: EventMemberJoin: 37df9a200b83 10.0.2.3
    2018/02/28 14:11:28 [INFO] agent: Started DNS server 0.0.0.0:8600 (udp)
    2018/02/28 14:11:28 [INFO] agent: Started DNS server 0.0.0.0:8600 (tcp)
    2018/02/28 14:11:28 [INFO] agent: Started HTTP server on [::]:8500 (tcp)
    2018/02/28 14:11:28 [ERR] agent: failed to sync remote state: No known Consul servers
    2018/02/28 14:11:28 [INFO] agent: started state syncer
    2018/02/28 14:11:28 [WARN] manager: No servers available
    2018/02/28 14:11:28 [INFO] agent: Retry join LAN is supported for: aliyun aws azure digitalocean gce os scaleway softlayer
    2018/02/28 14:11:28 [INFO] agent: Joining LAN cluster...
    2018/02/28 14:11:28 [INFO] agent: (LAN) joining: [consul.cluster consul.cluster]
    2018/02/28 14:11:28 [INFO] serf: EventMemberJoin: b432faedba5c 10.0.2.2
    2018/02/28 14:11:28 [INFO] consul: adding server b432faedba5c (Addr: tcp/10.0.2.2:8300) (DC: dc1)
    2018/02/28 14:11:29 [INFO] serf: EventMemberJoin: 68d5cb8ea9a6 10.0.2.4
    2018/02/28 14:11:29 [INFO] agent: (LAN) joined: 2 Err: <nil>
    2018/02/28 14:11:29 [INFO] agent: Join LAN completed. Synced with 2 initial agents
    2018/02/28 14:11:36 [ERR] consul: "Catalog.NodeServices" RPC failed to server 10.0.2.2:8300: rpc error making call: No cluster leader
    2018/02/28 14:11:36 [ERR] agent: failed to sync remote state: rpc error making call: No cluster leader
    2018/02/28 14:11:55 [ERR] consul: "Coordinate.Update" RPC failed to server 10.0.2.2:8300: rpc error making call: No cluster leader

Here is my docker compose

---
version: "3.4"

networks:
  consul:
    # external: true

volumes:
  consul:

services:
  server:
    image: consul
    volumes:
      - consul:/consul
      - ./config/consul.multi-node.server.json:/consul/config/consul.json
    ports:
      - target: 8500
        published: 8500
        mode: host
    networks:
      consul:
        aliases:
          - consul.cluster
    environment:      
      - CONSUL_BIND_INTERFACE=eth0
      - CONSUL_HTTP_TOKEN=32e3ed4d-93ba-44f9-a444-5a010b512528
    command: "agent -client 0.0.0.0 -config-file /consul/config/consul.json"
    deploy:
      endpoint_mode: dnsrr
      mode: global
      placement:
        constraints: [node.role ==  manager]

  client:
    image: consul
    volumes:
      - consul:/consul
      - ./config/consul.multi-node.client.json:/consul/config/consul.json      
    networks:
      consul:
        aliases:
          - consul.client.cluster
    environment:      
      - CONSUL_BIND_INTERFACE=eth0
      - CONSUL_HTTP_TOKEN=32e3ed4d-93ba-44f9-a444-5a010b512528
    command: "agent -client 0.0.0.0 -config-file /consul/config/consul.json"
    deploy:
      endpoint_mode: dnsrr
      mode: global
      placement:
        constraints: [node.role !=  manager]

  registrator:
      image: gliderlabs/registrator:master
      command: -internal consul://consul.cluster:8500
      volumes:
        - /var/run/docker.sock:/tmp/docker.sock
      networks:
        - consul
      environment:
        - CONSUL_HTTP_TOKEN=32e3ed4d-93ba-44f9-a444-5a010b512528    
      deploy:
        mode: global

With consul config server:

{   
    "server": true,
    "skip_leave_on_interrupt": true, 
    "acl_down_policy":"allow", 
    "acl_master_token":"8b8cbb0c-1c88-11e8-accf-0ed5f89f718b", 
    "acl_agent_token":"8b8cbf26-1c88-11e8-accf-0ed5f89f718b", 
    "acl_datacenter":"dc1", 
    "acl_default_policy":"deny", 
    "datacenter":"dc1", 
    "encrypt":"7jGmTVfQ6WXmUUDVQS2yFQ==",  
    "data_dir":"/consul/data", 
    "ui" : true,
    "bootstrap_expect": 3,
    "retry_join": ["consul.cluster"]
}

And client

{   
    "skip_leave_on_interrupt": true, 
    "acl_down_policy":"allow", 
    "acl_master_token":"8b8cbb0c-1c88-11e8-accf-0ed5f89f718b", 
    "acl_agent_token":"8b8cbf26-1c88-11e8-accf-0ed5f89f718b", 
    "acl_datacenter":"dc1", 
    "acl_default_policy":"deny", 
    "datacenter":"dc1", 
    "encrypt":"7jGmTVfQ6WXmUUDVQS2yFQ==", 
    "data_dir":"/consul/data", 
    "ui" : true,
    "retry_join": ["consul.cluster"]
}
Sispheor commented 6 years ago

It's OK, I forgot to update the number of expected servers.
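
In other words, bootstrap_expect in the server config has to match the number of server agents that will actually come up; for example, with a single manager node running the global server service it would be (value illustrative):

"bootstrap_expect": 1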

dperetti commented 6 years ago

For the record, here is a simple stack for a three-server Docker Swarm that worked for me:

version: '3.6'

x-consul: &consul
    image: consul:latest
    volumes:
      - consul:/consul

volumes:
  consul:

services:
  client:
    <<: *consul
    command: "agent -retry-join server-bootstrap -client 0.0.0.0 -bind '{{ GetInterfaceIP \"eth0\" }}'"
    depends_on:
      - server-bootstrap
    deploy:
      replicas: 2

  server:
    <<: *consul
    ports:
      - "8500:8500"
    depends_on:
      - server-bootstrap
    command: "agent -server -retry-join server-bootstrap -client 0.0.0.0 -bind '{{ GetInterfaceIP \"eth0\" }}' -ui"
    deploy:
      replicas: 2
      placement:
        constraints: [node.role == manager]

  server-bootstrap:
    image: consul
    command: "agent -server -bootstrap-expect 3 -client 0.0.0.0 -bind '{{ GetInterfaceIP \"eth0\" }}'"
    deploy:
      placement:
        constraints: [node.role == manager]
isuftin commented 6 years ago

@dperetti As an aside, that's the first time I've seen someone use YAML merging in a compose file. Fascinating.

dperetti commented 6 years ago

@isuftin https://docs.docker.com/v17.12/compose/compose-file/#extension-fields

prologic commented 5 years ago

The stack file by @soakes works great in a Docker Swarm cluster. Thanks!

nicholasamorim commented 5 years ago

Is it possible with this setup to have Consul agents outside the swarm connect to the Consul servers in the swarm?

fskroes commented 5 years ago

With the help of this post and other posts that I have found on the internet, I would like to share my solution to the various problems that I have encountered. Maybe this can help some people.

I have tested this by deploying it as a stack, having beforehand created a network called consul: docker network create -d overlay --attachable consul

version: "3.4"

networks:
  consul:
    external: true

services:

  consul:
    image: consul:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    ports:
      - target: 8500
        published: 8500
        mode: host
    networks:
      consul:
        aliases:
          - consul.cluster
    environment:
      - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, "acl_datacenter":"mydc", "acl_default_policy":"allow", "acl_down_policy":"extend-cache", "datacenter":"mydc", "encrypt":"GEph2hPCk9GFM39iE2MiLA==", "data_dir":"/consul/data", "server":true }'
      - CONSUL_BIND_INTERFACE=eth0
    command: "agent -ui -server -bootstrap -client 0.0.0.0 -retry-join consul.client -retry-join consul.client2"
    deploy:
      placement:
        constraints: [node.role == consul]
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue

  client:
    image: consul:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      consul:
        aliases:
          - consul.client
    environment:      
      - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, "acl_datacenter":"mydc", "acl_default_policy":"allow", "acl_down_policy":"extend-cache", "datacenter":"mydc", "encrypt":"GEph2hPCk9GFM39iE2MiLA==", "data_dir":"/consul/data", "server":true }'
      - CONSUL_BIND_INTERFACE=eth0
    command: "agent -ui -server -client 0.0.0.0 -retry-join consul.cluster -retry-join consul.client2"
    deploy:
      placement:
        constraints: [node.role == client]
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue

  client2:
    image: consul:latest
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
    networks:
      consul:
        aliases:
          - consul.client2
    environment:      
      - 'CONSUL_LOCAL_CONFIG={ "skip_leave_on_interrupt": true, "acl_datacenter":"mydc", "acl_default_policy":"allow", "acl_down_policy":"extend-cache", "datacenter":"mydc", "encrypt":"GEph2hPCk9GFM39iE2MiLA==", "data_dir":"/consul/data", "server":true }'
      - CONSUL_BIND_INTERFACE=eth0
    command: "agent -ui -server -client 0.0.0.0 -retry-join consul.cluster -retry-join consul.client"
    deploy:
      placement:
        constraints: [node.role == client2]
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s
      update_config:
        parallelism: 1
        delay: 10s
        failure_action: continue

  registrator:
      image: gliderlabs/registrator:latest
      command: -internal consul://consul.cluster:8500
      volumes:
        - /var/run/docker.sock:/tmp/docker.sock
      networks:
        - consul
      depends_on:
        - "consul"
      deploy:
        mode: global
gbourgeat commented 5 years ago

Hi @soakes @fskroes,

With your sample config, I'm stuck on a Consul agent error when resolving the network alias 'consul.cluster':

  • Failed to resolve consul.cluster: lookup consul.cluster on 127.0.0.11:53: no such host

I'm looking for what I missed :(

Thanks a lot if someone has a solution :)
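
One way to narrow this down (a sketch, assuming the network is the attachable overlay named consul from the example above) is to check whether Docker's embedded DNS resolves the alias from another container attached to the same network:

docker run --rm --network consul alpine nslookup consul.cluster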

Strum355 commented 4 years ago

@nicholasamorim Yes, that's possible, and that's what I've done. I've got Consul client agents outside the swarm, one per host, and the recommended number of Consul server agents in the swarm.
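
A rough sketch of the external-client side of that layout (heavily hedged: <host-ip> and <swarm-node-ip> are placeholders, and the swarm nodes must actually expose the servers' gossip port 8301 and RPC port 8300 to the outside host, which the stacks above do not publish by default):

consul agent -client 0.0.0.0 -bind <host-ip> -data-dir /consul/data -retry-join <swarm-node-ip>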