hashicorp / consul

Consul is a distributed, highly available, and data center aware solution to connect and configure applications across dynamic, distributed infrastructure.
https://www.consul.io

Consul quorum problem in container orchestration environments #2558

Closed ghost closed 7 years ago

ghost commented 7 years ago

Background: When running Consul in a container orchestration environment, if an agent fails ungracefully (for example, from an OOM kill), there is a good chance that the scheduler (Marathon) will reschedule the agent on a different host with a different IP.

Problem: For 72 hours, the other agents in the cluster keep trying to reach the dead agent. The newly scheduled agent is added as a new node rather than as a replacement for the dead one, which is what is actually intended. The result is a four-node cluster (3 healthy, 1 dead). If it happens again, there are five nodes (3 healthy, 2 dead), and so on until the cluster exceeds Consul's failure tolerance.
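To make the arithmetic concrete, here is a minimal sketch of how the failure tolerance shrinks as dead servers pile up. It assumes the usual Raft majority rule (quorum is floor(n/2) + 1) and that every dead server is still counted as a peer, as described above. The function names `quorum` and `remaining_tolerance` are made up for this illustration and are not part of Consul.

```python
def quorum(peers: int) -> int:
    """Majority of the peer set under the usual Raft rule: floor(n/2) + 1."""
    return peers // 2 + 1


def remaining_tolerance(healthy: int, dead: int) -> int:
    """Number of additional healthy servers that can fail before quorum is lost.

    Assumption: dead servers are still counted as peers.
    A negative result means quorum is already lost.
    """
    return healthy - quorum(healthy + dead)


if __name__ == "__main__":
    healthy = 3
    for dead in range(4):
        peers = healthy + dead
        print(
            f"peers={peers} healthy={healthy} dead={dead} "
            f"quorum={quorum(peers)} "
            f"remaining_tolerance={remaining_tolerance(healthy, dead)}"
        )
```

Under these assumptions, three healthy servers with one or two stale peers cannot afford to lose another server, and with three stale peers the quorum size is four, so the three healthy servers alone no longer form a majority.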


Any idea/solution? A configuration that I'm missing?

Server: consul:0.7.1 (official Docker image)

Configuration:

{
  "id": "/consul-performance/consul",
  "args": [
    "agent",
    "-server",
    "-data-dir=/consul/data",
    "-ui",
    "-retry-join=consul.consul-performance.marathon.mesos",
    "-retry-max=10",
    "-client=0.0.0.0",
    "-bootstrap-expect=1"
  ],
  "env": {
    "CONSUL_BIND_INTERFACE": "eth0",
    "CONSUL_LOCAL_CONFIG": "{\"telemetry\": { \"statsite_address\": \"consul-performancestatsite.marathon.l4lb.thisdcos.directory:8135\"},\"leave_on_terminate\": true,\"ports\":{\"http\": 10163,\"dns\": 10166,\"server\": 10164,\"serf_wan\": 10162,\"serf_lan\": 10161,\"rpc\": 10165}}",
    "GOGC": "50"
  },
  "instances": 0,
  "cpus": 2,
  "mem": 512,
  "disk": 0,
  "maxLaunchDelaySeconds": 3600,
  "container": {
    "docker": {
      "image": "consul:0.7.1",
      "forcePullImage": false,
      "privileged": false,
      "network": "HOST"
    }
  },
  "healthChecks": [
    {
      "protocol": "HTTP",
      "path": "/v1/status/leader",
      "gracePeriodSeconds": 300,
      "intervalSeconds": 60,
      "timeoutSeconds": 20,
      "maxConsecutiveFailures": 3,
      "ignoreHttp1xx": false
    }
  ],
  "portDefinitions": [
    { "protocol": "tcp", "port": 10163 },
    { "protocol": "tcp", "port": 10161 },
    { "protocol": "tcp", "port": 10162 },
    { "protocol": "tcp", "port": 10164 },
    { "protocol": "tcp", "port": 10166 },
    { "protocol": "tcp", "port": 10165 }
  ],
  "requirePorts": true
}

Operating system and Environment details

DC/OS mesos

slackpad commented 7 years ago

Hi @jacob-koren, please take a look at this post, which has some details about this problem: https://groups.google.com/d/msg/consul-tool/64i0fZ5p3sA/QS2NvINFDQAJ. We should have some automation in Consul 0.8 that makes this easier to manage!