mantl / mesos-consul

Mesos to Consul bridge for service discovery
Apache License 2.0
338 stars 95 forks source link

mesos-consul keep on connecting to terminated mesos-agents, causing high latency in updating records to consul #124

Open rahul9999 opened 6 years ago

rahul9999 commented 6 years ago

Our environment is on aws ec2 with mesos cluster size of 150 nodes. Nodes scaling is managed by aws autoscaling. We are using latest version of mesos-consul 0.4.0.

We are seeing issue with mesos-consul when a node is terminated (by ASG or by other operational activities)-

Workaround is re-start mesos-consul which then re-build consul data from scratch (not a desired solution). meso-consul-deregistrationerror-timeouts.txt

Example logs (also in attachment): time="2018-01-12T12:50:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11061: dial tcp 110.xx.xx.38:8500: i/o timeout" time="2018-01-12T12:50:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9092" time="2018-01-12T12:50:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9092: dial tcp 10.65.xx.38:8500: i/o timeout" time="2018-01-12T12:50:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kibana-blackbox:172.xx.xx.5:42973" time="2018-01-12T12:51:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kibana-blackbox:172.xx.xx.5:42973: dial tcp 10.xx.xx.38:8500: i/o timeout" time="2018-01-12T12:51:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49910" time="2018-01-12T12:51:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49910: dial tcp 10.xx.xx.38:8500: i/o timeout" time="2018-01-12T12:51:55Z" level=info msg="Deregistering mesos-consul:mesos:6d3301d5-2bf1-439f-979d-b5ec5eb49312-S75:10.65.44.17" time="2018-01-12T12:52:25Z" level=info msg="Deregistration error Put http://10.xx.xx.17:8500/v1/agent/service/deregister/mesos-consul:mesos:6d3301d5-2bf1-439f-979d-b5ec5eb49312-S75:10.65.44.17: dial tcp 10.xx.xx.17:8500: i/o timeout" time="2018-01-12T12:52:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9094" time="2018-01-12T12:52:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:kafka:172.xx.xx.6:9094: dial tcp 10.xx.xx.38:8500: i/o timeout" time="2018-01-12T12:52:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49911" time="2018-01-12T12:53:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault:172.xx.xx.3:49911: dial tcp 10.xx.xx.38:8500: i/o timeout" time="2018-01-12T12:53:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11059" time="2018-01-12T12:53:55Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.38:tvault-ccp:172.xx.xx.2:11059: dial tcp 10.xx.xx.38:8500: i/o timeout" time="2018-01-12T12:53:55Z" level=info msg="Deregistering mesos-consul:10.xx.xx.11:uat-dcp-api-profile-uat-dcp:172.xx.xx.13:25553" time="2018-01-12T12:53:55Z" level=info msg="Deregistering mesos-consul:mesos:e809550e-b1d9-44ae-8636-bf21c2d06e8a-S29:10.xx.xx.38" time="2018-01-12T12:54:25Z" level=info msg="Deregistration error Put http://10.xx.xx.38:8500/v1/agent/service/deregister/mesos-consul:mesos:e809550e-b1d9-44ae-8636-bf21c2d06e8a-S29:10.65.228.38: dial tcp 10.xx.xx.38:8500: i/o timeout" time="2018-01-12T12:54:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.33:msg-ui:172.xx.xx.11:61067" time="2018-01-12T12:54:25Z" level=info msg="Deregistering mesos-consul:10.xx.xx.17:telemetry:172.xx.xx.6:31829" time="2018-01-12T12:54:55Z" level=info msg="Deregistration error Put http://10.xx.xx.17:8500/v1/agent/service/deregister/mesos-consul:10.xx.xx.17:telemetry:172.xx.xx.6:31829: dial tcp 10.xx.xx.17:8500: i/o timeout" time="2018-01-12T12:54:55Z" level=info msg="Zookeeper leader: 10.xx.xx.202:5050" time="2018-01-12T12:54:55Z" level=info msg="reloading from master 10.xx.xx.202" time="2018-01-12T12:54:55Z" level=info msg="Running parseState" time="2018-01-12T12:54:55Z" level=info msg="Host found. Comparing tags: ([agent follower], [agent follower])" time="2018-01-12T12:54:55Z" level=info msg="Host found. Comparing tags: ([agent follower], [agent follower])"

We are requesting a configuration parameter like 'max-attempts for failed nodes' which can be used to control the connection attempt count and when that count is reached mesos-consul should remove the failed node from its memory i.e should never ever connect again.

-Rahul