gena01 opened this issue 7 years ago
And if you do not set the consul-timeout option (it defaults to 30 seconds) and you have plenty of such hosts, a single service refresh cycle can take 10+ minutes. Even with a 1 second timeout configured and 20+ failed nodes, the time is too long.
I think it might be useful to split registration and de-registration into their own threads. Even more, de-registration could have two threads/lists:
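The split suggested above could be sketched roughly like this: two independent worker threads fed by separate queues, so de-registrations that hang on dead nodes never block new registrations. This is an illustration only; `run_split` and the `handle` callback are made-up names, not mesos-consul's actual API.

```python
# Sketch: run registration and de-registration on separate worker threads,
# so slow de-registrations (e.g. timeouts against dead nodes) cannot
# block new registrations. All names here are hypothetical.
import queue
import threading

def worker(q: queue.Queue, handle, done: list) -> None:
    """Drain tasks from q, recording each handled task in `done`."""
    while True:
        task = q.get()
        if task is None:          # sentinel: shut this worker down
            q.task_done()
            return
        handle(task)              # e.g. call the Consul agent here
        done.append(task)
        q.task_done()

def run_split(register_tasks, deregister_tasks, handle):
    """Process the two task lists concurrently on separate threads."""
    reg_q, dereg_q = queue.Queue(), queue.Queue()
    registered, deregistered = [], []
    threads = [
        threading.Thread(target=worker, args=(reg_q, handle, registered)),
        threading.Thread(target=worker, args=(dereg_q, handle, deregistered)),
    ]
    for t in threads:
        t.start()
    for task in register_tasks:
        reg_q.put(task)
    for task in deregister_tasks:
        dereg_q.put(task)
    for q in (reg_q, dereg_q):
        q.put(None)               # one sentinel per queue / worker
    for t in threads:
        t.join()
    return registered, deregistered

if __name__ == "__main__":
    reg, dereg = run_split(["svc-a", "svc-b"], ["svc-x"], handle=lambda t: None)
    print(reg, dereg)
```

With this shape, a 30-second timeout against a dead node only stalls the de-registration queue; registrations keep flowing.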
There's another option to consider. I run a local consul agent (in client/agent mode) which talks to a multi-master cluster (the recommended HA setup). So ideally the registration/deregistration should flow through the local agent. (Maybe another reason to allow overriding/forcing a specific consul IP address when running mesos-consul.)
If you push registration through a local agent then all of the services are linked with that agent. So if that agent becomes unavailable, all of the services that were registered through that agent become unavailable.
The Consul docs have a guide specifically for overcoming the local-agent-owns-the-services problem: https://www.consul.io/docs/guides/external.html
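For reference, the approach in that guide is to register the service directly against the catalog (`PUT /v1/catalog/register`) under a node name that is not a live agent, so no single agent owns the entry. A rough sketch, with made-up node/service names:

```python
# Sketch: register an "external" service straight into the Consul catalog
# (PUT /v1/catalog/register), owned by a synthetic node rather than a live
# local agent. Node and service names below are illustrative only.
import json
import urllib.request

def catalog_register_payload(node, address, service, port):
    """Build the request body for PUT /v1/catalog/register."""
    return {
        "Node": node,                 # synthetic node name, not a real agent
        "Address": address,
        "Service": {
            "ID": f"{node}:{service}:{port}",
            "Service": service,
            "Port": port,
        },
    }

def catalog_register(consul_url, payload):
    """Send the registration to any reachable Consul server/agent."""
    req = urllib.request.Request(
        f"{consul_url}/v1/catalog/register",
        data=json.dumps(payload).encode(),
        method="PUT")
    return urllib.request.urlopen(req)

if __name__ == "__main__":
    body = catalog_register_payload("mesos-task-7", "10.0.0.42", "web", 8080)
    print(json.dumps(body))
    # catalog_register("http://127.0.0.1:8500", body)  # requires a running Consul
```

The caveat discussed in this thread still applies: if the node name matches a live agent, anti-entropy on that agent will remove catalog entries it doesn't know about.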
When I originally wrote mesos-consul, I used the catalog for registration. The services were always getting removed by anti-entropy.
I see. I had a Mesos slave cluster on top of EC2 spot instances (which could get killed at any time). This would leave quite a bit of a mess in Consul. Checking the log showed that mesos-consul was unable to contact the agent on the killed nodes.
I have the same issue: mesos-consul is so busy trying to deregister services that it doesn't register new ones. Even if the agent leaves the consul cluster, mesos-consul tries to deregister services that were running on that node. It's trying to deregister services that are already deregistered. Is that normal behaviour? If there is a way to get around this issue, please let me know. Cheers
Let's summarize: servers in the cloud are a very dynamic thing. They go up and down, and we have short-lived servers (temporary, for some task) which join the cluster for an hour or two and then leave, etc.
I'm still convinced that my first idea, separating the work into different threads, is the best way to manage this.
I can give a real example :) AWS, a small spot fleet of around 100 cores, running around 200 tasks :) I need to upgrade the base image, etc. It's dev, so I chose the simple way: all down, all up. Start a second fleet, terminate the first fleet. In 1-2 minutes all 200 tasks are re-scheduled on the new fleet. But now we have a big problem with registration. I have a 1 second timeout configured (down from 30 :) ) ... 200 tasks ... that's at least 200 seconds ... each run ... ;)
I think just splitting this into two tasks, or doing some async work, would improve performance greatly.
BTW, a simple solution: would it be possible to add an option so I could run two instances, one that only registers and another that only de-registers? :)
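The two-instance idea could look something like this. Note that `--mode` is a hypothetical option, not an existing mesos-consul flag:

```python
# Hypothetical sketch: a `--mode` option (NOT a real mesos-consul flag)
# that would let one instance handle only registrations while a second
# instance handles only de-registrations.
import argparse

def parse_mode(argv):
    """Parse the hypothetical --mode flag from argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--mode",
        choices=["both", "register", "deregister"],
        default="both",
        help="which sync operations this instance performs")
    return parser.parse_args(argv).mode

def should_register(mode):
    return mode in ("both", "register")

def should_deregister(mode):
    return mode in ("both", "deregister")

if __name__ == "__main__":
    mode = parse_mode(["--mode", "register"])
    print(mode, should_register(mode), should_deregister(mode))
```

You would then run one process with `--mode register` and a second with `--mode deregister`, so dead-node timeouts in the second never delay the first.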
Has anyone solved this? I'm having this issue as well when Mesos loses a slave.
It seems that it tries to connect to the node that "used" to run the task to deregister the service from it.
What if you run a multi-master Consul cluster with local agents on each box, and you lose the slave that was running the service? The service is never deregistered, and when running with log level DEBUG it appears that it will try forever, continuing to hit the old IP address of where the service was running.
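One possible fallback for this lost-slave case: instead of retrying the dead node's agent endpoint forever, remove the stale entry through any live Consul server with `PUT /v1/catalog/deregister`. Since the owning node is gone, no agent's anti-entropy will re-add the entry. A rough sketch, with illustrative names:

```python
# Sketch: deregister a service that lived on a dead node via the catalog
# endpoint (PUT /v1/catalog/deregister) on any live Consul server, rather
# than retrying the dead agent's old IP. Names below are illustrative.
import json
import urllib.request

def catalog_deregister_payload(node, service_id=None):
    """Build the body for PUT /v1/catalog/deregister.

    Omitting ServiceID removes the whole node from the catalog;
    including it removes just that one service entry."""
    body = {"Node": node}
    if service_id is not None:
        body["ServiceID"] = service_id
    return body

def catalog_deregister(consul_url, payload):
    """Send the deregistration to any reachable Consul server/agent."""
    req = urllib.request.Request(
        f"{consul_url}/v1/catalog/deregister",
        data=json.dumps(payload).encode(),
        method="PUT")
    return urllib.request.urlopen(req)

if __name__ == "__main__":
    print(catalog_deregister_payload("lost-slave-3", "lost-slave-3:web:8080"))
    # catalog_deregister("http://127.0.0.1:8500", ...)  # requires a running Consul
```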