consul service deregistration fails if the mesos slave is down

gena01 commented 7 years ago

It seems that mesos-consul tries to connect to the node that used to run the task in order to deregister the service from it.

What if you run a multi-master Consul cluster with local agents running on each box, and you lose the slave that was running the service? The service is never deregistered, and with log level DEBUG it appears that mesos-consul keeps trying forever to deregister the service by hitting the old IP address where the service was running.

evilezh commented 7 years ago

And if you do not set the consul-timeout option (it defaults to 30 seconds) and you have plenty of such hosts, a service refresh cycle might take 10+ minutes. Even when I configure a 1 second timeout and have 20+ failed nodes, the time is too large.

I think it might be useful to split registration and de-registration, each in its own thread. Going further, de-registration could have two threads/lists:

hot (new de-registrations)
retry (failed de-registrations)

Another way is to implement pushback on failed entries.
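
The hot/retry split with pushback could be sketched roughly as follows. This is only an illustration in Python, not mesos-consul code: the class name `DeregistrationQueue`, its parameters, and the `deregister` callback (which returns `True` on success) are all hypothetical, and exponential backoff is just one possible reading of "pushback".

```python
import heapq
from collections import deque


class DeregistrationQueue:
    """Hypothetical sketch: a hot list for new de-registrations and a
    retry list with exponential backoff for the ones that failed."""

    def __init__(self, base_delay=1.0, max_delay=300.0, max_attempts=5):
        self.hot = deque()   # new de-registrations, always tried first
        self.retry = []      # heap of (next_attempt_time, seq, service_id, attempts)
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_attempts = max_attempts  # total attempts before an entry is dropped
        self._seq = 0        # tie-breaker so heap entries never compare equal

    def add(self, service_id):
        self.hot.append(service_id)

    def _schedule_retry(self, service_id, attempts, now):
        if attempts >= self.max_attempts:
            return  # give up: the entry disappears from the retry list
        # Delay doubles with every failed attempt, capped at max_delay.
        delay = min(self.base_delay * 2 ** (attempts - 1), self.max_delay)
        self._seq += 1
        heapq.heappush(self.retry, (now + delay, self._seq, service_id, attempts))

    def run_once(self, deregister, now):
        """Try every hot entry, then only the retry entries that are due.
        Returns the service ids that were de-registered in this pass."""
        done = []
        while self.hot:
            sid = self.hot.popleft()
            if deregister(sid):
                done.append(sid)
            else:
                self._schedule_retry(sid, 1, now)
        while self.retry and self.retry[0][0] <= now:
            _, _, sid, attempts = heapq.heappop(self.retry)
            if deregister(sid):
                done.append(sid)
            else:
                self._schedule_retry(sid, attempts + 1, now)
        return done
```

With this shape, an unreachable node costs one timeout when it first fails and then progressively fewer attempts, so new de-registrations are not stuck behind entries that keep failing.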

gena01 commented 7 years ago

There's another option to consider. I run a local Consul agent (in client/agent mode) which talks to a multi-master cluster (the recommended HA setup). So ideally the registration/deregistration should flow through the local agent. (Maybe another reason to override/force a specific Consul IP address when running mesos-consul.)

ChrisAubuchon commented 7 years ago

If you push registration through a local agent then all of the services are linked with that agent. So if that agent becomes unavailable, all of the services that were registered through that agent become unavailable.

gena01 commented 7 years ago

The Consul docs have a guide specifically for getting around the case where the local agent owns the service(s): https://www.consul.io/docs/guides/external.html
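
For illustration, going through the catalog instead of the dead node's agent might look like the sketch below. It only builds a request for Consul's `/v1/catalog/deregister` HTTP endpoint and does not send it. The helper name `catalog_deregister_request`, the address, and the node and service names are made up for the example; this is not something mesos-consul does as far as this thread shows.

```python
import json
import urllib.request


def catalog_deregister_request(consul_addr, node, service_id=None):
    """Build (but do not send) a catalog deregistration request.

    With only "Node" in the payload, Consul removes the node together
    with its services and checks; adding "ServiceID" limits the request
    to that one service.
    """
    payload = {"Node": node}
    if service_id is not None:
        payload["ServiceID"] = service_id
    return urllib.request.Request(
        url=f"{consul_addr}/v1/catalog/deregister",
        data=json.dumps(payload).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical usage: any reachable agent or server can take the request,
# so the lost slave's own agent does not need to answer.
req = catalog_deregister_request("http://127.0.0.1:8500", "slave-1", "web-1")
# urllib.request.urlopen(req, timeout=5) would actually send it.
```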

ChrisAubuchon commented 7 years ago

When I originally wrote mesos-consul I used the catalog for registration. The services were always getting removed by anti-entropy.

gena01 commented 7 years ago

I see. I had a mesos slave cluster on top of ec2 spot instances (which would/could get killed any time). This would leave quite a bit of a mess in consul. Checking the log would show that mesos-consul was unable to contact the agent on the killed nodes.

caussourd commented 7 years ago

I have the same issue: mesos-consul is so busy trying to deregister services that it doesn't register new ones. Even if the agent leaves the consul cluster, mesos-consul tries to deregister services that were running on that node. It's trying to deregister services that are already deregistered. Is that normal behaviour? If there is a way to get around this issue, please let me know. Cheers

evilezh commented 7 years ago

Let's summarize: servers in the cloud are very dynamic. They go up and down, and we have short-lived servers (temporary, for some task) that join the cluster for an hour or two and then leave.

I'm still convinced that my first idea, separating the work into different threads, is the best way to manage this.

we want robust registration
we want robust de-registration (hot list)
some servers might just reboot (retry list)
and finally, after a day or two, they might disappear from the retry list as well.

I can give a real example :) AWS, a small spot fleet of around 100 cores, running around 200 tasks :) I need to upgrade the base image etc. It's dev, so I chose the simple way: all down, all up. Start a second fleet, terminate the first fleet. In 1-2 minutes all 200 tasks are rescheduled on the new fleet. But now we have a big problem with registration. I have a 1 second timeout configured (down from 30 :) ), so 200 tasks means 200 seconds at minimum, on each run ;)

I think just splitting this into two tasks, or doing some of the work asynchronously, would improve performance greatly.
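
To put rough numbers on that, here is a small sketch in Python. It assumes every call to an unreachable agent blocks for the full timeout, and the function names `worst_case_seconds` and `deregister_all` are hypothetical, not part of mesos-consul.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def worst_case_seconds(unreachable, timeout, workers=1):
    """Estimated time spent waiting when every call hangs until its timeout."""
    return math.ceil(unreachable / workers) * timeout


def deregister_all(service_ids, deregister, workers=20):
    """Call deregister(service_id) for a list of ids on a thread pool.

    deregister is a caller-supplied function returning True on success.
    Returns the ids that failed, in their original order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(deregister, service_ids))
    return [sid for sid, ok in zip(service_ids, results) if not ok]


print(worst_case_seconds(200, 1))              # serial, 1 second timeout: 200
print(worst_case_seconds(200, 1, workers=20))  # 20 threads: 10
print(worst_case_seconds(20, 30))              # 20 dead hosts, default timeout: 600
```

Under that assumption, running the calls on 20 threads cuts the 200 second pass to about 10 seconds, and registrations no longer wait behind the timeouts if they run on a separate thread.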

evilezh commented 7 years ago

BTW, as a simple solution, would it be possible to add an option to

do only registration
do only de-registration

I could run two instances: one would only register, the other would only de-register :)

txbm commented 7 years ago

Has anyone solved this? I'm having this issue as well when Mesos loses a slave.

mantl / mesos-consul

consul service deregistration fails if the mesos slave is down #98