mesosphere / marathon

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.
https://mesosphere.github.io/marathon/
Apache License 2.0

backoffSeconds/backoffFactor other than 1/1.15 don't restart services after complete slave failures #1312

Closed: bacoboy closed this issue 7 years ago

bacoboy commented 9 years ago

While issue https://github.com/mesosphere/marathon/issues/616 appears to be closed, we came across an interesting case where we see the same behavior. We launched two different Docker applications via Marathon; the only apparent difference between them was which Docker container they used. Some time down the road we had to recreate all the slaves (they were VMs and needed an updated base image). They came back with the same server names (not that I think that matters, but mentioning it in case I'm wrong). While they were down, the Marathon console showed 0/3 running for each of the applications (as expected). Once we rekicked the slaves, they rejoined the Mesos cluster, but only one of the applications started; the other didn't. It just stayed at 0/3 (Running) and never switched to Deploying (like the one that had recovered). Scaling to some other number (0 or 2 or 4) would get things moving again (which is why it seems related to issue 616).
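For reference, the "scale to a different number" workaround we used amounts to something like the following (a minimal sketch using Python and the requests library; the Marathon URL and app id are placeholders, not our real values):

```python
# Sketch of the scaling workaround: bump the instance count so Marathon starts
# a new deployment for the stuck app. URL and app id are placeholders.
import requests

MARATHON = "http://marathon.example.com:8080"
APP_ID = "/my-docker-app"

def scale(instances):
    # PUT /v2/apps/{appId} with a new instance count triggers a deployment,
    # which is what got the app stuck at 0/3 (Running) moving again.
    resp = requests.put(f"{MARATHON}/v2/apps{APP_ID}",
                        json={"instances": instances},
                        timeout=10)
    resp.raise_for_status()
    return resp.json()

scale(2)  # scale away from the desired count ...
scale(3)  # ... and back again
```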

Digging deeper, we found one other difference between the two Marathon configurations: the one that recovered used backoffSeconds=1 and backoffFactor=1.15 (which is apparently what all the tests in the code use as well), while the one that didn't recover used backoffSeconds=5 and backoffFactor=2.
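To make that concrete, the relevant part of the two app definitions looked roughly like this (a sketch with placeholder ids and images; only the backoff fields differed on purpose):

```python
# Sketch of the two app definitions; ids and images are placeholders.
app_recovered = {
    "id": "/app-a",
    "container": {"type": "DOCKER", "docker": {"image": "example/app-a"}},
    "instances": 3,
    "backoffSeconds": 1,    # this one recovered once the slaves rejoined
    "backoffFactor": 1.15,
}

app_stuck = {
    "id": "/app-b",
    "container": {"type": "DOCKER", "docker": {"image": "example/app-b"}},
    "instances": 3,
    "backoffSeconds": 5,    # this one stayed at 0/3 (Running)
    "backoffFactor": 2,
}
```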

My first hunch was that there was some kind of int/decimal issue, since backoffFactor is a decimal in the code, and that perhaps a strange autoboxing bug was causing Marathon to calculate an effectively infinite wait. We set backoffSeconds to 1 and backoffFactor to 1.15 (like the service that did recover), rekicked the slaves, and this time BOTH applications came back up when the slaves re-registered.

So the next thing we tried was to use backoffSeconds=5, backoffFactor=2.5. Didn't work. Then backoffSeconds=1, backoffFactor=2.5. Didn't work. Then backoffSeconds=5, backoffFactor=1.15. Didn't work.

We couldn't find ANY combination other than the magical 1/1.15 mix that caused things to recover.
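For what it's worth, here is how the launch delay should grow under each of those settings, assuming the documented behavior (the delay starts at backoffSeconds, is multiplied by backoffFactor after each launch attempt, and is capped at maxLaunchDelaySeconds, 3600s by default):

```python
# Rough sketch of the expected launch-delay growth, assuming the delay starts
# at backoffSeconds, multiplies by backoffFactor per attempt, and is capped at
# maxLaunchDelaySeconds (3600s by default).
def delays(backoff_seconds, backoff_factor, attempts=10, max_delay=3600):
    d = backoff_seconds
    out = []
    for _ in range(attempts):
        out.append(round(min(d, max_delay), 2))
        d *= backoff_factor
    return out

print(delays(1, 1.15))  # grows slowly: 1, 1.15, 1.32, 1.52, ...
print(delays(5, 2))     # grows fast:   5, 10, 20, 40, ...
print(delays(5, 2.5))   # fastest:      5, 12.5, 31.25, ... hits the 3600s cap within 10 attempts
```

Even the fastest-growing combination should still retry within the hour, so a pure delay doesn't obviously explain why only the 1/1.15 combination ever recovered.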

We are using the Mesosphere packages (on a CentOS 7 base).

To sidestep this, we've started using the magical 1/1.15 ratio, but since the problem was 100% reproducible by adjusting these settings, we thought we'd file a ticket in hopes that a root cause can be found.

I'll also mention that with multiple slaves, killing one at a time had the services jumping to the remaining slaves (as expected). Since most people would roll slaves in situations like this (rather than taking them all down at once), this particular case may not come to light often.

Thanks to @andutta for helping to track this down.

aquamatthias commented 9 years ago

Hey @bacoboy, there were some recent changes to the rate-limiting behavior. Can you try this out with the current version, 0.9.0-RC1? Does this problem still exist?

bacoboy commented 9 years ago

I didn't forget about this, but we are only up to Marathon 0.8.2. I'll test when we get to 0.9.X.

meichstedt commented 7 years ago

Note: This issue has been migrated to https://jira.mesosphere.com/browse/MARATHON-3373. For more information see https://groups.google.com/forum/#!topic/marathon-framework/khtvf-ifnp8.