mesosphere / marathon

Deploy and manage containers (including Docker) on top of Apache Mesos at scale.
https://mesosphere.github.io/marathon/
Apache License 2.0

Implement graceful shutdown #712

Closed guenter closed 8 years ago

guenter commented 9 years ago

The current shutdown sequence in Mesos is to send a TERM, followed by a KILL 3 seconds later if the process is still running. Some apps have more advanced requirements like deregistering from a load balancer before shutting down. This thread is to collect requirements for a graceful shutdown feature.
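For reference, the default escalation roughly amounts to the following (a minimal sketch, not the actual executor code; the 3-second grace period mirrors the default mentioned above):

```bash
#!/usr/bin/env bash
# Rough sketch of the default escalation: SIGTERM, short grace period, then SIGKILL.
# The PID is passed in as an argument; 3 seconds mirrors the default mentioned above.
PID="$1"

kill -TERM "$PID"
for _ in 1 2 3; do
  kill -0 "$PID" 2>/dev/null || exit 0   # process already exited, nothing more to do
  sleep 1
done
kill -KILL "$PID"                        # still running after the grace period
```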

guenter commented 9 years ago

There is a patch in flight that makes the timeout configurable: https://reviews.apache.org/r/25434/

ZiaBhatti commented 9 years ago

Signals and a few seconds are not easy to manage. For TCP-based services there may not be a better option, but I don't know how this would work for Docker tasks. For HTTP services, my suggestion would be to add an option, just like the health check, to leverage a shutdown URI. That way the application can take the necessary time to drain and possibly stop responding to health checks.

sttts commented 9 years ago

I linked a ticket from the bamboo haproxy project. It looks like we probably also need some kind of "draining" state of a task with corresponding status events, in order to make bamboo and similar projects behave well on shutdown of tasks.

guenter commented 9 years ago

@sttts that's a nice clean way to handle it.

sielaq commented 9 years ago

It would be cool if there were a configurable pre-kill hook, like hitting an instance:port/kill URL before the task is killed. That way we could implement better logic in the application itself (for example, marking its own health checks as unhealthy so it drops out of HAProxy). A configurable timeout would of course also be needed.

bobrik commented 9 years ago

#1147 was a duplicate, copying some ideas:

Apps that are super-fast to deploy (like golang apps in Docker containers on top of a busybox image that is already downloaded) can end up in a state where all old tasks are killed, but no new tasks are present in service discovery because the update hasn't propagated yet.

I think we could ease the transition to Mesos/Marathon with a timeout during which tasks are no longer exposed in the Marathon API for service discovery, but are still running.

Hiding tasks from the API or marking them as unhealthy before killing them, with a configurable timeout, would improve things a lot for connect-request-response-disconnect apps (think PHP).

Clever apps can catch signals and do a draining shutdown; with a configurable per-app timeout between SIGTERM and SIGKILL, that should work pretty well.

dlaidlaw commented 9 years ago

Ideally I would like something like this:

Marathon asks instances to stop themselves by sending a REST request (PUT to the shutdown path). The response could be 200 OK or 202 Accepted for instances willing to shut down, or 409 Conflict to indicate that the instance would prefer not to shut down. If Marathon really, really wants the instance to stop, it sends a DELETE to the shutdown path; that one must be obeyed. Alternatives are a command or TCP, just like health checking.

Marathon would use the configured timeout to ensure the task actually died. If it did not die within the timeout, it is killed off via Mesos as happens now.

If not enough instances respond with 200 OK or 202 Accepted, Marathon could retry through the task set for a configurable amount of time and finally, if there are still more to stop, use the DELETE to make it unconditional.
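A rough curl sketch of that negotiation, assuming a hypothetical /shutdown path and a placeholder instance address:

```bash
#!/usr/bin/env bash
# Hypothetical negotiation as described above; the /shutdown path and the
# instance address are placeholders, not anything Marathon actually implements.
INSTANCE="$1"   # e.g. 10.0.0.5:31005

# Politely ask the instance to stop (PUT). 200/202 = will shut down, 409 = would rather not.
status=$(curl -s -o /dev/null -w '%{http_code}' -X PUT "http://$INSTANCE/shutdown")

if [ "$status" = "200" ] || [ "$status" = "202" ]; then
  echo "instance agreed to shut down"
else
  # Once retries elsewhere are exhausted, make it unconditional with DELETE.
  curl -s -X DELETE "http://$INSTANCE/shutdown"
fi
```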

This kind of process would allow instances to drain any current workload and stop themselves in an orderly fashion. Not only that, some instances doing important work could effectively say "pick someone else please".

You could do it without the option of declining a shutdown. In effect, you make shutdown mandatory, wait a configured amount of time for it to happen in an orderly fashion, and after that timeout just follow the current procedure. But having the option to decline a shutdown allows you to find idle instances of services. An idle service can shut down immediately with no consequences.

ConnorDoyle commented 9 years ago

CC @BenWhitehead and @jsancio

kamilchm commented 9 years ago

As @bobrik said, I would stick with POSIX signals. Sending SIGTERM is a standard way to do this: http://en.wikipedia.org/wiki/Unix_signal#POSIX_signals

> The SIGTERM signal is sent to a process to request its termination. Unlike the SIGKILL signal, it can be caught and interpreted or ignored by the process. This allows the process to perform nice termination releasing resources and saving state if appropriate.

Applications controlled by other tools like http://supervisord.org/configuration.html#program-x-section-settings expect it:

> The number of seconds to wait for the OS to return a SIGCHILD to supervisord after the program has been sent a stopsignal. If this number of seconds elapses before supervisord receives a SIGCHILD from the process, supervisord will attempt to kill it with a final SIGKILL.

In addition, programming languages have standard mechanisms to handle shutdown: http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#addShutdownHook%28java.lang.Thread%29

Furthermore, such a solution is universal and not limited to HTTP apps.

Doesn't Mesos already do it this way? https://github.com/apache/mesos/blob/2985ae05634038b70f974bbfed6b52fe47231418/src/slave/flags.hpp#L139
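For what it's worth, the agent-side timeouts involved can be raised when starting the Mesos slave; a sketch with illustrative values (master address and flag values are just examples):

```bash
# Example mesos-slave invocation with a longer shutdown grace period (values are illustrative).
mesos-slave \
  --master=zk://zk1:2181/mesos \
  --containerizers=docker,mesos \
  --executor_shutdown_grace_period=60secs \
  --docker_stop_timeout=55secs
```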

Ferdiism commented 9 years ago

see #1579

tysonnorris commented 9 years ago

I would suggest that a graceful shutdown (e.g. to allow removal from a load balancer) would also be part of the transition of old instances during upgrade deployment workflows.

I'm not sure if that was part of the original intention of this issue, but it's the use case I'm researching to accomplish better availability during upgrades (without independently managing multiple marathon apps etc).

mikljohansson commented 9 years ago

Currently trying to solve the graceful rolling upgrades problem and it'd help immensely if the upgrade & restart flow was adjusted as suggested in #1147, e.g.

  1. New tasks are gradually staged, started and become healthy
  2. Old tasks are gradually removed from the Marathon API and event bus, resulting in old tasks being removed quickly from service discovery.
  3. After some timeout old tasks are killed (i.e. using the regular Mesos SIGTERM -> grace period -> SIGKILL task shutdown process).

Services need to trap and handle the SIGTERM signal gracefully, i.e. complete outstanding requests and exit nicely. Which as noted in previous comments is standard POSIX behavior anyway.

Currently I'm trying to solve the problem, very sub-optimally, by the following (a sketch of the wrapper follows the list):

  1. Increasing the Mesos grace period parameters to e.g. executor_shutdown_grace_period=60 and docker_stop_timeout=55
  2. Trapping SIGTERM in services and immediately start failing the health check (e.g. return HTTP 503 Service Unavailable).
  3. Waiting 45s before actually exiting the service, so that service discovery has time to react. The idea being that both outstanding requests and new incoming requests (initiated before service discovery has reacted to the failing health check) can finish.
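A minimal wrapper sketch of steps 2 and 3, assuming the service reports itself unhealthy whenever a flag file exists (the flag file and the 45-second delay are assumptions, not anything Marathon or Mesos provides):

```bash
#!/usr/bin/env bash
# Sketch of steps 2 and 3 above: the wrapped service is assumed to fail its
# health check (e.g. return HTTP 503) whenever /tmp/draining exists.

"$@" &                      # start the real service in the background
child=$!

drain() {
  touch /tmp/draining       # health check starts failing from now on
  sleep 45                  # give service discovery / load balancers time to react
  kill -TERM "$child"       # then let the service finish outstanding requests and exit
}
trap drain TERM

wait "$child"               # returns early when SIGTERM interrupts it
wait "$child"               # wait again so we don't exit before the child is done
```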

SEJeff commented 9 years ago

You know... if this hasn't been implemented, as a first stab, would you be willing to do the same thing that Apache Aurora does? This would certainly make it easy to move applications from Aurora to Marathon.

The Aurora documentation details how they do it. The gist of it is that the app needs to implement three endpoints.

  1. A /health http endpoint, which returns a http status code of 200 and the literal text of ok
  2. A /quitquitquit endpoint which only accepts http POST requests. When the scheduler (read Marathon) sends a POST here, the application will start a graceful shutdown and immediately start failing any healthcheck to /health
  3. An /abortabortabort endpoint, which only accepts http POST requests. Ten seconds from when /quitquitquit is signaled, a second http POST request is made to /abortabortabort, which triggers the application to shut down immediately.
  4. After the timeout is hit for /abortabortabort, then the standard mesos SIGTERM, wait, SIGKILL conventions are followed.

Then in the Aurora configuration for that application, it needs to specify a health port, which is the port these HTTP requests are made to.

I've got an Aurora-aware load balancer, aurproxy, doing HTTP HEAD requests to the /health endpoint every 3 seconds. This results in Aurora being able to do rolling restarts, with the load balancer draining instances before they are forcibly killed.

Note that the above three URLs are configurable, and there is a patch in the works to make the timeouts configurable as well. It would be great if Marathon followed this same pattern, which, IMO, works beautifully.
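A rough sketch of that sequence from the scheduler's side, assuming curl, a placeholder health endpoint and task PID, and the 10-second spacing mentioned above:

```bash
#!/usr/bin/env bash
# Sketch of the Aurora-style sequence above; HEALTH_ENDPOINT and TASK_PID are placeholders.
HEALTH_ENDPOINT="$1"   # e.g. 10.0.0.5:31005
TASK_PID="$2"

curl -s -X POST "http://$HEALTH_ENDPOINT/quitquitquit"    # start graceful shutdown; /health begins failing
sleep 10
curl -s -X POST "http://$HEALTH_ENDPOINT/abortabortabort" # ask for an immediate shutdown
sleep 10

# Fall back to the standard Mesos behaviour if the task is still alive.
if kill -0 "$TASK_PID" 2>/dev/null; then
  kill -TERM "$TASK_PID"
  sleep 3
  kill -0 "$TASK_PID" 2>/dev/null && kill -KILL "$TASK_PID"
fi
```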

adomeier commented 9 years ago

@mikljohansson - Did increasing the Mesos grace period work for you? We are looking at tackling it that way for now as well.

Thanks for your thoughts!

EvanKrall commented 9 years ago

My ideal of a graceful shutdown sequence would have these properties:

An implementation which satisfies those could look like this: An app definition could include a set of commands / URIs (much like healthcheck cmd/uri) that Marathon would use to control this process:

For convenience, the is_safe_to_kill check should probably get passed the timestamp when draining started, so you could easily implement timeout-based draining (remove task from load balancer, wait for N seconds).

Marathon, when it wants to kill a task, would first call drain, then poll is_safe_to_kill until that returns true, then kill the task.
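A sketch of what Marathon's side of that flow could look like, with hypothetical drain / is_safe_to_kill URIs and an illustrative timeout (none of these exist in Marathon today):

```bash
#!/usr/bin/env bash
# Hypothetical drain / is_safe_to_kill flow; URIs, task address and timeout are illustrative only.
TASK="$1"       # e.g. 10.0.0.5:31005
TIMEOUT=300

started=$(date +%s)
# Tell the task (or an external system) to start draining, passing the start timestamp.
curl -s -X POST "http://$TASK/drain?started_at=$started"

# Poll is_safe_to_kill until it says yes or the timeout expires.
while [ $(( $(date +%s) - started )) -lt "$TIMEOUT" ]; do
  if curl -s "http://$TASK/is_safe_to_kill" | grep -q true; then
    break
  fi
  sleep 5
done

echo "safe to kill (or timed out): proceed with the normal SIGTERM/SIGKILL sequence"
```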

adomeier commented 9 years ago

Does anyone on this thread have a working fork they are using today? We're struggling to get a working rev that has a different shutdown grace period. Thanks!

mikljohansson commented 9 years ago

@adomeier sorry for being so slow to respond. We hadn't actually gotten around to trying out the workaround. I've done so now and it works really well. I've written up what's needed to make it work on this page:

https://github.com/meltwater/proxymatic#rolling-upgradesrestarts

In essence

adomeier commented 9 years ago

Thanks @mikljohansson! That's super interesting. We ended up getting around it for now, as a lot of our apps that maintain state actually spawn a child process, so killing the parent allows our child jobs to finish gracefully without the kill interrupting the transaction. We're continuing down an active integration with the tech stack though, and I'm sure we'll see other issues with other apps. Really appreciate your response; it's very helpful!

CC - @jolexa

sielaq commented 9 years ago

Yeah, we do exactly the same in our project: we trap SIGTERM https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/frameworks/start.sh#L59 which means that each container has to be started with a wrapper.

mikljohansson commented 8 years ago

The workaround of trapping SIGTERM and starting to fail the health check only works when you have full control over the services; we run a lot of both our own and 3rd-party services. It's also rather involved to implement in some frameworks/languages, we've noticed. Nor is it very immediate, since the load balancers need to wait up to HealthCheckDefinition.intervalSeconds to detect that the task is draining.

Our motivation is that we'd like the ability to continuously roll out new app versions without interrupting service or failing any requests while the rollout happens. To solve this without having to modify every single service I've investigated some possible routes. An idea I'd very much appreciate feedback on is:

Some pseudo-code that tries to express this

marathon.proto

```protobuf
message ServiceDefinition {
  ...
  optional int64 drainDelay = 26 [default = 0]; // stored as Millis
}

message MarathonTask {
  ...
  optional int64 drained_at = 12;
}
```

TaskTracker.scala

```scala
def getDefinition(taskId: String): ServiceDefinition = ...

def drainTask(taskId: String) {
  val builder = getTask(taskId).toBuilder
  builder.setDrainedAt(System.currentTimeMillis)
  updateTask(builder.build())

  // Kill immediately if no drain delay was configured for this app.
  if (getDefinition(taskId).drainDelay == 0) {
    driver.killTask(taskId)
  }
}

def determineOverdueTasks(now: Timestamp): Iterable[MarathonTask] = {
  ...
  // A drained task becomes overdue once its drain delay has elapsed.
  else if (task.hasDrainedAt && task.getDrainedAt < (nowMillis - getDefinition(task.getId).drainDelay)) {
    true
  }
}
```

kamilchm commented 8 years ago

One of the longest threads here ;) @aquamatthias, @kolloch ?

EvanKrall commented 8 years ago

@mikljohansson I generally like your idea, but drainDelay is less powerful than having a pluggable drain/is_safe_to_drain/stop_draining.

Currently we're using fixed delays, but for some types of tasks where shutdown takes a variable amount of time (e.g. queue workers where some jobs may take a long time), it would be nice to have a smarter drain method.

mikljohansson commented 8 years ago

@EvanKrall for sure, the drainDelay setting would not allow for the flexible behaviour of pluggable state-check callbacks where the app can decide when it's finished processing. Pluggable checks could also, on average, lead to quicker shutdown, since the app can shut down as soon as processing is done instead of always having to wait drainDelay seconds.

However, drainDelay would certainly be easier to use with existing apps, since you wouldn't need to modify all your services and add HTTP servers and drain-check endpoints to them. We (and I imagine others) run a fair number of third-party services on Marathon that do understand SIGTERM (since it's a Unix standard). Just some examples could be RabbitMQ, Jenkins, Graphite/statsd/carbon, logstash-forwarder/indexer, .. And even if they do expose HTTP endpoints, it's rarely feasible to modify 3rd-party software to have embedded Marathon-specific endpoints.

drainDelay would possibly be easier to implement in Marathon, since it wouldn't need to make additional application HTTP calls; it's more about changing the internal task shutdown process.

For your specific case with workers that need to finish some long jobs, perhaps you can use the initial workaround I suggested, since in this case you don't have an external proxy/load balancer that needs updating; rather, the workers listen on a message queue. They can themselves decide when to disconnect from the queue and stop accepting new items, for example in response to a SIGTERM. With an external load balancer it's hard for services to notify it themselves that they're stopping, and with 3rd-party services you can't really modify them to know about your service discovery solution.

With batch workers the shutdown process could perhaps use existing functionality like

EvanKrall commented 8 years ago

The drain/safe_to_kill/stop_draining endpoints wouldn't necessarily have to be implemented by your app - ideally these could call out to arbitrary URLs, like one exposed by your service discovery system, so you don't need to modify your apps.

I'm also not proposing changing the way Marathon kills tasks - presumably SIGTERM would still be sent once Marathon has decided that it's safe to kill that task.
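For instance, the drain hook in such a setup might just be a curl against a discovery-layer agent on the host rather than the app itself; a made-up sketch (agent address, service name, port and paths are all placeholders):

```bash
# Hypothetical: ask a local service-discovery agent to pull this backend out of
# rotation, then check how many connections are still open against it.
DISCOVERY_AGENT="http://localhost:3333"   # placeholder address
SERVICE="myservice"
PORT=31005

curl -s -X POST "$DISCOVERY_AGENT/drain/$SERVICE/$PORT"
curl -s "$DISCOVERY_AGENT/connections/$SERVICE/$PORT"   # "safe to kill" once this reaches 0
```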

mikljohansson commented 8 years ago

Good point about the callbacks not needing to be served by the app itself, that could for sure be useful in some setups.

In the case where one has HAProxy or another proxy (like Pen, ELB or F5), it's less obvious how to detect, or ask the proxy, when it has finished all outstanding requests/connections for a certain backend.

One might typically be running several proxies on multiple machines for high availability, in which case Marathon would need to check all of them, or we'd need an intermediary "proxy orchestrator", leading to a bit more architectural complexity.

Yet another complication for us would be that we are using both central proxies for cluster-external communication and per-Mesos-slave proxies for in-cluster communication, as I believe e.g. Airbnb is doing. In-cluster traffic is then routed directly between slave nodes, making each slave more of a nicely contained compute unit, and can potentially use availability-zone-aware weights on backends to decrease bandwidth costs.

EvanKrall commented 8 years ago

In the case of a distributed proxy system like SmartStack, you'd need a central place to initiate a drain, and a central place to check the status of draining. In our case, we use hacheck to manage draining. To detect the status of draining, you'd probably want to ask the node where the service is running to count how many connections are open against that container, etc.

The point is that these decisions are very specific to the environment where Marathon is deployed. While a fixed timer will be sufficient for 95% of cases, I think these few HTTP hooks would be powerful enough to support almost every advanced use case.

mikljohansson commented 8 years ago

Yep, very good point. It'd definitely be much more flexible and correct with the custom drain check callbacks.

Thanks for the hacheck link, I didn't know about that design pattern! It could nicely enable sidekick health and drain checkers for services that don't support it themselves.

brndnmtthws commented 8 years ago

FWIW, I merged https://github.com/mesosphere/marathon-lb/pull/51 into marathon-lb.

SEJeff commented 8 years ago

@brndnmtthws Thanks, this is a quite elegant way of doing it so long as you're using marathon-lb.

malterb commented 8 years ago

I have been trying to capture the SIGTERM, but so far I have been unsuccessful.

I am running a Play framework app which is executed by Marathon through sh -c ./bin/app and gets killed immediately (although executor_shutdown_grace_period=120s is set).

I have also tried it with a SIGTERM-trapping wrapper script:

```bash
#!/bin/bash
# Forward SIGTERM to the wrapped child process so it can shut down gracefully.
_term() {
  echo "Caught SIGTERM signal!"
  kill -TERM "$child" 2>/dev/null
}

trap _term SIGTERM

# Start the actual command in the background and wait for it to finish.
echo "Executing $@";
$@ &

child=$!
wait "$child"
```

which works locally, but not when executing via Marathon. I get the following log output (the above script is called trapsigterm.sh):

```
Sending SIGTERM to process tree at pid 29255
Killing the following process trees:
[ 
-+- 29255 sh -c ./trapsigterm.sh ./app*/bin/app -Dhttp.port=$PORT0 -mem 4900 -J-server 
 \-+- 29256 /bin/bash ./trapsigterm.sh ./app-0.1-SNAPSHOT/bin/app -Dhttp.port=21927 -mem 4900 -J-server 
   \--- 29257 java -Duser.dir=/tmp/mesos/slaves/....]
Command terminated with signal Terminated (pid: 29255)
application - Received shutdown command. Waiting for 60s to let outstanding requests finish
application - shutdown imminent: 60s left
```

and then it's terminated.

I would appreciate any help :)

sielaq commented 8 years ago

@elmalto do not use sh but bash, due to the fact that sh does not propagate signals, and make sure your bash is not linked to sh.

malterb commented 8 years ago

@sielaq Thank you. But how do I set it? I have been trying to find documentation on how to do it. I simply pass app*/bin/app -Dhttp.port=$PORT0 -mem 4900 -J-server as a command to marathon and then marathon wraps it with sh -c.

sielaq commented 8 years ago

@elmalto I'm using a Docker container, so for that I'm using --entrypoint, https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/frameworks/ubuntu-oracle-java8/Dockerfile#L30 and then I'm using args only as arguments for the script.

but you can try with args, something like: [ '/bin/bash', '-c', './trapsigterm.sh', './app*/bin/app -Dhttp.port=$PORT0 -mem 4900 -J-server' ]

sielaq commented 8 years ago

@elmalto you can also work around it by linking sh to point to bash (on all slaves), just to give it a try.
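On each slave that would be something like the following (a quick test-only hack, since it changes the system shell):

```bash
# Point /bin/sh at bash so signals are propagated to child processes (test only).
ln -sf /bin/bash /bin/sh
```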

malterb commented 8 years ago

@sielaq symlinking /bin/bash to /bin/sh did indeed solve the problem. But now I am getting `Process 6319 did not terminate after 3secs, sending SIGKILL to process tree at 6319`.

Do I need to set `EXECUTOR_SIGNAL_ESCALATION_TIMEOUT` as well?

executor_shutdown_grace_period is set to 120secs

sielaq commented 8 years ago

hmm I'm using --docker_stop_timeout, and --executor_registration_timeout but in your case executor_shutdown_grace_period should work.

malterb commented 8 years ago

From my understanding the hardcoded EXECUTOR_SIGNAL_ESCALATION_TIMEOUT is sending the kill -9 after 3s though. Is there any way to keep this from happening?

sielaq commented 8 years ago

@elmalto, I cannot help you at that point :( sorry

tysonnorris commented 8 years ago

BTW, having recently learned about the "readinessChecks" introduced to Marathon, it may also be useful to consider the shutdown side as the inverse of readinessChecks, something like "terminationChecks". Of course non-HTTP apps and third-party apps will not be able to leverage that, but making these features related would be a useful way to go.

e.g. if I implement terminationChecks in my app, I could use that as a way to allow my app to signal that "it is going to be terminated and cleanup should begin" (which is different from "it is going to be terminated immediately"). In a typical case, this impl could be: remove it from the load balancer, but allow any in-flight requests that have not arrived yet to still be serviced until the load balancer update is complete.
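Purely as a sketch of the idea above ("terminationChecks" is not an existing Marathon feature; the field and values below are made up to mirror readinessChecks):

```bash
# Hypothetical app fragment mirroring readinessChecks; "terminationChecks" does not
# exist in Marathon, this is only the shape the idea above suggests.
cat <<'EOF' > app-with-termination-check.json
{
  "id": "/myapp",
  "terminationChecks": [
    {
      "protocol": "HTTP",
      "path": "/terminating",
      "intervalSeconds": 5,
      "timeoutSeconds": 60
    }
  ]
}
EOF
```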

air commented 8 years ago

The new Mesos 0.29 KillPolicy was just added to Marathon, so you can specify how long in seconds Marathon will wait after sending a SIGTERM, before killing the process. See https://github.com/mesosphere/marathon/commit/4a6cd7afe66c91741d835a55933da67b72c331de

meichstedt commented 8 years ago

Just to clarify @air's comment: you can specify how long in seconds the executor will wait after sending a SIGTERM, before killing the process. If someone uses a custom executor, it's up to that executor's implementation to handle the grace period.

gkleiman commented 8 years ago

@aquamatthias @meichstedt isn't this solved by taskKillGracePeriodSeconds?

timoreimann commented 8 years ago

@gkleiman as brought up by @sttts in https://github.com/mesosphere/marathon/issues/712#issuecomment-59648280, is there already a way to emit a draining event to take tasks out of load balancing rotation (possibly through the failure of the readiness probe)?

As a Bamboo user, I find this complementary to the custom SIGTERM handler.

aquamatthias commented 8 years ago

@timoreimann The taskKillGracePeriod is the time between a task receiving a SIGTERM and a SIGKILL being sent if the task has not died yet. The task goes into the TASK_KILLING state, so external tools (e.g. service discovery / load balancers) should not route to those tasks.
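For example, something like the following should set it per app (the Marathon address, app id and 60-second value are placeholders; the field name is the one mentioned above):

```bash
# Illustrative: give an app 60 seconds between SIGTERM and SIGKILL via the Marathon API.
curl -s -X PUT http://marathon.example.com:8080/v2/apps/myapp \
  -H 'Content-Type: application/json' \
  -d '{"taskKillGracePeriodSeconds": 60}'
```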

timoreimann commented 8 years ago

@aquamatthias thanks for the info. Apart from the changelog, is there already a page somewhere documenting the behavior?

aquamatthias commented 8 years ago

@timoreimann very good point - clear documentation is missing at that point. Even the Mesos documentation on that feature is poor. I created #4323 to get that fixed.

brndnmtthws commented 8 years ago

@aquamatthias has anyone validated that this works correctly? You could validate it based on a similar test I wrote for MLB:

https://github.com/mesosphere/marathon-lb/blob/master/tests/run-benchmark.rb
https://github.com/mesosphere/marathon-lb/blob/master/tests/1-nginx.json
https://github.com/mesosphere/marathon-lb/blob/master/tests/Dockerfile

Edit: I only ask because in my testing it does not appear to work, where 'work' is defined as not seeing 500s from my simple nginx app.