There is a patch in flight that makes the timeout configurable: https://reviews.apache.org/r/25434/
Signals and a few seconds are not easy to manage. For TCP-based services there may not be a better option, but I don't know how this would work for Docker tasks. For HTTP services, my suggestion would be to add an option, just like the health check, to leverage a shutdown URI. That way the application can take the necessary time to drain and possibly stop responding to health checks.
I linked a ticket from the bamboo haproxy project. It looks like we probably also need some kind of "draining" state of a task with corresponding status events, in order to make bamboo and similar projects behave well on shutdown of tasks.
@sttts that's a nice clean way to handle it.
It would be cool if there were a configurable pre-kill hook, like reaching an instance:port/kill URL before killing, so we can implement better logic in the instance application (for example, we could mark our own health checks as unhealthy so the instance gets taken out of HAProxy). And of course this configurable timeout would also be needed.
Apps that are super-fast to deploy (like golang apps in docker containers on top of a busybox image that is already downloaded) can end up in a state where all old tasks are killed but no new tasks are present in service discovery, because the update hasn't happened yet.
I think we could ease transition to mesos/marathon with some timeout for tasks when they are not exposed in marathon api for service discovery, but still running.
Hiding from the API or marking as unhealthy before killing, with a configurable timeout, would improve things a lot for connect-request-response-disconnect apps (think PHP).
Clever apps can catch signals and do a draining shutdown; with a configurable per-app timeout between SIGTERM and SIGKILL, that should work pretty well.
Ideally I would like something like this:
Marathon asks instances to stop themselves by sending a REST request (PUT to the shutdown path). The response could be 200 OK or 202 Accepted for instances willing to shut down, or 409 Conflict to indicate that the instance would prefer not to shut down. If Marathon really, really wants the instance to stop, it sends a DELETE to the shutdown path. This one must be obeyed. Alternatives are a command or TCP, just like health checking.
Marathon would use the configured timeout to ensure the task actually dies. If it does not die within the timeout, it is killed off using Mesos as happens now.
If not enough instances respond with 200 OK or 202 Accepted, then marathon could retry through the task set for a configurable amount of time, and finally, if there are still more to stop, use the DELETE to make it unconditional.
This kind of process would allow instances to drain any current workload and stop themselves in an orderly fashion. Not only that, some instances doing important work could effectively say "pick someone else please".
You could do it without the option for declining a shutdown. In effect, you are going to make shutdown mandatory, and wait a configured amount of time for it to happen in an orderly fashion, and after that timeout just follow the current procedure. But having the option to decline a shutdown allows you to find idle instances of services. An idle service can shutdown immediately with no consequences.
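A minimal sketch of what the scheduler side of such a negotiation might look like; the /shutdown path, the Instance type and the helper names are all hypothetical, not an existing Marathon API:

import java.net.{HttpURLConnection, URL}

object ShutdownNegotiation {
  case class Instance(host: String, port: Int)

  // Send an HTTP request to the instance's (hypothetical) shutdown path.
  def send(method: String, instance: Instance): Int = {
    val conn = new URL(s"http://${instance.host}:${instance.port}/shutdown")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod(method)
    val code = conn.getResponseCode // performs the request
    conn.disconnect()
    code
  }

  // PUT asks politely: 200/202 means the instance will stop,
  // 409 means "please pick someone else".
  def askToStop(instance: Instance): Boolean =
    send("PUT", instance) match {
      case 200 | 202 => true
      case _         => false
    }

  // DELETE must be obeyed; if the task still does not die within the
  // configured timeout, it is killed via Mesos as happens now.
  def forceStop(instance: Instance): Unit = send("DELETE", instance)
}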
CC @BenWhitehead and @jsancio
As @bobrik said, I would stick with POSIX signals. Sending SIGTERM is a standard way to do this http://en.wikipedia.org/wiki/Unix_signal#POSIX_signals
The SIGTERM signal is sent to a process to request its termination. Unlike the SIGKILL signal, it can be caught and interpreted or ignored by the process. This allows the process to perform nice termination releasing resources and saving state if appropriate.
Applications controlled by other tools like http://supervisord.org/configuration.html#program-x-section-settings expect it:
The number of seconds to wait for the OS to return a SIGCHILD to supervisord after the program has been sent a stopsignal. If this number of seconds elapses before supervisord receives a SIGCHILD from the process, supervisord will attempt to kill it with a final SIGKILL.
In addition, programming languages have standard mechanisms to handle shutdown http://docs.oracle.com/javase/7/docs/api/java/lang/Runtime.html#addShutdownHook%28java.lang.Thread%29
Furthermore, such a solution is universal and not limited to HTTP apps only.
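For instance, a minimal sketch of a JVM app handling SIGTERM via the shutdown-hook mechanism linked above (the drain logic is a placeholder):

object GracefulMain extends App {
  // Runs when the JVM receives SIGTERM; the process exits only after
  // all shutdown hooks have returned.
  Runtime.getRuntime.addShutdownHook(new Thread(() => {
    println("SIGTERM received: draining in-flight work")
    // stop accepting new work, finish outstanding requests,
    // release resources, save state
  }))

  while (true) Thread.sleep(1000) // stands in for the real application work
}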
Doesn't Mesos already do it this way? https://github.com/apache/mesos/blob/2985ae05634038b70f974bbfed6b52fe47231418/src/slave/flags.hpp#L139
see #1579
I would suggest that a graceful shutdown (e.g. to allow removal from a load balancer) would also be part of the transition of old instances during upgrade deployment workflows.
I'm not sure if that was part of the original intention of this issue, but it's the use case I'm researching to accomplish better availability during upgrades (without independently managing multiple marathon apps etc).
Currently trying to solve the graceful rolling upgrades problem and it'd help immensely if the upgrade & restart flow was adjusted as suggested in #1147, e.g.
Services need to trap and handle the SIGTERM signal gracefully, i.e. complete outstanding requests and exit nicely, which, as noted in previous comments, is standard POSIX behavior anyway.
Currently trying to very sub-optimally solve the problem by
You know... if this hasn't been implemented, as a first stab, would you be willing to do the same thing that Apache Aurora does? This would certainly make it easy to move applications from Aurora to Marathon.
The Aurora documentation details how they do it. The gist of it is that the app needs to implement 3 things.
- A /health HTTP endpoint, which returns an HTTP status code of 200 and the literal text ok.
- A /quitquitquit endpoint, which only accepts HTTP POST requests. When the scheduler (read: Marathon) sends a POST here, the application will start a graceful shutdown and immediately start failing any health check to /health.
- An /abortabortabort endpoint, which only accepts HTTP POST requests. Ten seconds from when /quitquitquit is signaled, a second HTTP POST request is made to /abortabortabort, which triggers the application to shut down immediately. If the application is still running after /abortabortabort, then the standard Mesos SIGTERM, wait, SIGKILL conventions are followed.

Then in the Aurora configuration for that application, it needs to specify a health port, which is the port where these HTTP requests will be made.
I've got an Aurora-aware load balancer, aurproxy, doing HTTP HEAD requests to the /health endpoint every 3 seconds. This results in Aurora being able to do rolling restarts and the load balancer draining instances before they are forcibly killed.
Note that the above three URLs are configurable, and there is a patch in the works for the timeouts to be as well. It would be great if Marathon followed this same pattern, which, IMO, works beautifully.
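To make the pattern concrete, here is a minimal sketch of the three endpoints using the JDK's built-in HTTP server; the port and wiring are illustrative, not Aurora's actual implementation, and the POST-only method checks are omitted for brevity:

import com.sun.net.httpserver.{HttpExchange, HttpServer}
import java.net.InetSocketAddress

object AuroraStyleEndpoints extends App {
  @volatile var healthy = true

  def respond(ex: HttpExchange, code: Int, body: String): Unit = {
    ex.sendResponseHeaders(code, body.length)
    ex.getResponseBody.write(body.getBytes)
    ex.close()
  }

  val server = HttpServer.create(new InetSocketAddress(8080), 0)
  server.createContext("/health", (ex: HttpExchange) =>
    if (healthy) respond(ex, 200, "ok") else respond(ex, 503, "draining"))
  server.createContext("/quitquitquit", (ex: HttpExchange) => {
    healthy = false // start failing health checks immediately
    respond(ex, 200, "draining started")
    // begin graceful shutdown in the background here
  })
  server.createContext("/abortabortabort", (ex: HttpExchange) => {
    respond(ex, 200, "aborting")
    sys.exit(0) // shut down immediately
  })
  server.start()
}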
@mikljohansson - Did increasing the Mesos grace period work for you? We are looking at tackling it that way for now as well.
Thanks for your thoughts!
My ideal of a graceful shutdown sequence would have these properties:
An implementation which satisfies those could look like this: an app definition could include a set of commands / URIs (much like the health check cmd/URI) that Marathon would use to control this process:

- drain: this command / URI would be run / POSTed to when a task is scheduled to be killed. This could remove the task from load balancers, etc.
- is_safe_to_kill: this command / URI would be run / GET from, and the return code would decide whether Marathon will kill the task yet. (Otherwise, it will poll again later.)
- stop_draining: this is run / POSTed to when we decide not to kill an instance after all.

For convenience, the is_safe_to_kill check should probably get passed the timestamp when draining started, so you could easily implement timeout-based draining. (Remove the task from the load balancer, wait for N seconds.)

Marathon, when it wants to kill a task, would first call drain, then poll is_safe_to_kill until that returns true, then kill the task.
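In pseudo-Scala, that kill sequence could look like the sketch below; the hook names come from the proposal above and none of this is an existing Marathon API:

import scala.concurrent.duration._

trait DrainHooks {
  def drain(taskId: String): Unit           // run / POST when a kill is scheduled
  def isSafeToKill(taskId: String): Boolean // run / GET; decides if we may kill yet
  def stopDraining(taskId: String): Unit    // run / POST if the kill is cancelled
}

object GracefulKill {
  def killGracefully(taskId: String, hooks: DrainHooks,
                     kill: String => Unit, timeout: FiniteDuration): Unit = {
    val deadline = timeout.fromNow
    hooks.drain(taskId) // e.g. remove the task from load balancers
    // Poll until the app says it is safe, or an operator-set timeout elapses.
    while (!hooks.isSafeToKill(taskId) && deadline.hasTimeLeft())
      Thread.sleep(1000)
    kill(taskId) // then the normal SIGTERM / SIGKILL path applies
  }
}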
Does anyone on this thread have a working fork they are using today? We're struggling to get a working rev that has a different shutdown grace period. Thanks!
@adomeier sorry for being so slow to respond. We hadn't actually gotten around to trying out the workaround. I've done so now and it does work really well actually. I've written about what's needed to make it work on this page
https://github.com/meltwater/proxymatic#rolling-upgradesrestarts
In essence
Thanks @mikljohansson! That's super interesting. We ended up getting around it for now, as a lot of our apps that maintain state actually spawn a child process, so killing the parent allows our child jobs to finish gracefully without the kill interrupting the transaction. We're continuing down an active integration with the tech stack though, and I'm sure we'll see other issues with other apps. Really appreciate your response, it's very helpful!
CC - @jolexa
Yea we do exactly the same in our project, we trap SIGTERM https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/frameworks/start.sh#L59 which means that each container has to be started with a wrapper.
The workaround to trap SIGTERM and start failing the health check only works when having full control over the services; we have a lot of both our own and third-party services. It's also rather involved to implement in some frameworks/languages, we've noticed. Neither is it very immediate, since the load balancers need to wait up to HealthCheckDefinition.intervalSeconds to detect that the task is draining.
Our motivation is that we'd like the ability to continuously roll out new app versions without interruption to service or failing any requests while the rollout happens. To solve this without having to modify every single service, I've investigated some possible routes. An idea I'd very much appreciate feedback on is:
Use a task change event (via the normal event bus) to detect immediately when Marathon calls driver.killTask() and remove the task from the load balancers. For example, by adding an int64 drained_at field to MarathonTask and setting it when driver.killTask() is called.

This drain-and-update-status could be abstracted behind a TaskTracker.drainTask(...) method that updates drainedAt as a normal task state change, which would then trigger an event to be distributed.

The driver.killTask() calls are somewhat spread out in source and tests, so this would need some refactoring. But then perhaps not all the driver.killTask() calls need to go via TaskTracker.drainTask(); modifying only the TaskKillActor that takes part in rolling upgrades could be a great start.

This leaves the problem of giving the load balancers some seconds to remove the task and prevent new connections from being sent there, before the task is actually sent the SIGTERM. This could be solved using a wrapper script like @sielaq referenced, but even better would be a per-app configurable delay between TaskTracker.drainTask() and driver.killTask().

This delay could perhaps be implemented by TaskTracker.determineOverdueTasks(), called from KillOverdueTasksActor, which would scan for and kill tasks whose drainDelay has elapsed. This would hopefully ensure that Marathon failures like the ones #2327 ran into wouldn't impact the draining behaviour.
Some pseudo-code that tries to express this
marathon.proto
message ServiceDefinition {
...
optional int64 drainDelay = 26 [default = 0]; // stored as Millis
}
message MarathonTask {
...
optional int64 drained_at = 12;
}
TaskTracker.scala
def getDefinition(taskId: String): ServiceDefinition = ...

def drainTask(taskId: String): Unit = {
  // Record the drain timestamp as a normal task state change,
  // which triggers an event on the event bus.
  val builder = getTask(taskId).toBuilder
  builder.setDrainedAt(System.currentTimeMillis)
  updateTask(builder.build())
  // No drain delay configured: keep the current kill-immediately behaviour.
  if (getDefinition(taskId).drainDelay == 0) {
    driver.killTask(taskId)
  }
}

def determineOverdueTasks(now: Timestamp): Iterable[MarathonTask] = {
  ...
  // A draining task becomes overdue once its drain delay has elapsed.
  else if (task.hasDrainedAt && task.getDrainedAt < (nowMillis - getDefinition(task.getId).drainDelay)) {
    true
  }
}
One of the longest threads here ;) @aquamatthias, @kolloch ?
@mikljohansson I generally like your idea, but drainDelay is less powerful than having a pluggable drain/is_safe_to_kill/stop_draining.
Currently we're using fixed delays, but for some types of tasks where shutdown takes a variable amount of time (e.g. queue workers where some jobs may take a long time), it would be nice to have a smarter drain method.
@EvanKrall for sure the drainDelay setting would not allow for the flexible behaviour of pluggable state-check callbacks, where the app can decide when it's finished processing. Pluggable checks could also on average lead to quicker shutdown, since the app can shut down as soon as processing is done instead of always having to wait drainDelay seconds.

However, drainDelay would certainly be easier to use with existing apps, since you wouldn't need to modify all your services and add HTTP servers and drain-check endpoints to them. We (and I imagine others) run a fair amount of third-party services on Marathon that do understand SIGTERM (since it's a Unix standard). Some examples are RabbitMQ, Jenkins, Graphite/statsd/carbon, logstash-forwarder/indexer. And even if they do expose HTTP endpoints, it's rarely feasible to modify third-party software to add embedded Marathon-specific endpoints.

drainDelay would possibly also be easier to implement in Marathon, since it wouldn't need to make additional application HTTP calls; it's more about changing the internal task shutdown process.
In your specific case, with workers that need to finish some long jobs, perhaps you can use the initial workaround I suggested. In this case you don't have an external proxy/load balancer that needs updating; rather, the workers listen on a message queue. They can themselves decide when to disconnect from the queue and stop accepting new items, for example in response to a SIGTERM. With an external load balancer it's hard for services to notify it themselves that they're stopping, and with third-party services you can't really modify them to know about your service discovery solution.
With batch workers the shutdown process could perhaps use existing functionality like
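As an illustration of that SIGTERM-based draining for a queue worker, a minimal sketch (the queue API is left abstract):

import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

object QueueWorker extends App {
  val draining = new AtomicBoolean(false)
  val drained  = new CountDownLatch(1)

  // On SIGTERM the JVM runs shutdown hooks and halts once they return,
  // so this hook blocks until the job in hand has finished.
  Runtime.getRuntime.addShutdownHook(new Thread(() => {
    draining.set(true)
    drained.await()
  }))

  while (!draining.get) {
    // queue.take() / process(job) would go here; once draining starts,
    // no new jobs are taken and in-flight work finishes first
    Thread.sleep(100)
  }
  drained.countDown()
}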
The drain/safe_to_kill/stop_draining endpoints wouldn't necessarily have to be implemented by your app - ideally these could call out to arbitrary URLs, like one exposed by your service discovery system, so you don't need to modify your apps.
I'm also not proposing changing the way Marathon kills tasks - presumably SIGTERM would still be sent once Marathon has decided that it's safe to kill that task.
Good point about the callbacks not needing to be served by the app itself, that could for sure be useful in some setups.
In the case where one has HAProxy or another proxy (like Pen, ELB, or F5), it's less obvious how to detect or ask the proxy when it has finished all outstanding requests/connections for a certain backend.
One might typically be running several proxies on multiple machines for high availability, in which case Marathon would need to check all of them, or we'd need an intermediary "proxy orchestrator", adding a bit more architectural complexity.
Yet another complication for us is that we are using both central proxies for cluster-external communication, and per-Mesos-slave proxies for in-cluster communication, like I believe e.g. Airbnb is doing. In-cluster traffic is then routed directly between slave nodes, making each slave more of a nicely contained compute unit, and we can potentially use availability-zone-aware weights on backends to decrease bandwidth costs.
In the case of a distributed proxy system like SmartStack, you'd need a central place to initiate a drain, and a central place to check the status of draining. In our case, we use hacheck to manage draining. To detect the status of draining, you'd probably want to ask the node where the service is running to count how many connections are open against that container, etc.
The point is that these decisions are very specific to the environment where Marathon is deployed. While a fixed timer will be sufficient for 95% of cases, I think these few HTTP hooks would be powerful enough to support almost every advanced use case.
Yep, very good point. It'd definitely be much more flexible and correct with the custom drain check callbacks.
Thanks for the hacheck link, I didn't know about that design pattern! It could nicely enable sidekick health and drain checkers for services that don't support them themselves.
FWIW, I merged https://github.com/mesosphere/marathon-lb/pull/51 into marathon-lb.
@brndnmtthws Thanks, this is a quite elegant way of doing it so long as you're using marathon-lb.
I have been trying to capture the SIGTERM, but so far I have been unsuccessful.
I am running a Play Framework app which is executed by Marathon through sh -c ./bin/app and gets killed immediately (although executor_shutdown_grace_period=120s is set).
I have also tried it with a sigterm script:
#!/bin/bash
# Forward SIGTERM to the child process so the app can shut down gracefully.
_term() {
  echo "Caught SIGTERM signal!"
  kill -TERM "$child" 2>/dev/null
}
trap _term SIGTERM
echo "Executing $@"
# Quote "$@" so arguments containing spaces are passed through intact.
"$@" &
child=$!
wait "$child"
which works locally, but not when executing via marathon. I get the following log output (above script is called trapsigterm.sh)
Sending SIGTERM to process tree at pid 29255
Killing the following process trees:
[
-+- 29255 sh -c ./trapsigterm.sh ./app*/bin/app -Dhttp.port=$PORT0 -mem 4900 -J-server
\-+- 29256 /bin/bash ./trapsigterm.sh ./app-0.1-SNAPSHOT/bin/app -Dhttp.port=21927 -mem 4900 -J-server
\--- 29257 java -Duser.dir=/tmp/mesos/slaves/....]
Command terminated with signal Terminated (pid: 29255)
application - Received shutdown command. Waiting for 60s to let outstanding requests finish
application - shutdown imminent: 60s left
and then it's terminated.
I would appreciate any help :)
@elmalto do not use sh but bash, because sh does not propagate signals. And be sure that your bash is not linked to sh.
@sielaq Thank you. But how do I set it? I have been trying to find documentation on how to do it. I simply pass app*/bin/app -Dhttp.port=$PORT0 -mem 4900 -J-server as a command to Marathon, and then Marathon wraps it with sh -c.
@elmalto I'm using a Docker container, so for that I'm using --entrypoint, https://github.com/eBayClassifiedsGroup/PanteraS/blob/master/frameworks/ubuntu-oracle-java8/Dockerfile#L30 and then I'm using args only as an argument for the script. But you can try with args, something like:
[ '/bin/bash', '-c', './trapsigterm.sh', './app*/bin/app -Dhttp.port=$PORT0 -mem 4900 -J-server' ]
@elmalto you can also work around it by linking sh to point to bash (on all slaves), just to give it a try.
@sielaq symlinking /bin/bash to /bin/sh did indeed solve the problem. But now I am getting "Process 6319 did not terminate after 3secs, sending SIGKILL to process tree at 6319".
Do I need to set EXECUTOR_SIGNAL_ESCALATION_TIMEOUT as well? executor_shutdown_grace_period is set to 120 secs.
hmm I'm using --docker_stop_timeout and --executor_registration_timeout, but in your case executor_shutdown_grace_period should work.
From my understanding the hardcoded EXECUTOR_SIGNAL_ESCALATION_TIMEOUT is sending the kill -9 after 3s though. Is there any way to keep this from happening?
@elmalto, I cannot help you at that point :( sorry
BTW, having recently learned about the "readinessChecks" introduced to Marathon, it may also be useful to consider the shutdown side as the inverse of readinessChecks, something like "terminationChecks". Of course non-HTTP apps and third-party apps would not be able to leverage that, but making these features related would be a useful way to go.
e.g. if I implement terminationChecks in my app, I could use that as a way to let my app be told that "it is going to be terminated and cleanup should begin" (which is different from "it is going to be terminated immediately"). In a typical case, this implementation could remove the app from the load balancer but allow any in-flight requests to still be serviced until the load balancer update is complete.
The new Mesos 0.29 KillPolicy was just added to Marathon, so you can specify how long in seconds Marathon will wait after sending a SIGTERM, before killing the process. See https://github.com/mesosphere/marathon/commit/4a6cd7afe66c91741d835a55933da67b72c331de
Just to clarify @air's comment: you can specify how long in seconds the executor will wait after sending a SIGTERM, before killing the process. If someone uses a custom executor, it's up to that executor's implementation to handle the grace period.
@aquamatthias @meichstedt isn't this solved by taskKillGracePeriodSeconds?
@gkleiman as brought up by @sttts in https://github.com/mesosphere/marathon/issues/712#issuecomment-59648280, is there already a way to emit a draining event to take tasks out of load balancing rotation (possibly through the failure of the readiness probe)?
As a Bamboo user, I find this complementary to the custom SIGTERM handler.
@timoreimann The taskKillGracePeriod is the time between a task receiving a SIGTERM and being sent a SIGKILL if it has not died yet. The task goes into the TASK_KILLING state, so external tools (e.g. service discovery/load balancers) should not route to those tasks.
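For reference, the grace period is set per app in the app definition; a minimal example, with the id and cmd as placeholders:

{
  "id": "/my-app",
  "cmd": "./bin/app",
  "taskKillGracePeriodSeconds": 60
}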
@aquamatthias thanks for the info. Apart from the changelog, is there already a page somewhere documenting the behavior?
@timoreimann very good point - clear documentation is missing at that point. Even the Mesos documentation on that feature is poor. I created #4323 to get that fixed.
@aquamatthias has anyone validated that this works correctly? You could validate it based on a similar test I wrote for MLB:
https://github.com/mesosphere/marathon-lb/blob/master/tests/run-benchmark.rb https://github.com/mesosphere/marathon-lb/blob/master/tests/1-nginx.json https://github.com/mesosphere/marathon-lb/blob/master/tests/Dockerfile
Edit: I only ask because in my testing it does not appear to work, where 'working' means my simple nginx app doesn't return 500s.
The current shutdown sequence in Mesos is to send a TERM, followed by a KILL 3 seconds later if the process is still running. Some apps have more advanced requirements like deregistering from a load balancer before shutting down. This thread is to collect requirements for a graceful shutdown feature.