jgslima closed this issue 1 year ago.
Probably the implementation might make `ExecutorConfigurationSupport` implement `SmartLifecycle`.

In my opinion, the implementation of stop does not have to be asynchronous. So, only the `stop()` method would be implemented (leaving `stop(Runnable callback)` with its default, synchronous implementation).

In the `stop()` method, the `ExecutorService` would be asked to shut down (but without calling `awaitTerminationIfNecessary()` here). This would make all thread pools start their shutdown. And, in web applications that use graceful shutdown, this would happen in parallel with the graceful shutdown blocking as well.
Afterwards, `destroy()` would see that the `ExecutorService` is already shut down and would call `executor.awaitTermination()`. In this way, the calls to `executor.awaitTermination()` would not happen in parallel, but that does not really matter, since all thread pools will be finishing their tasks in parallel. The total time taken by all the calls to `destroy()` would be, roughly, the greatest (worst) time taken by any one of the pools.
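The split proposed above can be sketched with plain JDK types. This is a hypothetical illustration, not Spring code: `stop()` only initiates shutdown (non-blocking), and `destroy()` later awaits termination, so multiple pools would wind down in parallel.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the proposal: a non-blocking stop() that only
// initiates shutdown, and a destroy() that awaits actual termination later.
public class EarlyShutdownSketch {
    private final ExecutorService executor = Executors.newFixedThreadPool(2);

    // Would back Lifecycle.stop(): initiate shutdown, do not block.
    public void stop() {
        executor.shutdown();
    }

    // Would back DisposableBean.destroy(): the pool is already winding
    // down, so this call only awaits the tasks still in flight.
    public void destroy() throws InterruptedException {
        executor.awaitTermination(5, TimeUnit.SECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        EarlyShutdownSketch bean = new EarlyShutdownSketch();
        bean.executor.submit(() -> { });
        bean.stop();    // non-blocking: pool starts shutting down
        bean.destroy(); // blocking: await actual termination
        System.out.println(bean.executor.isTerminated());
    }
}
```

With several such beans, each `stop()` returns immediately, so all pools shut down concurrently and the total wait is bounded by the slowest pool rather than the sum.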
Hello people. Any thoughts on this? Thanks.
@rstoyanchev, I do not want to add to the vast amount of issues in the Spring project. As it seems, the community creates issues faster than your capacity to analyze them.
Feel free to just close the issue.
Spring team, I am considering closing this issue, unless you want to keep it open. Did you have a chance to analyze this?
Here, the lengthy application shutdown remains.
Thanks.
If I work on the code and submit a PR for this, would it be considered?
In the context of Project CRaC support, we are revisiting our own lifecycle adapters, and we ran into the executor/scheduler implementations as well. It is somewhat more complicated with stop/restart having to be possible, so a shutdown on `stop()` would have to be followed by a recreation of the thread pool on `start()`, or we'd have to use a different trigger to stop the currently running tasks while keeping the thread pool alive.
We'll try to sort this out for our 6.1.
The challenge with the `Lifecycle` contract is that it serves several purposes, and even one more now with Checkpoint Restore. `ThreadPoolTaskExecutor` and in particular `ThreadPoolTaskScheduler` are not really ideal candidates for implementing `Lifecycle` since the underlying `ThreadPoolExecutor` has no stop or restart functionality, just shutdown. So whenever a `Lifecycle.stop()` call comes in but is later followed by `Lifecycle.start()`, we could only really reinitialize the `ThreadPoolExecutor` completely, losing all of its previously scheduled tasks... which we'd have to re-schedule somehow.
Up until this point, we treated `ThreadPoolTaskExecutor` and `ThreadPoolTaskScheduler` as general delegates which do not participate in lifecycle management on their own. Instead, whoever submits tasks to the executor/scheduler is responsible for pausing/cancelling/rescheduling them whenever necessary. For graceful shutdown purposes, such callers could implement `Lifecycle` themselves and stop their tasks early, so that the target executor does not have any actual tasks left whenever it eventually gets disposed. This is the approach that we applied ourselves in `DefaultMessageListenerContainer` and co.
For Project CRaC, it is not actually necessary to stop/restart thread pools since the active threads are a part of the checkpoint image anyway. So we currently only stop `DefaultMessageListenerContainer` and co, winding down their async workers but not the target thread pool that they are using. That said, it would not hurt to stop and restart thread pools themselves as well, which is what we're currently investigating. If such a thread pool restart turns out to be impossible, a regular `Lifecycle` implementation would not be appropriate since it would receive stop calls in restart scenarios as well, not just for graceful shutdown.
@jgslima what kind of tasks are typically left to await termination for in your scenario, and how did those tasks get submitted? Is this about late-submitted one-off tasks during the shutdown process, or rather about periodic tasks kicking in once more?
In terms of an early shutdown signal for executors and schedulers, it does not have to come from `Lifecycle`; it could also come from `ContextClosedEvent`, which arrives even before the lifecycle stop calls (and does not have any restart semantics attached to it). This would effectively run in parallel with `SmartLifecycle` stopping, awaiting actual termination in the destruction step then.
Experimenting with this a bit, we seem to be able to implement the general `Lifecycle` stop/start semantics with a pause/resume approach in `(Scheduled)ThreadPoolExecutor` subclasses (along the lines of the `PausableThreadPoolExecutor` suggested in the `ThreadPoolExecutor` javadoc), which would address our Checkpoint Restore requirements as well.
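For reference, the pause/resume idea mentioned above can be sketched following the `PausableThreadPoolExecutor` example in the `ThreadPoolExecutor` javadoc. The class and method names here are illustrative, not the actual Spring 6.1 implementation: `beforeExecute()` blocks while paused, so the pool keeps its threads and queued tasks alive across a stop/start cycle.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

public class PauseResumeDemo {

    // Hypothetical pausable pool, along the lines of the ThreadPoolExecutor
    // javadoc example: workers block in beforeExecute() while paused, so no
    // scheduled work is lost on a stop()/start() cycle.
    static class PausableExecutor extends ThreadPoolExecutor {
        private boolean paused;
        private final ReentrantLock pauseLock = new ReentrantLock();
        private final Condition unpaused = pauseLock.newCondition();

        PausableExecutor(int poolSize) {
            super(poolSize, poolSize, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        }

        @Override
        protected void beforeExecute(Thread t, Runnable r) {
            super.beforeExecute(t, r);
            pauseLock.lock();
            try {
                while (paused) {
                    unpaused.await();
                }
            } catch (InterruptedException ie) {
                t.interrupt();
            } finally {
                pauseLock.unlock();
            }
        }

        void pause() { // would back Lifecycle.stop()
            pauseLock.lock();
            try { paused = true; } finally { pauseLock.unlock(); }
        }

        void resume() { // would back Lifecycle.start()
            pauseLock.lock();
            try { paused = false; unpaused.signalAll(); } finally { pauseLock.unlock(); }
        }
    }

    public static void main(String[] args) throws Exception {
        PausableExecutor pool = new PausableExecutor(1);
        pool.pause();
        CountDownLatch done = new CountDownLatch(1);
        pool.execute(done::countDown);
        // While paused, the worker blocks in beforeExecute() and the task waits.
        boolean ranWhilePaused = done.await(200, TimeUnit.MILLISECONDS);
        pool.resume();
        boolean ranAfterResume = done.await(5, TimeUnit.SECONDS);
        pool.shutdown();
        System.out.println(ranWhilePaused + " " + ranAfterResume);
    }
}
```

Unlike a shutdown-based `stop()`, this keeps the pool and its queue intact, which is why it suits restart scenarios such as Checkpoint Restore.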
This combines nicely with a `ContextClosedEvent`-driven early shutdown signal, addressing graceful shutdown scenarios then (which are otherwise not covered by such a pause/resume approach for stop/start). For `ThreadPoolTaskExecutor` and `ThreadPoolTaskScheduler`, we should be able to provide an out-of-the-box solution for all those scenarios in 6.1.
Hello.
> It is somewhat more complicated with stop/restart having to be possible

Good point. I had not thought about it.
> it could also come from `ContextClosedEvent` which arrives even before the lifecycle stop calls.

OK, that seems fine. It also solves another point: if we really implemented `SmartLifecycle` attached to the same phase as the web servers, I do not know how we could guarantee the same phase without including a reference to the phase of `WebServerGracefulShutdownLifecycle` (currently `SmartLifecycle.DEFAULT_PHASE - 1024`), which is a Spring Boot class.
> Instead, whoever submits tasks to the executor/scheduler is responsible for pausing/cancelling/rescheduling them whenever necessary.
I understand.
> @jgslima what kind of tasks are typically left to await termination for in your scenario, and how did those tasks get submitted? Is this about late-submitted one-off tasks during the shutdown process, or rather about periodic tasks kicking in once more?
It is not about late-submitted tasks.
Some examples of thread pools used in some applications I work with:

- `KafkaTemplate.send()` callbacks (such callbacks cannot be done in the main Kafka producer thread, to avoid blocking it).
- `@EventListener` (in cases where the event publishing was configured to be always async).
- `@Scheduled`.
- `@Async`.

My little "obsession" with having the most efficient graceful shutdown possible is to avoid leaving database locks behind when some task is aborted abruptly. These create some headaches (MySQL by default only kills inactive sessions after 8 hours).
It is true that really long-running tasks (like a full batch process) should not be run with any of the means above. For instance, in Kubernetes, a long-running task should be submitted as a Job, which is not asked to stop before its conclusion.

However, in some of the cases above, the task may involve doing something potentially slow, like making a remote call. When this is the case, ideally we would like to configure `awaitTerminationMillis` with some actual seconds. But when you have so many executors being stopped sequentially, and only after the web server, it does not really help to configure many seconds for each (say, 6 or 8), because the sum would be greater than the default grace period of Kubernetes.
I would love to help with this change; if you want, please tell me.
Thanks
@jhoeller, since you will change `ThreadPoolTaskExecutor` (or some superclass), could you please make another change? Make `ThreadPoolTaskExecutor` expose the size of the internal `ThreadPoolExecutor`'s queue. Something like `int getTaskQueueSize()` that returns `threadPoolExecutor.getQueue().size()`.
We miss this in order to be able to provide a `Gauge` metric to monitor the queue size and identify undersized pools.
I think it is better to expose just the size and not the queue as a whole, to avoid direct interactions with it.
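The value the requested accessor would return can be shown with plain JDK types. This is a sketch, assuming direct access to the underlying `ThreadPoolExecutor`; the suggested `getTaskQueueSize()` method is hypothetical here.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class QueueSizeDemo {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch block = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(1);
        // Occupy the single worker so that further submissions queue up.
        pool.execute(() -> {
            started.countDown();
            try { block.await(); } catch (InterruptedException ignored) { }
        });
        started.await();
        pool.execute(() -> { });
        pool.execute(() -> { });
        // What a getTaskQueueSize() accessor would return: the task backlog,
        // which a metrics gauge could sample to detect undersized pools.
        System.out.println(pool.getQueue().size());
        block.countDown();
        pool.shutdown();
    }
}
```

A monitoring library would then sample this value periodically; a growing backlog indicates the pool cannot keep up with the submission rate.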
Thanks
@jgslima the queue size has recently been exposed as `getQueueSize()` already: #28583
For #24497, I'm introducing an `initiateShutdown()` method with always-non-blocking semantics, which is what we trigger on `ContextClosedEvent` now but which can also be called publicly.
@jgslima it would be great if you could give the upcoming `6.1.0-M2` release a try! It tries to address a wide range of requirements now and hopefully improves your graceful shutdown scenarios out of the box.
You bet, I will give it a try and get back to you.
@jhoeller, sorry for taking so long.
I tested it, and at first I did not really get it; then I saw the new code and understood.
I have 2 comments:

- The executor stops accepting new tasks as soon as the `ContextClosedEvent` is published, which seems good. We may have a side effect: `RejectedExecutionException`s being thrown to the components that keep submitting tasks to the executors, if these components themselves do not also react to the `ContextClosedEvent`. Ideally these task submitters should also react to the event. Right now I was not able to recap and experiment with some Spring components like the containers (JMS, Kafka, Rabbit, SQS), or the mechanisms used by `@Async` and `@Scheduled`. I am not sure if the change will cause new `ERROR` logs that may confuse people. In fact, I might have thought about this when I opened the issue in the first place, but I only realized it now. What do you think? I may dedicate more time here to analyze and/or test this.
- In the new `initiateShutdown()` method, it is clear that you wanted the scope of the method to be just stopping the acceptance of new tasks. But, if `waitForTasksToCompleteOnShutdown` is `false` (and it should be `false` in the majority of scenarios where we want a graceful shutdown, since the task queue may hold a large number of submitted tasks), at this point I would prefer to call `executor.shutdownNow()`, to halt the processing of waiting tasks right away. After all, we already know at this point that the context is closing... so for a graceful shutdown, I would not want any pending task to start execution after this point. But I can see that `ExecutorConfigurationSupport` attempts to actually cancel the pending tasks returned by `shutdownNow()`. At first, I am not sure how to properly handle this in the event listener.
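The difference between the two shutdown flavors discussed above can be demonstrated with plain JDK types; this sketch is illustrative, not Spring's event listener code. `shutdown()` lets already-queued tasks still run, while `shutdownNow()` interrupts the workers and drains the queue, returning the tasks that never started.

```java
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ShutdownNowDemo {
    public static void main(String[] args) throws Exception {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<>());
        CountDownLatch block = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(1);
        // Occupy the single worker so two more tasks pile up in the queue.
        pool.execute(() -> {
            started.countDown();
            try { block.await(); } catch (InterruptedException ignored) { }
        });
        started.await();
        pool.execute(() -> { });
        pool.execute(() -> { });
        // shutdown() would still execute both queued tasks; shutdownNow()
        // interrupts the running worker and returns the tasks that were
        // awaiting execution and will never start.
        List<Runnable> neverStarted = pool.shutdownNow();
        System.out.println(neverStarted.size());
    }
}
```

This is why, with `waitForTasksToCompleteOnShutdown=false` and a large backlog, `shutdownNow()` is the call that actually prevents pending tasks from starting after the context begins closing.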
Please tell me if I can help somehow.
My mistake. The containers, like you said, use executors which are not context beans. But this is not the case for `@Scheduled` and `@Async`. I will experiment further.
Applications that need to perform graceful shutdown of tasks submitted to `ThreadPoolTaskExecutor` and/or `ThreadPoolTaskScheduler` may make use of the setting `awaitTerminationMillis` (and possibly `waitForTasksToCompleteOnShutdown`).

However, the application may take a very long time to actually finish if these conditions apply:

- use of both `ThreadPoolTaskExecutor` and `ThreadPoolTaskScheduler` (and possibly multiple `ThreadPoolTaskExecutor`s).
- `SmartLifecycle` beans which implement lengthy stopping.

Real examples are web applications that use such components and make use of the web container "graceful shutdown" feature of Spring Boot. The overall termination sequence of such an application is:

1. `SmartLifecycle` asynchronous stopping triggers the web container graceful shutdown.
2. `SmartLifecycle` asynchronous stopping blocks and waits for the web container shutdown.
3. `DisposableBean`/`@PreDestroy` methods run, so, say:
   - `ThreadPoolTaskExecutor`'s `destroy()` is called, blocking and awaiting the tasks to finish. If the application uses multiple `ThreadPoolTaskExecutor`s, an awaiting occurs for each one of them.
   - `ThreadPoolTaskScheduler`'s `destroy()` is called, blocking and awaiting the tasks to finish.

The proposal here is to create some way to make all the pools finish their tasks in parallel. Ideally, in parallel with the stopping of other `SmartLifecycle` beans.

In Kubernetes Spring web applications that fit the scenario above, the total time taken by a Pod to finish ends up being too high (which is aggravated by the need to configure a preStop hook to give kube-proxy time to notice the Pod deletion). This has real effects: since, rightly, in a rollout Kubernetes does not wait for old Pods to actually finish before creating new ones, applications with a large number of Pods and a long termination time end up having a large number of Pods actually running during the rollout (new Pods plus many still being terminated). We have seen this trigger a cluster autoscale so that the cluster could handle the large number of Pods present during the rollout.