linagora / james-project

Mirror of Apache James Project
Apache License 2.0
70 stars 63 forks source link

SMTP gracefull shutdown is FFFFFF up #5154

Closed chibenwa closed 3 weeks ago

chibenwa commented 3 months ago
org.apache.james.protocols.netty.BasicChannelInboundHandler Unable to process request
java.util.concurrent.RejectedExecutionException: event executor terminated
    at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:934)
    at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:351)
    at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:344)
    at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:836)
    at io.netty.util.concurrent.SingleThreadEventExecutor.execute0(SingleThreadEventExecutor.java:827)
    at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:817)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelReadComplete(AbstractChannelHandlerContext.java:469)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelReadComplete(AbstractChannelHandlerContext.java:456)
    at io.netty.handler.codec.ByteToMessageDecoder.channelReadComplete(ByteToMessageDecoder.java:358)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelReadComplete(AbstractChannelHandlerContext.java:486)
    at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:61)
    at io.netty.channel.AbstractChannelHandlerContext$Tasks$1.run(AbstractChannelHandlerContext.java:1282)
    at io.netty.util.concurrent.AbstractEventExecutor.runTask(AbstractEventExecutor.java:173)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:166)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasksFrom(SingleThreadEventExecutor.java:426)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:375)
    at io.netty.util.concurrent.SingleThreadEventExecutor.confirmShutdown(SingleThreadEventExecutor.java:763)
    at io.netty.util.concurrent.DefaultEventExecutor.run(DefaultEventExecutor.java:70)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Unknown Source)

Caused by https://github.com/apache/james-project/pull/2167

To be fair I am unsure what to do:

quantranhong1999 commented 1 month ago

It is not impossible that ignoring such exceptions might be OK

This java.util.concurrent.RejectedExecutionException: event executor terminated is acceptable IMO as it was likely a race condition during the graceful shutdown process.

From ChatGPT (of course, we should not believe it 100% but I think it makes sense for our case):

The `RejectedExecutionException` during the shutdown process is a common issue when tasks are submitted after the event executors have begun shutting down.

If there are multiple threads involved in your application and some tasks are submitted concurrently with the shutdown process, it's possible for a task to be submitted after shutdownGracefully() has been called but before the event executor has completed its shutdown sequence.

Also the EventExecutorGroup.shutdownGracefully() has a timeout to shutdown (https://netty.io/4.1/api/io/netty/util/concurrent/EventExecutorGroup.html#shutdownGracefully-long-long-java.util.concurrent.TimeUnit-):

     * @param timeout     the maximum amount of time to wait until the executor is {@linkplain #shutdown()}
     *                    regardless if a task was submitted during the quiet period

IMO the EventExecutorGroup received a few more tasks during the quiet period, it tried to process them but when the timeout is reached, the EventExecutorGroup is forced to be shutdown therefore the remaining requests are rejected with the above error because the EventExecutorGroup is not running anymore.

For short: likely the Netty executor can not be more graceful (after a timeout) and decided to reject the pending request. No big deal.

quantranhong1999 commented 1 month ago

To be fair I am unsure what to do: if the eventLoop is shut down first then the executor cannot send replies back to the client. if the executor is shut down first then the event loop might receive network requests it cannot handle.

I can not search for the best practices for this case (not many materials available).

IMO shutting the executor first would be better: the event loop might receive network requests it cannot handle -> the client is aware of the failure (connection close?) and can retry later. This sounds fine to me.

If we shut down the event loop first: the executor cannot send replies back to the client -> the client is not aware of the actual result and may attempt a retry (even if the executor finished the request), then the client may send 2 duplicated mails for example. This sounds misleading to me.

WDYT?

chibenwa commented 4 weeks ago

IMO shutting the executor first would be better

This is what is done today BTW

Thanks for getting a look.

Maybe we could...

quantranhong1999 commented 4 weeks ago

This is what is done today BTW

Actually, I found that we are shutting down the event loop first. https://github.com/apache/james-project/pull/2313

quantranhong1999 commented 4 weeks ago

or even upon shutdown have BasicChannelInboundHandler silent the exception (in which case BasicChannelInboundHandler needs to be aware of the shutdown).

Please if this is what you want ^^. Actually that throw exception behavior is configurable.