linagora / james-project

Mirror of Apache James Project
Apache License 2.0

[PROD] [CUSTOMER] Healthcheck for IMAP #5245

Closed chibenwa closed 2 days ago

chibenwa commented 3 weeks ago

Why?

reactor.core.Exceptions$ErrorCallbackNotImplemented: org.apache.james.imapserver.netty.ReactiveThrottler$RejectedException: The IMAP server has reached its maximum capacity (concurrent requests: 200, queue size: 4096)
Caused by: org.apache.james.imapserver.netty.ReactiveThrottler$RejectedException: The IMAP server has reached its maximum capacity (concurrent requests: 200, queue size: 4096)
    at org.apache.james.imapserver.netty.ReactiveThrottler.throttle(ReactiveThrottler.java:81)
    at org.apache.james.imapserver.netty.ImapChannelUpstreamHandler.channelRead(ImapChannelUpstreamHandler.java:373)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1475)
    at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1338)
    at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1387)
    at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:529)
    at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:468)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:290)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:93)
    at org.apache.james.imapserver.netty.HAProxyMessageHandler.channelRead(HAProxyMessageHandler.java:85)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:286)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
    at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
    at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at java.base/java.lang.Thread.run(Unknown Source)

One pod was stuck like this... so traffic was partially degraded (1 pod failing). It does not happen often at all (about once every 3 months).

While we should of course seriously investigate the root cause ( https://github.com/linagora/james-project/issues/5246 !), the topic is complex and I would like an operational workaround in the meantime...

The idea: add a healthcheck that detects this condition and that we could aggregate into the liveness probe (cf. https://github.com/linagora/james-project/issues/5244) until we actually fix the issue!

What

Add a healthcheck that, for all IMAP servers, ensures the reactive throttlers are not full.

Not full -> OK

one full -> degraded
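The proposed check could be sketched roughly as follows. Note this is an illustrative stand-in, not existing James code: `ThrottlerView`, `isFull()` and the class name are hypothetical, and a real implementation would plug into James' healthcheck contract instead of a bare enum.

```java
import java.util.List;

// Hypothetical sketch of the proposed IMAP healthcheck.
// ThrottlerView and isFull() are stand-ins for whatever saturation
// state the real ReactiveThrottler would expose.
public class ImapHealthCheck {
    // Minimal stand-in: true when concurrent requests + queue are saturated.
    interface ThrottlerView {
        boolean isFull();
    }

    enum Status { HEALTHY, DEGRADED }

    private final List<ThrottlerView> throttlers;

    ImapHealthCheck(List<ThrottlerView> throttlers) {
        this.throttlers = throttlers;
    }

    // Not full -> OK; at least one full -> degraded, per the issue description.
    Status check() {
        boolean anyFull = throttlers.stream().anyMatch(ThrottlerView::isFull);
        return anyFull ? Status.DEGRADED : Status.HEALTHY;
    }
}
```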

quantranhong1999 commented 3 weeks ago

one full -> degraded

FYI, k8s considers a pod unhealthy when it receives a response code >= 400.

James healthcheck response codes:

200: All checks have answered with a Healthy or Degraded status. James services can still be used.
503: At least one check has answered with an Unhealthy status.

Degraded with a 200 code won't trigger a k8s pod restart.

Should we then return Unhealthy instead? Not sure; it may be a bit harsh. Anyway, Docker and k8s liveness checks allow a number of failures (failureThreshold defaults to 3) before restarting, so a bit more resilience under actual high IMAP load may be acceptable.
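For reference, a k8s liveness probe tolerating a few consecutive failures might look like this (the path, port, and timings are illustrative, not the actual deployment config):

```yaml
livenessProbe:
  httpGet:
    path: /healthcheck   # illustrative; point at the actual James webadmin endpoint
    port: 8000
  periodSeconds: 10
  failureThreshold: 3    # pod is restarted only after 3 consecutive failed probes
```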

chibenwa commented 3 weeks ago

We could add a flag as a query parameter to consider Degraded as failed, e.g.:

GET 127.0.0.1:8000/healthcheck/checks/ImapCheck?strict

would return a 503 response code for both Unhealthy and Degraded,

while

GET 127.0.0.1:8000/healthcheck/checks/ImapCheck

would return 503 when Unhealthy and 200 for Degraded.

We would need to implement GET 127.0.0.1:8000/healthcheck?strict too.
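The proposed mapping could be sketched as a small helper (hypothetical code: the `strict` flag handling and the names below are illustrative, not existing James APIs):

```java
// Illustrative sketch of the proposed ?strict response-code mapping.
public class HealthCheckStatusMapper {
    enum ResultStatus { HEALTHY, DEGRADED, UNHEALTHY }

    // Default: 200 for Healthy/Degraded, 503 for Unhealthy.
    // With ?strict: Degraded is also reported as 503.
    static int toHttpStatus(ResultStatus status, boolean strict) {
        switch (status) {
            case HEALTHY:
                return 200;
            case DEGRADED:
                return strict ? 503 : 200;
            default: // UNHEALTHY
                return 503;
        }
    }
}
```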

Would this solve your concern, @quantranhong1999? We would get the best of both worlds...

Maybe this should be a separate issue? Do you want to open it, @quantranhong1999?

quantranhong1999 commented 3 weeks ago

Maybe this should be a separate issue? Do you want to open it, @quantranhong1999?

https://github.com/linagora/james-project/issues/5249

hungphan227 commented 6 days ago

PR: https://github.com/apache/james-project/pull/2401