linagora / james-project

Mirror of Apache James Project
Apache License 2.0
70 stars 63 forks source link

Monitor count of S3 HTTP connection #5178

Closed chibenwa closed 1 month ago

chibenwa commented 2 months ago

A common cause of issue regarding the HTTP client is:

Caused by: java.lang.Throwable: Acquire operation took longer than the configured maximum time. This indicates that a request cannot get a connection from the pool within the specified maximum time. This can be due to high request rate.
Consider taking any of the following actions to mitigate the issue: increase max connections, increase acquire timeout, or slowing the request rate.
Increasing the max connections can increase client throughput (unless the network interface is already fully utilized), but can eventually start to hit operation system limitations on the number of file descriptors used by the process. If you already are fully utilizing your network interface or cannot further increase your connection count, increasing the acquire timeout gives extra time for requests to acquire a connection before timing out. If the connections doesn't free up, the subsequent requests will still timeout.
If the above mechanisms are not able to fix the issue, try smoothing out your requests so that large traffic bursts cannot overload the client, being more efficient with the number of times you need to call AWS, or by increasing the number of hosts sending requests.
at software.amazon.awssdk.http.nio.netty.internal.utils.NettyUtils.decorateException (NettyUtils.java:69 )
at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.handleFailure (NettyRequestExecutor.java:307 )
at software.amazon.awssdk.http.nio.netty.internal.NettyRequestExecutor.makeRequestListener (NettyRequestExecutor.java:188 )
at io.netty.util.concurrent.DefaultPromise.notifyListener0 (DefaultPromise.java:590 )
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow (DefaultPromise.java:557 )
at io.netty.util.concurrent.DefaultPromise.access $200(DefaultPromise.java:35 )
at io.netty.util.concurrent.DefaultPromise $1.run(DefaultPromise.java:503 )
at io.netty.util.concurrent.AbstractEventExecutor.runTask (AbstractEventExecutor.java:173 )
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute (AbstractEventExecutor.java:166 )
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks (SingleThreadEventExecutor.java:470 )
at io.netty.channel.nio.NioEventLoop.run (NioEventLoop.java:569 )
at io.netty.util.concurrent.SingleThreadEventExecutor $4.run(SingleThreadEventExecutor.java:997 )
at io.netty.util.internal.ThreadExecutorMap $2.run(ThreadExecutorMap.java:74 )
... 1 more
Caused by: java.util.concurrent.TimeoutException: Acquire operation took longer than 10000 milliseconds.
at software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.timeoutAcquire (HealthCheckedChannelPool.java:77 )
at software.amazon.awssdk.http.nio.netty.internal.HealthCheckedChannelPool.lambda $acquire$0(HealthCheckedChannelPool.java:67 )
at io.netty.util.concurrent.PromiseTask.runTask (PromiseTask.java:98 )
at io.netty.util.concurrent.ScheduledFutureTask.run (ScheduledFutureTask.java:153 )
... 7 more

(occured today before lunch on tmail.linagora.com)

Add the following metrics:

-> Pending count of JMAP downloads (as those uses a S3 connection)

-> Find a way to expose the count of S3 connection in use.

This would enable to:

quantranhong1999 commented 2 months ago

Could you share the (blob store) metrics as well?

S3 connections may be slow. (OVH network issue recently)

chibenwa commented 2 months ago

Slow with 1000 // connection is not that much of an issue TBH.

But here you are nonetheless

blobStore:readBytes{quantile="0.5",} 0.039321599000000006
blobStore:readBytes{quantile="0.75",} 0.067633151
blobStore:readBytes{quantile="0.95",} 0.125304831
blobStore:readBytes{quantile="0.98",} 0.15833497500000002
blobStore:readBytes{quantile="0.99",} 0.18454937500000002
blobStore:readBytes{quantile="0.999",} 0.329252863
blobStore:readBytes_count 49737.0
# HELP blobStore:delete Generated from Dropwizard metric import (metric=blobStore:delete, type=com.codahale.metrics.Timer)
# TYPE blobStore:delete summary
blobStore:delete{quantile="0.5",} 2.9055000000000002E-5
blobStore:delete{quantile="0.75",} 3.9679000000000004E-5
blobStore:delete{quantile="0.95",} 5.7087000000000004E-5
blobStore:delete{quantile="0.98",} 9.574300000000001E-5
blobStore:delete{quantile="0.99",} 9.7279E-5
blobStore:delete{quantile="0.999",} 1.02911E-4
blobStore:delete_count 131.0
# HELP blobStore:saveInputStream Generated from Dropwizard metric import (metric=blobStore:saveInputStream, type=com.codahale.metrics.Timer)
# TYPE blobStore:saveInputStream summary
blobStore:saveInputStream{quantile="0.5",} 0.134217727
blobStore:saveInputStream{quantile="0.75",} 0.18245222300000002
blobStore:saveInputStream{quantile="0.95",} 0.570425343
blobStore:saveInputStream{quantile="0.98",} 0.6417285110000001
blobStore:saveInputStream{quantile="0.99",} 0.7381975030000001
blobStore:saveInputStream{quantile="0.999",} 2.768240639
blobStore:saveInputStream_count 1207.0
# HELP blobStoreCacheMisses_total Generated from Dropwizard metric import (metric=blobStoreCacheMisses, type=com.codahale.metrics.Meter)
# TYPE blobStoreCacheMisses_total counter
blobStoreCacheMisses_total 603.0
# HELP deletedMessageVault:blobStore:append Generated from Dropwizard metric import (metric=deletedMessageVault:blobStore:append, type=com.codahale.metrics.Timer)
1# TYPE deletedMessageVault:blobStore:append summary
0deletedMessageVault:blobStore:append{quantile="0.5",} 0.17616076700000002
0deletedMessageVault:blobStore:append{quantile="0.75",} 0.22334668700000002
 deletedMessageVault:blobStore:append{quantile="0.95",} 0.5620367350000001
8deletedMessageVault:blobStore:append{quantile="0.98",} 2.785017855
8deletedMessageVault:blobStore:append{quantile="0.99",} 3.170893823
1deletedMessageVault:blobStore:append{quantile="0.999",} 3.170893823
8deletedMessageVault:blobStore:append_count 53.0
8    0 # HELP blobStore:saveBytes Generated from Dropwizard metric import (metric=blobStore:saveBytes, type=com.codahale.metrics.Timer)
8# TYPE blobStore:saveBytes summary
8blobStore:saveBytes{quantile="0.5",} 0.161480703
1blobStore:saveBytes{quantile="0.75",} 0.22229811100000002
8blobStore:saveBytes{quantile="0.95",} 0.6081740790000001
8blobStore:saveBytes{quantile="0.98",} 0.7130316790000001
 blobStore:saveBytes{quantile="0.99",} 0.9437183990000001
 blobStore:saveBytes{quantile="0.999",} 1.677721599
 blobStore:saveBytes_count 815.0
 # HELP blobStore:read Generated from Dropwizard metric import (metric=blobStore:read, type=com.codahale.metrics.Timer)
0# TYPE blobStore:read summary
 blobStore:read{quantile="0.5",} 0.025952255
 blobStore:read{quantile="0.75",} 0.044564479000000004
 blobStore:read{quantile="0.95",} 0.08231321500000001
 blobStore:read{quantile="0.98",} 0.11167334300000001
 blobStore:read{quantile="0.99",} 0.146800639
0blobStore:read{quantile="0.999",} 0.185597951
 blobStore:read_count 677.0
 # HELP blobStoreCacheLatency Generated from Dropwizard metric import (metric=blobStoreCacheLatency, type=com.codahale.metrics.Timer)
1# TYPE blobStoreCacheLatency summary
9blobStoreCacheLatency{quantile="0.5",} 0.038797311
6blobStoreCacheLatency{quantile="0.75",} 0.063438847
3blobStoreCacheLatency{quantile="0.95",} 0.11481907100000001
kblobStoreCacheLatency{quantile="0.98",} 0.13841203100000002
 blobStoreCacheLatency{quantile="0.99",} 0.161480703
 blobStoreCacheLatency{quantile="0.999",} 0.212860927
 blobStoreCacheLatency_count 47788.0
   0 --# HELP blobStoreBackEndLatency Generated from Dropwizard metric import (metric=blobStoreBackEndLatency, type=com.codahale.metrics.Timer)
:# TYPE blobStoreBackEndLatency summary
-blobStoreBackEndLatency{quantile="0.5",} 0.033030143000000005
-blobStoreBackEndLatency{quantile="0.75",} 0.05845811100000001
:blobStoreBackEndLatency{quantile="0.95",} 0.11691622300000001
-blobStoreBackEndLatency{quantile="0.98",} 0.175112191
-blobStoreBackEndLatency{quantile="0.99",} 0.224395263
 blobStoreBackEndLatency{quantile="0.999",} 1.6525557750000002
-blobStoreBackEndLatency_count 5697.0
-# HELP blobStoreCacheHits_total Generated from Dropwizard metric import (metric=blobStoreCacheHits, type=com.codahale.metrics.Meter)
:# TYPE blobStoreCacheHits_total counter
-blobStoreCacheHits_total 43627.
quantranhong1999 commented 2 months ago

WIP: https://github.com/apache/james-project/pull/2225