linagora / tmail-backend

GNU Affero General Public License v3.0
39 stars 22 forks source link

Cassandra DriverTimeoutException when sending a mail with cassandra mailqueue view enabled #813

Closed quantranhong1999 closed 1 year ago

quantranhong1999 commented 1 year ago

WHY

Env: TMail Release 0.7.1-rc1 on preprod When sending a mail with Cassandra mail queue view enabled, we have the errors:

From CNB (likely with mailqueue view disabled), error upon EmailSubmission/set: Image

From TMail preprod (with mail queue view enabled), error upon mail processing:

{
    "timestamp": "2023-09-12T01:27:58.056Z",
    "level": "WARN",
    "thread": "s1-timer-0",
    "logger": "org.apache.james.blob.cassandra.cache.CassandraBlobStoreCache",
    "message": "Fail reading blob store cache",
    "context": "default",
    "exception": "com.datastax.oss.driver.api.core.DriverTimeoutException: Query timed out after PT0.5S\n\tat com.datastax.oss.driver.internal.core.cql.CqlRequestHandler.lambda$scheduleTimeout$1(CqlRequestHandler.java:207)\n\tat io.netty.util.HashedWheelTimer$HashedWheelTimeout.run(HashedWheelTimer.java:715)\n\tat io.netty.util.concurrent.ImmediateExecutor.execute(ImmediateExecutor.java:34)\n\tat io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:703)\n\tat io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:790)\n\tat io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:503)\n\tat io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)\n\tat java.base/java.lang.Thread.run(Unknown Source)\n"
}

Seems something is blocking on the Cassandra driver thread, upon the mailqueue view being enabled.

DoD

Investigate and fix the issue.

Notes

Mean while, creating identity is fine (reported by Tung).

quantranhong1999 commented 1 year ago

I personally could not reproduce the EmailSubmission/set error on TMail preprod env, but see the error logs DriverTimeoutException upon mail processing, leading to mail is not actually sent (despite EmailSubmission/set succeeded).

My very first theory: something could be blocking in CachedBlobStore on the Cassandra driver thread which leads to DriverTimeoutException. There seems a lot of InputStream usage (which is blocking), I suspect some of InputStream usage was not correctly subscribed on the BLOCKING_CALL_WRAPPER thread.

TODO: try to debug in unit tests to see.

quantranhong1999 commented 1 year ago

Nothing suspicious with CachedBlobStoreTest.

-> I will try to put all Cassandra queries being executed on Schedulers.parallel() thread and performance test now to see if any obvious .block etc is being used.

chibenwa commented 1 year ago

-> I will try to put all Cassandra queries being executed on Schedulers.parallel() thread and performance test now to see if any obvious .block etc is being used.

FYI I opened https://datastax-oss.atlassian.net/jira/software/c/projects/JAVA/issues/JAVA-3121 to be able to customize the driver thread factory. It would enable to extend the NonBlocking interface from the Reactor project and thus enable rejection of blocking calls made on the Cassandra driver thread (with a bit of tunning).

No answer so far...

quantranhong1999 commented 1 year ago

So seems no obvious .block / .blockLast ... are used. Now I will use BlockHound to scan more tiny "heavy computing" code around CassandraBlobStoreCache / CassandraMessageFastViewProjection / EmailSubmissionSetMethodContract / IdentitySetContract.

quantranhong1999 commented 1 year ago

Well at least I spotted a blocking call on CachedBlobStore's cassandra thread: https://github.com/apache/james-project/pull/1725

I am a bit surprised because the UUID.randomUUID() has a part that blocks as well, while we seem to use it a lot on either JMAP netty threads or Cassandra thread.

  Caused by: reactor.blockhound.BlockingOperationError: Blocking call! java.io.FileInputStream#readBytes
    at java.base/java.io.FileInputStream.readBytes(FileInputStream.java)
    at java.base/java.io.FileInputStream.read(FileInputStream.java:276)
    at java.base/java.io.FilterInputStream.read(FilterInputStream.java:132)
    at java.base/sun.security.provider.NativePRNG$RandomIO.readFully(NativePRNG.java:425)
    at java.base/sun.security.provider.NativePRNG$RandomIO.ensureBufferValid(NativePRNG.java:528)
    at java.base/sun.security.provider.NativePRNG$RandomIO.implNextBytes(NativePRNG.java:547)
    at java.base/sun.security.provider.NativePRNG.engineNextBytes(NativePRNG.java:221)
    at java.base/java.security.SecureRandom.nextBytes(SecureRandom.java:758)
    at java.base/java.util.UUID.randomUUID(UUID.java:151)
    at org.apache.james.events.Event$EventId.random(Event.java:37)
    at org.apache.james.mailbox.store.event.EventFactory$RequireEventId.randomEventId(EventFactory.java:67)

I am not sure it is worth to put the boundedElastic thread just for UUID.randomUUID(), should be fast enough IMO?

chibenwa commented 1 year ago

Maybe it is a mistake but I chose to ignore that one.

Yes it is a blocking call, but it is shitty damn super fast. Fast enough to get a neglictible impact. (exercise: measure it with a JMH mini benchmark :-p see if you can make me change my mind)

I did not manage to come up with a better architecture for generating UIDs and random numbers that performing these blocking calls.

So I admit I did close my eyes until the community puts togeter something like of a rector-random library...