linagora / tmail-backend

GNU Affero General Public License v3.0
30 stars 17 forks source link

PGSQL Update James SHA-1 #1035

Closed Arsnael closed 2 weeks ago

Arsnael commented 1 month ago

https://james-jenkins.lin-saas.com/job/Tmail%20build/job/PR-1035/2/testReport/junit/com.linagora.tmail.deployment/PostgresImapAndSmtpTest/validateDeployment/

Investigating... Latest work is abot pool connection. Seems to alter the server responses order somehow on the deployment tests.

Arsnael commented 1 month ago

Investigating... Latest work is abot pool connection. Seems to alter the server responses order somehow on the deployment tests.

I dont get why on imap mpt tests there is some tests with flag updates on multiple messages and it's fine, while deployment tests somehow the response order is altered.

Does not happen before the migration to pg pool connection. @hungphan227 any ideas perhaps?

hungphan227 commented 1 month ago

I have reverted james sub module back to a commit before pool:

Screenshot from 2024-05-21 17-54-10

Then ran the test, it still fail:


org.apache.james.mpt.protocol.ProtocolSession$InvalidServerResponseException: 
Location: /org/apache/james/imap/scripts/ValidateDeployment.test:299
LastClientMsg: A15 STORE 1:2  +FLAGS (\FLAGGED)
Expected: '\* 1 FETCH \(FLAGS \(\\Flagged \\Recent\)\)'
Actual   : '* 2 FETCH (FLAGS (\Flagged \Recent))'
chibenwa commented 1 month ago

@hungphan227 this looks interesting....

Arsnael commented 1 month ago

I suggest you use git bisect to find out the one commit that introduced the bug...

Actually did... For the commit that commented Hung, it was working for me somehow. The work on the pool was my bisect bad.

@hungphan227 the deployment tests are also based on the local image built with jib during compilation. Just to be sure, when you reverted, did you recompile the all project and rebuilt the image with jib before relaunching the test? If not you likely retested on this work.

Does only the PGSQL backend fails or all backend fails?

Only pgsql

chibenwa commented 1 month ago

Do we have a way to reproduce the failure in James code base?

This would make it way easier to treat and manage the issue...

Arsnael commented 1 month ago

Do we have a way to reproduce the failure in James code base?

As said I tried using the similar scripts with mpt imap tests. In PostgresSelectedState, based on SelectedState test, some tests rely on the Store.test script, that has a store operation a bit similar. But it is green.

Tried as well tinkering the script a bit storing flags in more messages (like storing the Answered flag from message 1:4) but still green...

Thought maybe then it's TMail? But The binding of the pool is done in PostgresCommonModule that is bound already in tmailPostgresServer as well...

Arsnael commented 1 month ago

fix => https://github.com/apache/james-project/pull/2259

Was painful to do debugging by add logs everywhere in that part of the code, rebuilding the app and running the tests checking the logs... But I pinned it down to that line. concatMap instead of flatMap at that line ensure we keep the right uid order for the fetch responses of the store imap command

Arsnael commented 1 month ago

Fix merged, let's see if the CI now is happy

Arsnael commented 1 month ago

Can reproduce locally. Some things are wrong with postgres integration tests on tmail now

hungphan227 commented 1 month ago

I have fixed some tests in TeamMailboxesContract. PR: https://github.com/apache/james-project/pull/2267 However, there is still one test I could not fix: webSocketShouldPushTeamMailboxStateChanges

Arsnael commented 1 month ago

I have fixed some tests in TeamMailboxesContract. PR: https://github.com/apache/james-project/pull/2267

Are you sure this is not some isolation issue in the tests instead?

hungphan227 commented 1 month ago

I have fixed some tests in TeamMailboxesContract. PR: apache/james-project#2267

Are you sure this is not some isolation issue in the tests instead?

"Isolation issue" -> what do you mean?

Arsnael commented 1 month ago

"Isolation issue" -> what do you mean?

That I think you get some key violation constraints because of entries already present and not cleaned up well from other tests.

hungphan227 commented 1 month ago

"Isolation issue" -> what do you mean?

That I think you get some key violation constraints because of entries already present and not cleaned up well from other tests.

That is not possible. We always drop all tables after each test

Arsnael commented 1 month ago

That is not possible. We always drop all tables after each test

Last commit, works for me

EDIT: it does not fix issues with websocket though, that's an other issue

quantranhong1999 commented 1 month ago

I have fixed some tests in TeamMailboxesContract. PR: https://github.com/apache/james-project/pull/2267

cc @hungphan227 can you check if it is the case for label change (TMail side)?

hungphan227 commented 1 month ago

I have fixed some tests in TeamMailboxesContract. PR: apache/james-project#2267

cc @hungphan227 can you check if it is the case for label change (TMail side)?

If Rene commit works, my PR is not needed. However, I did not see any real difference made in Rene commit. Really surprised if it actually work

quantranhong1999 commented 1 month ago

If Rene commit works, my PR is not needed.

Is upsert supported by the Cassandra implementation?

hungphan227 commented 1 month ago

If Rene commit works, my PR is not needed.

Is upsert supported by the Cassandra implementation?

Yes

quantranhong1999 commented 1 month ago

Yes

Then we still need your upsert fix IMO.

Arsnael commented 1 month ago

Ok with that

hungphan227 commented 1 month ago

Yes

Then we still need your upsert fix IMO.

Does Cassandra impl supporting upsert mean that postgres impl should support too?

Arsnael commented 1 month ago

All implems should have similar behavior on this kind of operation

Arsnael commented 1 month ago

Very unstable...

Sometimes some tests pass and sometimes not. My guess is there is a slight chance maybe with the email and mailbox change that there is two requests on different pool threads being executed in parallel with the same state, thus why sometimes postgres throws an exception on duplicated primary key entry?

A wild guess... Hard to debug honestly, as most of the time when I try to play a test alone it's green u_u

hungphan227 commented 1 month ago

Very unstable...

Sometimes some tests pass and sometimes not. My guess is there is a slight chance maybe with the email and mailbox change that there is two requests on different pool threads being executed in parallel with the same state, thus why sometimes postgres throws an exception on duplicated primary key entry?

A wild guess... Hard to debug honestly, as most of the time when I try to play a test alone it's green u_u

Which test? Websocket?

Arsnael commented 1 month ago

Not always the same honestly. Once I even got a green build, to tell you how unstable it is

quantranhong1999 commented 1 month ago

Not always the same honestly. Once I even got a green build, to tell you how unstable it is

Hi. What is the name of the unstable test?

Arsnael commented 1 month ago

You launch the postgres-jmap-integration-tests suite and you will see locally.

Tests failing recurrently for me are :

But for example on my run now it seems it gets stuck on contactShouldBeIndexedWhenMailing in LinagoraContactAutocompleteMethodContract

I saw other tests sometimes red... and those tests sometimes being green.

So enjoy :)

quantranhong1999 commented 1 month ago

FYI all tmail postgres-jmap-integration-tests passed for me, except webSocketShouldPushTeamMailboxStateChanges.

Error log:

16:09:28.421 [ERROR] o.a.j.e.d.InVmEventDelivery - Error while processing listener
reactor.core.publisher.Sinks$EmissionException: Spec. Rule 1.3 - onSubscribe, onNext, onError and onComplete signaled to a Subscriber MUST be signaled serially.
    at reactor.core.publisher.InternalManySink.emitNext(InternalManySink.java:56)
    at org.apache.james.jmap.change.StateChangeListener.$anonfun$reactiveEvent$2(StateChangeListener.scala:37)
    at org.apache.james.jmap.change.StateChangeListener.$anonfun$reactiveEvent$2$adapted(StateChangeListener.scala:37)
    at scala.Option.foreach(Option.scala:437)
    at org.apache.james.jmap.change.StateChangeListener.$anonfun$reactiveEvent$1(StateChangeListener.scala:37)
    at org.apache.james.jmap.change.StateChangeListener.$anonfun$reactiveEvent$1$adapted(StateChangeListener.scala:35)
    at reactor.core.publisher.MonoCallable$MonoCallableSubscription.request(MonoCallable.java:137)
    at reactor.core.publisher.MonoIgnoreElements$IgnoreElementsSubscriber.onSubscribe(MonoIgnoreElements.java:72)
    at reactor.core.publisher.MonoCallable.subscribe(MonoCallable.java:48)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4512)
    at reactor.core.publisher.MonoIgnorePublisher.subscribe(MonoIgnorePublisher.java:57)
    at reactor.core.publisher.MonoUsing.subscribe(MonoUsing.java:109)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4512)
    at reactor.core.publisher.FluxFlatMap$FlatMapMain.onNext(FluxFlatMap.java:430)
    at reactor.core.publisher.FluxIterable$IterableSubscription.slowPath(FluxIterable.java:335)
    at reactor.core.publisher.FluxIterable$IterableSubscription.request(FluxIterable.java:294)
    at reactor.core.publisher.FluxFlatMap$FlatMapMain.onSubscribe(FluxFlatMap.java:373)
    at reactor.core.publisher.FluxIterable.subscribe(FluxIterable.java:201)
    at reactor.core.publisher.FluxIterable.subscribe(FluxIterable.java:83)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4512)
    at reactor.core.publisher.FluxFlatMap$FlatMapMain.onNext(FluxFlatMap.java:430)
    at reactor.core.publisher.FluxArray$ArraySubscription.slowPath(FluxArray.java:126)
    at reactor.core.publisher.FluxArray$ArraySubscription.request(FluxArray.java:99)
    at reactor.core.publisher.FluxFlatMap$FlatMapMain.onSubscribe(FluxFlatMap.java:373)
    at reactor.core.publisher.FluxMerge.subscribe(FluxMerge.java:73)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:76)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:241)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:204)
    at reactor.core.publisher.MonoUsing$MonoUsingSubscriber.onComplete(MonoUsing.java:283)
    at reactor.core.publisher.MonoUsingWhen$MonoUsingWhenSubscriber.deferredComplete(MonoUsingWhen.java:271)
    at reactor.core.publisher.FluxUsingWhen$CommitInner.onComplete(FluxUsingWhen.java:528)
    at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:260)
    at reactor.core.publisher.Operators$MultiSubscriptionSubscriber.onComplete(Operators.java:2231)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:210)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:210)
    at reactor.pool.SimpleDequePool.maybeRecycleAndDrain(SimpleDequePool.java:537)
    at reactor.pool.SimpleDequePool$QueuePoolRecyclerInner.onComplete(SimpleDequePool.java:767)
    at reactor.core.publisher.Operators.complete(Operators.java:137)
    at reactor.core.publisher.MonoEmpty.subscribe(MonoEmpty.java:46)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4512)
    at reactor.pool.SimpleDequePool$QueuePoolRecyclerMono.subscribe(SimpleDequePool.java:879)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:241)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:204)
    at reactor.core.publisher.FluxPeek$PeekSubscriber.onComplete(FluxPeek.java:260)
    at reactor.core.publisher.Operators.complete(Operators.java:137)
    at reactor.core.publisher.MonoEmpty.subscribe(MonoEmpty.java:46)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:76)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:264)
    at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:241)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.onComplete(MonoIgnoreThen.java:204)
    at reactor.core.publisher.MonoIgnoreElements$IgnoreElementsSubscriber.onComplete(MonoIgnoreElements.java:89)
    at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onComplete(FluxHandleFuseable.java:239)
    at reactor.core.publisher.MonoSupplier$MonoSupplierSubscription.request(MonoSupplier.java:148)
    at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.request(FluxHandleFuseable.java:260)
    at reactor.core.publisher.MonoIgnoreElements$IgnoreElementsSubscriber.onSubscribe(MonoIgnoreElements.java:72)
    at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onSubscribe(FluxHandleFuseable.java:164)
    at reactor.core.publisher.MonoSupplier.subscribe(MonoSupplier.java:48)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:76)
    at reactor.core.publisher.MonoIgnoreThen$ThenIgnoreMain.subscribeNext(MonoIgnoreThen.java:264)
    at reactor.core.publisher.MonoIgnoreThen.subscribe(MonoIgnoreThen.java:51)
    at reactor.core.publisher.InternalMonoOperator.subscribe(InternalMonoOperator.java:76)
    at reactor.core.publisher.MonoDefer.subscribe(MonoDefer.java:53)
    at reactor.core.publisher.FluxUsingWhen$UsingWhenSubscriber.onComplete(FluxUsingWhen.java:385)
    at reactor.core.publisher.MonoIgnoreElements$IgnoreElementsSubscriber.onComplete(MonoIgnoreElements.java:89)
    at reactor.core.publisher.SerializedSubscriber.onComplete(SerializedSubscriber.java:146)
    at reactor.core.publisher.FluxRetryWhen$RetryWhenMainSubscriber.onComplete(FluxRetryWhen.java:203)
    at reactor.core.publisher.MonoPeekTerminal$MonoTerminalPeekSubscriber.onComplete(MonoPeekTerminal.java:299)
    at reactor.core.publisher.SerializedSubscriber.onComplete(SerializedSubscriber.java:146)
    at reactor.core.publisher.FluxTimeout$TimeoutMainSubscriber.onComplete(FluxTimeout.java:235)
    at reactor.core.publisher.FluxHide$SuppressFuseableSubscriber.onComplete(FluxHide.java:147)
    at reactor.core.publisher.MonoNext$NextSubscriber.onComplete(MonoNext.java:102)
    at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:83)
    at reactor.core.publisher.MonoNext$NextSubscriber.onNext(MonoNext.java:82)
    at org.jooq.impl.Internal$1.onNext(Internal.java:749)
    at org.jooq.impl.R2DBC$Forwarding.onNext(R2DBC.java:242)
    at org.jooq.impl.R2DBC$RowCountSubscriber.lambda$onNext$0(R2DBC.java:322)
    at org.jooq.impl.Internal$1.onNext(Internal.java:749)
    at reactor.core.publisher.StrictSubscriber.onNext(StrictSubscriber.java:89)
    at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onNext(FluxHandleFuseable.java:194)
    at reactor.core.publisher.Operators$BaseFluxToMonoOperator.completePossiblyEmpty(Operators.java:2097)
    at reactor.core.publisher.MonoCollectList$MonoCollectListSubscriber.onComplete(MonoCollectList.java:118)
    at reactor.core.publisher.FluxHandle$HandleSubscriber.onComplete(FluxHandle.java:223)
    at reactor.core.publisher.MonoFlatMapMany$FlatMapManyInner.onComplete(MonoFlatMapMany.java:260)
    at reactor.core.publisher.FluxHandleFuseable$HandleFuseableSubscriber.onComplete(FluxHandleFuseable.java:239)
    at reactor.core.publisher.FluxFilterFuseable$FilterFuseableConditionalSubscriber.onComplete(FluxFilterFuseable.java:391)
    at reactor.core.publisher.FluxContextWrite$ContextWriteSubscriber.onComplete(FluxContextWrite.java:126)
    at reactor.core.publisher.FluxPeekFuseable$PeekConditionalSubscriber.onComplete(FluxPeekFuseable.java:940)
    at reactor.core.publisher.FluxPeekFuseable$PeekConditionalSubscriber.onComplete(FluxPeekFuseable.java:940)
    at io.r2dbc.postgresql.util.FluxDiscardOnCancel$FluxDiscardOnCancelSubscriber.onComplete(FluxDiscardOnCancel.java:104)
    at reactor.core.publisher.FluxDoFinally$DoFinallySubscriber.onComplete(FluxDoFinally.java:128)
    at reactor.core.publisher.FluxHandle$HandleSubscriber.onComplete(FluxHandle.java:223)
    at reactor.core.publisher.FluxCreate$BaseSink.complete(FluxCreate.java:465)
    at reactor.core.publisher.FluxCreate$BufferAsyncSink.drain(FluxCreate.java:871)
    at reactor.core.publisher.FluxCreate$BufferAsyncSink.complete(FluxCreate.java:819)
    at reactor.core.publisher.FluxCreate$SerializedFluxSink.drainLoop(FluxCreate.java:249)
    at reactor.core.publisher.FluxCreate$SerializedFluxSink.drain(FluxCreate.java:215)
    at reactor.core.publisher.FluxCreate$SerializedFluxSink.complete(FluxCreate.java:206)
    at io.r2dbc.postgresql.client.ReactorNettyClient$Conversation.complete(ReactorNettyClient.java:668)
    at io.r2dbc.postgresql.client.ReactorNettyClient$BackendMessageSubscriber.emit(ReactorNettyClient.java:934)
    at io.r2dbc.postgresql.client.ReactorNettyClient$BackendMessageSubscriber.onNext(ReactorNettyClient.java:810)
    at io.r2dbc.postgresql.client.ReactorNettyClient$BackendMessageSubscriber.onNext(ReactorNettyClient.java:716)
    at reactor.core.publisher.FluxHandle$HandleSubscriber.onNext(FluxHandle.java:129)
    at reactor.core.publisher.FluxPeekFuseable$PeekConditionalSubscriber.onNext(FluxPeekFuseable.java:854)
    at reactor.core.publisher.FluxMap$MapConditionalSubscriber.onNext(FluxMap.java:224)
    at reactor.core.publisher.FluxMap$MapConditionalSubscriber.onNext(FluxMap.java:224)
    at reactor.netty.channel.FluxReceive.drainReceiver(FluxReceive.java:294)
    at reactor.netty.channel.FluxReceive.onInboundNext(FluxReceive.java:403)
    at reactor.netty.channel.ChannelOperations.onInboundNext(ChannelOperations.java:426)
    at reactor.netty.channel.ChannelOperationsHandler.channelRead(ChannelOperationsHandler.java:114)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
    at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
    at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
    at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
    at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
    at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:800)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:509)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:407)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.base/java.lang.Thread.run(Thread.java:1583)

I changed the memory eventbus to RabbitMQ but still the quite similar error. I tested the Distributed/Memory IT but I do not see the same error. => This error log may be a good lead IMO.

I propose to disable the webSocketShouldPushTeamMailboxStateChanges test and handle it in another ticket.

Arsnael commented 1 month ago

Problem is I had random errors on other tests, even locally. More than one test is unstable IMO

quantranhong1999 commented 1 month ago

Problem is I had random errors on other tests, even locally. More than one test is unstable IMO

Hmm BTW unstable is the nature of those JMAP ITs IMO.

vttranlina commented 1 month ago

I also get the unstable on webSocketShouldPushTeamMailboxStateChanges But the latest postgres is passed, it is unstable when commit upgrade james sha1

quantranhong1999 commented 1 month ago

I try to disable the webSocketShouldPushTeamMailboxStateChanges test here: https://github.com/linagora/tmail-backend/pull/1091

Let's see...

vttranlina commented 1 month ago

The issue comes from PoolBackedPostgresConnectionFactory Commit: https://github.com/apache/james-project/commit/435600614fcfbc9bb2f4dccfea1b43a5069bd5a6#diff-29c9eefc2a62e1248232ac97ddadeaab789ffb85b10351e13e18aaa4e26c2195 (the change of PostgresCommonModule.java)

I tried to revert it to SinglePostgresConnectionFactory and webSocketShouldPushTeamMailboxStateChanges passed.

It looks like an issue related to what we did with Reactor (not logic issue), the StateChangeListener already emits 2 messages, but the client subscriber receives only one.

Arsnael commented 1 month ago

Thanks for the help team, that should do it :)

vttranlina commented 1 month ago

Note that my fixup commit just makes test webSocketShouldPushTeamMailboxStateChanges not hanging It still unstable

We can @disable it as Quan's suggestion

// the 2 prs in james will help postgres-app better, (it doesn't fix the unstable)

Arsnael commented 1 month ago

It still unstable

Is it? Seemed okish to me :)

Arsnael commented 1 month ago

When I said other tests were unstable and not just the websocket one... https://james-jenkins.lin-saas.com/job/Tmail%20build/job/PR-1035/10/

T_T