linagora / james-project

Mirror of Apache James Project
Apache License 2.0
72 stars 62 forks source link

Upgrade dependency opensearch-java 2.1.0 to newer #4821

Open vttranlina opened 1 year ago

vttranlina commented 1 year ago

The update opensearch-java 2.1.0 -> 2.6.0 make a test case fail.

Investigate why and update it successfully

https://github.com/apache/james-project/pull/1647#issuecomment-1639180718

https://github.com/apache/james-project/pull/1647#issuecomment-1639286447

EDIT @Arsnael :

We need as well to:

Arsnael commented 1 year ago

JIRA: https://issues.apache.org/jira/browse/JAMES-3929

Arsnael commented 1 year ago

PR: https://github.com/apache/james-project/pull/1648

Locally it seemed to pass, but let's see with the CI. Then I will run some perf tests

Arsnael commented 1 year ago

opensearch.zip

Flame graphs and thread dump... Everything seems blocked but really hard to see what's wrong. I think I need to do those maybe more like a bit after I started a run? I waited too long and James seems completely bloated. Will try again

quantranhong1999 commented 1 year ago

Can u try to get a heap dump (may be upon OOM - JVM has an option for this) and share to the team?

chibenwa commented 1 year ago

A CPU spending 80% of it's time allocating arrays: there's something wrong!

Arsnael commented 1 year ago

A CPU spending 80% of it's time allocating arrays: there's something wrong!

Noticed that, not sure where to look at though yet^^'

Can u try to get a heap dump (may be upon OOM - JVM has an option for this) and share to the team?

I did actually, just forgot to upload it yesterday, as it was too big for the zip for github.

Arsnael commented 1 year ago

https://tdrive.linagora.com/shared/111111114AxHLNjGvEGJpA/drive/wAFGZg85kbTkqyFQa4XcL7/t/43aa80f3b2a79faa34d4635e6627be9d0f17d63d

This tie I tried to capture earlier, even if it gets bloated pretty quickly again.

I think in opensearch driver threads, what's interesting is:

    at org.apache.hc.core5.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:54)
    at org.opensearch.client.transport.httpclient5.internal.HeapBufferedAsyncEntityConsumer.data(HeapBufferedAsyncEntityConsumer.java:91)
    at org.apache.hc.core5.http.nio.entity.AbstractBinDataConsumer.consume(AbstractBinDataConsumer.java:75)
    at org.apache.hc.core5.http.nio.support.AbstractAsyncResponseConsumer.consume(AbstractAsyncResponseConsumer.java:141)
    at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.consume(HttpAsyncMainClientExec.java:227)
    at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamHandler.consumeData(ClientHttp1StreamHandler.java:265)
    at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamDuplexer.consumeData(ClientHttp1StreamDuplexer.java:354)
    at org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onInput(AbstractHttp1StreamDuplexer.java:325)
    at org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.inputReady(AbstractHttp1IOEventHandler.java:64)
    at org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.inputReady(ClientHttp1IOEventHandler.java:41)
    at org.apache.hc.core5.reactor.ssl.SSLIOSession.decryptData(SSLIOSession.java:551)
    at org.apache.hc.core5.reactor.ssl.SSLIOSession.access$400(SSLIOSession.java:72)
    at org.apache.hc.core5.reactor.ssl.SSLIOSession$1.inputReady(SSLIOSession.java:172)
    at org.apache.hc.core5.reactor.InternalDataChannel.onIOEvent(InternalDataChannel.java:133)
    at org.apache.hc.core5.reactor.InternalChannel.handleIOEvent(InternalChannel.java:51)
    at org.apache.hc.core5.reactor.SingleCoreIOReactor.processEvents(SingleCoreIOReactor.java:178)
    at org.apache.hc.core5.reactor.SingleCoreIOReactor.doExecute(SingleCoreIOReactor.java:127)
    at org.apache.hc.core5.reactor.AbstractSingleCoreIOReactor.execute(AbstractSingleCoreIOReactor.java:85)
    at org.apache.hc.core5.reactor.IOReactorWorker.run(IOReactorWorker.java:44)
    at java.lang.Thread.runWith(java.base@20.0.1/Unknown Source)
    at java.lang.Thread.run(java.base@20.0.1/Unknown Source)

We can see it getting errors on ByteArrayBuffer, but from the core5 lib of apache. I;m not sure how far I can dig in it, seeing how it's hard to be able to fully focus on a task for me these days. But will try

quantranhong1999 commented 1 year ago

https://tdrive.linagora.com/shared/111111114AxHLNjGvEGJpA/drive/wAFGZg85kbTkqyFQa4XcL7/t/43aa80f3b2a79faa34d4635e6627be9d0f17d63d

Q: This heap dump is achieved from the JVM running tests, right?

A bit analysis: https://heaphero.io/heap-report-wc.jsp?p=RThCZ0lOakJ3K2dCd3FiSEc4VEhsdEkxUGFsRW8rR2tWWjNxdm5rZjVTeFBmUEVYc0xHVEpCK0NkSEF4MVpoYlFnaHBWNkU1dXJKVHpHdnFxOW1Xamc9PQ==

5 times "One instance of "org.apache.hc.core5.reactor.InternalDataChannel" loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x740050000" occupies 104,961,680 (15.68%) bytes".

It is exactly 5 tests in WithCassandraBlobStoreTest.

So my guess is that test memory blows up because we do not release the hc5 client gracefully after each test / upon James shutdown.

I checked for production code, the ClientProvider bean is SINGLETON, so we should be good from OOM issue on prod.

Arsnael commented 1 year ago

I tried honestly to reproduce it first running WithCassandraBlobStoreTest, by even putting higher number of messages and I could not reproduce it locally^^' Thus why I switched back with sandbox

Arsnael commented 1 year ago

Really can't get around this. Been trying to check the code on opensearch-java but hardly understands all those schenanigans.

What I think though, is that if you look here: https://heaphero.io/heap-report-wc.jsp?p=RThCZ0lOakJ3K2dCd3FiSEc4VEhsdEkxUGFsRW8rR2tWWjNxdm5rZjVTeFBmUEVYc0xHVEpCK0NkSEF4MVpoYlFnaHBWNkU1dXJKVHpHdnFxOW1Xamc9PQ== (cool online tool btw @quantranhong1999 ) the value of one of the array byte reported as leaking suspect (actually same for all):

{"_scroll_id":"FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFnAzVm5fQmdqUWltOFN1RGlDbTQ2cGcAAAAAAAWREBZYcjBBdjlwYlNST3ktd3kzSlBkcWlR","took":40,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":7179,"relation":"eq"},"max_score":null,"hits":[{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:403","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[403]},"sort":[403]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:404","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[404]},"sort":[404]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:405","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[405]},"sort":[405]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:406","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[406]},"sort":[406]},{"_index":"mailbox_v2","_id":"a29f0

So it seems to happen for scrolls. Maybe it doesn't clean up those until the scroll is finished, and because in preprod we have lots of records, it goes boom before the end? Something in those lines I guess.

Arsnael commented 9 months ago

IMAP :

Image

JMAP:

Image

Perf are similar to release 0.8.2, no impact on normal simulation tests

Will try imap full simulation with more occurences on the searchs. Also actually should as well disable searchOverrides that are enabled on ou staging environments to fully test the impact (just thought about it)

Arsnael commented 9 months ago

Well this is a run with searchOverrides off, only IMAP simulation.

0.8.2

Image

http5 update

Image

We can see quite a difference here on the searches. With the work on this PR, dropping the rest client, we loose 35% on searchDeleted and 48% on searchUnseen for P99. Also, there is a 10% perf loss as well on the mean for searchDeleted and searchUnseen, compared to 0.8.2.

I think the work done on the java client is still not good enough to get rid yet of the rest client for transport.

In favor to wait and not merge this work yet.

Other feedbacks welcomed :)

chibenwa commented 8 months ago

Agree

quantranhong1999 commented 8 months ago

In favor to wait and not merge this work yet.

+1

Arsnael commented 8 months ago

Alright putting it back to the backlog then, but was worth a try, seems better than last time already, just not good enough yet :)

Arsnael commented 4 months ago

Latest master IMAP

Image

Latest master with opensearch http5 transport client

Image

With searchoverrides disabled of course, to force the hits on opensearch.

Huge loss of perfs... Much worse than a few months ago actually, which is rather odd...

Also the docker image size for both is the same, so there is no gain in that either...

Not worth at all the switch

chibenwa commented 4 months ago

May I ask for flame graphs of the terrible run?

Arsnael commented 4 months ago

When I finish with release runs

Arsnael commented 4 months ago

flame_graphs.zip

The core5 lib used for transport with OS is clearly taking around 30% cpu and 50% memory. Badly optimized?