Open vttranlina opened 1 year ago
PR: https://github.com/apache/james-project/pull/1648
Locally it seemed to pass, but let's see with the CI. Then I will run some perf tests
Flame graphs and thread dump... Everything seems blocked, but it's really hard to see what's wrong. I think I need to capture them a bit earlier after starting a run? I waited too long and James seems completely bloated. Will try again
Can you try to get a heap dump (maybe upon OOM, the JVM has an option for this) and share it with the team?
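For reference, a minimal sketch of the JVM options in question (the dump path is just a placeholder):

```
# Writes a .hprof heap dump automatically when the JVM throws OutOfMemoryError
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/tmp/james-heap.hprof
```

Alternatively, `jmap -dump:live,format=b,file=/tmp/james-heap.hprof <pid>` can capture one on demand from a running JVM.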
A CPU spending 80% of its time allocating arrays: there's something wrong!
> A CPU spending 80% of its time allocating arrays: there's something wrong!
Noticed that, though I'm not sure where to look yet^^'
> Can you try to get a heap dump (maybe upon OOM, the JVM has an option for this) and share it with the team?
I did actually, I just forgot to upload it yesterday; it was too big to attach as a zip on GitHub.
This time I tried to capture it earlier, even if it gets bloated pretty quickly again.
I think what's interesting is in the OpenSearch driver threads:
at org.apache.hc.core5.util.ByteArrayBuffer.<init>(ByteArrayBuffer.java:54)
at org.opensearch.client.transport.httpclient5.internal.HeapBufferedAsyncEntityConsumer.data(HeapBufferedAsyncEntityConsumer.java:91)
at org.apache.hc.core5.http.nio.entity.AbstractBinDataConsumer.consume(AbstractBinDataConsumer.java:75)
at org.apache.hc.core5.http.nio.support.AbstractAsyncResponseConsumer.consume(AbstractAsyncResponseConsumer.java:141)
at org.apache.hc.client5.http.impl.async.HttpAsyncMainClientExec$1.consume(HttpAsyncMainClientExec.java:227)
at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamHandler.consumeData(ClientHttp1StreamHandler.java:265)
at org.apache.hc.core5.http.impl.nio.ClientHttp1StreamDuplexer.consumeData(ClientHttp1StreamDuplexer.java:354)
at org.apache.hc.core5.http.impl.nio.AbstractHttp1StreamDuplexer.onInput(AbstractHttp1StreamDuplexer.java:325)
at org.apache.hc.core5.http.impl.nio.AbstractHttp1IOEventHandler.inputReady(AbstractHttp1IOEventHandler.java:64)
at org.apache.hc.core5.http.impl.nio.ClientHttp1IOEventHandler.inputReady(ClientHttp1IOEventHandler.java:41)
at org.apache.hc.core5.reactor.ssl.SSLIOSession.decryptData(SSLIOSession.java:551)
at org.apache.hc.core5.reactor.ssl.SSLIOSession.access$400(SSLIOSession.java:72)
at org.apache.hc.core5.reactor.ssl.SSLIOSession$1.inputReady(SSLIOSession.java:172)
at org.apache.hc.core5.reactor.InternalDataChannel.onIOEvent(InternalDataChannel.java:133)
at org.apache.hc.core5.reactor.InternalChannel.handleIOEvent(InternalChannel.java:51)
at org.apache.hc.core5.reactor.SingleCoreIOReactor.processEvents(SingleCoreIOReactor.java:178)
at org.apache.hc.core5.reactor.SingleCoreIOReactor.doExecute(SingleCoreIOReactor.java:127)
at org.apache.hc.core5.reactor.AbstractSingleCoreIOReactor.execute(AbstractSingleCoreIOReactor.java:85)
at org.apache.hc.core5.reactor.IOReactorWorker.run(IOReactorWorker.java:44)
at java.lang.Thread.runWith(java.base@20.0.1/Unknown Source)
at java.lang.Thread.run(java.base@20.0.1/Unknown Source)
We can see the big allocations coming from ByteArrayBuffer, but that is inside Apache's core5 lib. I'm not sure how far I can dig into it, seeing how hard it is for me to fully focus on a task these days. But I will try.
Q: This heap dump was captured from the JVM running the tests, right?
A bit of analysis: https://heaphero.io/heap-report-wc.jsp?p=RThCZ0lOakJ3K2dCd3FiSEc4VEhsdEkxUGFsRW8rR2tWWjNxdm5rZjVTeFBmUEVYc0xHVEpCK0NkSEF4MVpoYlFnaHBWNkU1dXJKVHpHdnFxOW1Xamc9PQ==
5 times "One instance of "org.apache.hc.core5.reactor.InternalDataChannel" loaded by "jdk.internal.loader.ClassLoaders$AppClassLoader @ 0x740050000" occupies 104,961,680 (15.68%) bytes".
That matches exactly the 5 tests in WithCassandraBlobStoreTest.
So my guess is that the test memory blows up because we do not release the hc5 client gracefully after each test / upon James shutdown.
I checked the production code: the ClientProvider bean is a SINGLETON, so we should be safe from this OOM issue in prod.
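A minimal sketch of what releasing the client gracefully could look like on the test side (the test class and host/port wiring here are hypothetical, not the actual James test setup; only the opensearch-java / hc5 calls are real):

```java
import org.apache.hc.core5.http.HttpHost;
import org.junit.jupiter.api.AfterEach;
import org.junit.jupiter.api.BeforeEach;
import org.opensearch.client.json.jackson.JacksonJsonpMapper;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.transport.httpclient5.ApacheHttpClient5Transport;
import org.opensearch.client.transport.httpclient5.ApacheHttpClient5TransportBuilder;

class OpenSearchClientLifecycleTest {
    private ApacheHttpClient5Transport transport;
    private OpenSearchClient client;

    @BeforeEach
    void setUp() {
        // Hypothetical host/port; the real tests would get these from the dockerized OpenSearch extension
        transport = ApacheHttpClient5TransportBuilder
            .builder(new HttpHost("http", "localhost", 9200))
            .setMapper(new JacksonJsonpMapper())
            .build();
        client = new OpenSearchClient(transport);
    }

    @AfterEach
    void tearDown() throws Exception {
        // Closing the transport shuts down the underlying hc5 async client and its IO reactor,
        // which should let the InternalDataChannel buffers seen in the heap dump be collected
        transport.close();
    }
}
```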
Honestly, I first tried to reproduce it by running WithCassandraBlobStoreTest, even with a higher number of messages, and I could not reproduce it locally^^' That's why I switched back to the sandbox.
I really can't get my head around this. I've been trying to check the opensearch-java code but I hardly understand all those shenanigans.
What I think though is that if you look here: https://heaphero.io/heap-report-wc.jsp?p=RThCZ0lOakJ3K2dCd3FiSEc4VEhsdEkxUGFsRW8rR2tWWjNxdm5rZjVTeFBmUEVYc0xHVEpCK0NkSEF4MVpoYlFnaHBWNkU1dXJKVHpHdnFxOW1Xamc9PQ== (cool online tool btw @quantranhong1999), the value of one of the byte arrays reported as a leak suspect (actually the same for all of them) is:
{"_scroll_id":"FGluY2x1ZGVfY29udGV4dF91dWlkDXF1ZXJ5QW5kRmV0Y2gBFnAzVm5fQmdqUWltOFN1RGlDbTQ2cGcAAAAAAAWREBZYcjBBdjlwYlNST3ktd3kzSlBkcWlR","took":40,"timed_out":false,"_shards":{"total":1,"successful":1,"skipped":0,"failed":0},"hits":{"total":{"value":7179,"relation":"eq"},"max_score":null,"hits":[{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:403","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[403]},"sort":[403]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:404","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[404]},"sort":[404]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:405","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[405]},"sort":[405]},{"_index":"mailbox_v2","_id":"a29f0a40-3079-11eb-ad60-d91e26006be3:406","_score":null,"_routing":"a29f0a40-3079-11eb-ad60-d91e26006be3","fields":{"uid":[406]},"sort":[406]},{"_index":"mailbox_v2","_id":"a29f0
So it seems to happen for scrolls. Maybe it doesn't clean those buffers up until the scroll is finished, and because we have lots of records in preprod, it goes boom before the end? Something along those lines, I guess.
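If that theory holds, one mitigation would be to clear the scroll context explicitly as soon as the iteration is done. A minimal sketch with the opensearch-java client (the helper and its wiring are hypothetical, not the actual James scroll iteration code):

```java
import java.io.IOException;
import org.opensearch.client.opensearch.OpenSearchClient;
import org.opensearch.client.opensearch._types.OpenSearchException;
import org.opensearch.client.opensearch.core.ClearScrollResponse;

class ScrollCleanup {
    // Release the server-side scroll context as soon as we are done paging through results,
    // instead of waiting for the scroll keep-alive to expire
    static ClearScrollResponse clearScroll(OpenSearchClient client, String scrollId)
            throws IOException, OpenSearchException {
        return client.clearScroll(c -> c.scrollId(scrollId));
    }
}
```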
IMAP:
JMAP:
Perf is similar to release 0.8.2; no impact on the normal simulation tests.
Will try the IMAP full simulation with more occurrences of the searches. I should also disable the searchOverrides that are enabled on our staging environments to fully test the impact (just thought about it).
Well, this is a run with searchOverrides off, IMAP simulation only.
We can see quite a difference here on the searches. With the work on this PR, dropping the REST client, we lose 35% on searchDeleted and 48% on searchUnseen for P99. There is also a 10% perf loss on the mean for searchDeleted and searchUnseen, compared to 0.8.2.
I think the work done on the Java client is still not good enough to get rid of the REST client for transport yet.
I'm in favor of waiting and not merging this work yet.
Other feedback welcome :)
Agree
> I'm in favor of waiting and not merging this work yet.
+1
Alright, putting it back in the backlog then. It was worth a try; it already seems better than last time, just not good enough yet :)
With searchOverrides disabled of course, to force the hits on OpenSearch.
Huge perf loss... Much worse than a few months ago actually, which is rather odd...
Also, the Docker image size is the same for both, so there is no gain there either...
The switch is not worth it at all.
May I ask for flame graphs of the terrible run?
When I finish with the release runs.
The core5 lib used for the transport with OpenSearch is clearly taking around 30% of CPU and 50% of memory. Badly optimized?
The update of opensearch-java 2.1.0 -> 2.6.0 makes a test case fail.
Investigate why and perform the update successfully.
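For reference, a sketch of the dependency coordinates involved in the bump (how and where James actually pins the version is not shown here):

```xml
<!-- opensearch-java 2.1.0 -> 2.6.0 -->
<dependency>
    <groupId>org.opensearch.client</groupId>
    <artifactId>opensearch-java</artifactId>
    <version>2.6.0</version>
</dependency>
```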
https://github.com/apache/james-project/pull/1647#issuecomment-1639180718
https://github.com/apache/james-project/pull/1647#issuecomment-1639286447
EDIT @Arsnael: We also need to: