
Diagnose file leak (on subscriptions cancelation?) #5093

Closed chibenwa closed 6 months ago

chibenwa commented 6 months ago

On some JMAP pods:

root@tmail-jmap-9b8fdf456-dqdf5:/# ls -lah /tmp
total 334M
drwxrwxrwt 3 root root  420 Feb 26 20:11 .
drwxr-xr-x 1 root root   29 Feb 26 17:25 ..
-rw------- 1 root root  13M Feb 26 17:25 FileBackedOutputStream10594094494347736916.tmp
-rw------- 1 root root  25M Feb 26 17:25 FileBackedOutputStream10980707325171474942.tmp
-rw------- 1 root root  22M Feb 26 17:25 FileBackedOutputStream11687724217761581125.tmp
-rw------- 1 root root 5.9M Feb 26 17:38 FileBackedOutputStream119019363184853142.tmp
-rw------- 1 root root  17M Feb 26 17:38 FileBackedOutputStream1307685673793621723.tmp
-rw------- 1 root root  26M Feb 26 17:25 FileBackedOutputStream14638782752730555587.tmp
-rw------- 1 root root  26M Feb 26 17:25 FileBackedOutputStream15011331298270667407.tmp
-rw------- 1 root root  11M Feb 26 17:25 FileBackedOutputStream15089590116576036537.tmp
-rw------- 1 root root  22M Feb 26 17:25 FileBackedOutputStream17289452054817035875.tmp
-rw------- 1 root root  26M Feb 26 17:25 FileBackedOutputStream1957476570418591459.tmp
-rw------- 1 root root  11M Feb 26 17:25 FileBackedOutputStream2442599100874246678.tmp
-rw------- 1 root root  17M Feb 26 17:38 FileBackedOutputStream3510790307308971927.tmp
-rw------- 1 root root  22M Feb 26 17:25 FileBackedOutputStream3745625981530970388.tmp
-rw------- 1 root root  17M Feb 26 17:25 FileBackedOutputStream4126493809353276166.tmp
-rw------- 1 root root 9.2M Feb 26 17:25 FileBackedOutputStream5242658159025861895.tmp
-rw------- 1 root root  22M Feb 26 17:25 FileBackedOutputStream8249913266934356215.tmp
-rw------- 1 root root  25M Feb 26 17:25 FileBackedOutputStream8638915075211756941.tmp
-rw------- 1 root root  26M Feb 26 17:25 FileBackedOutputStream9238175485064170759.tmp
drwxr-xr-x 2 root root   60 Feb 26 17:25 hsperfdata_root

And in the logs:

java.io.IOException: No space left on device
    at java.base/java.io.FileOutputStream.writeBytes(Native Method)
    at java.base/java.io.FileOutputStream.write(Unknown Source)
    at com.google.common.io.FileBackedOutputStream.write(FileBackedOutputStream.java:218)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1486)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:1105)
    at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1458)
    at org.apache.commons.io.IOUtils.copy(IOUtils.java:1083)
    at org.apache.james.server.blob.deduplication.DeDuplicationBlobStore.$anonfun$saveAndGenerateBlobId$1(DeDuplicationBlobStore.scala:85)
    at reactor.core.publisher.MonoCallable.call(MonoCallable.java:72)
    at reactor.core.publisher.FluxSubscribeOnCallable$CallableSubscribeOnSubscription.run(FluxSubscribeOnCallable.java:228)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
    at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

And

org.apache.james.blob.api.ObjectStoreIOException: Exception occurred while saving bytearray
    at org.apache.james.blob.aes.AESBlobStoreDAO.lambda$save$5(AESBlobStoreDAO.java:167)
    at reactor.core.publisher.Mono.lambda$onErrorMap$28(Mono.java:3799)
    at reactor.core.publisher.FluxOnErrorResume$ResumeSubscriber.onError(FluxOnErrorResume.java:94)
    at reactor.core.publisher.MonoSubscribeOn$SubscribeOnSubscriber.onError(MonoSubscribeOn.java:152)
    at reactor.core.publisher.Operators.error(Operators.java:198)
    at reactor.core.publisher.MonoUsing.subscribe(MonoUsing.java:78)
    at reactor.core.publisher.Mono.subscribe(Mono.java:4512)
    at reactor.core.publisher.MonoSubscribeOn$SubscribeOnSubscriber.run(MonoSubscribeOn.java:126)
    at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:84)
    at reactor.core.scheduler.WorkerTask.call(WorkerTask.java:37)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: java.lang.RuntimeException: Unable to build payload for object storage, failed to encrypt
    at org.apache.james.blob.aes.AESBlobStoreDAO.encrypt(AESBlobStoreDAO.java:72)
    at org.apache.james.blob.aes.AESBlobStoreDAO.lambda$save$3(AESBlobStoreDAO.java:163)
    at reactor.core.publisher.MonoUsing.subscribe(MonoUsing.java:75)
    ... 9 common frames omitted
Caused by: java.io.IOException: No space left on device
    at java.base/java.io.FileOutputStream.writeBytes(Native Method)
    at java.base/java.io.FileOutputStream.write(Unknown Source)
    at com.google.common.io.FileBackedOutputStream.write(FileBackedOutputStream.java:218)
    at com.google.crypto.tink.subtle.StreamingAeadEncryptingStream.write(StreamingAeadEncryptingStream.java:85)
    at java.base/java.io.InputStream.transferTo(Unknown Source)
    at java.base/java.io.FileInputStream.transferTo(Unknown Source)
    at org.apache.james.blob.aes.AESBlobStoreDAO.encrypt(AESBlobStoreDAO.java:68)
    ... 11 common frames omitted

Those are corner cases, as this very rarely happens.

As far as I could see, it looks like DeletedMessageVaultCallback was the triggering event.

The damn thing runs with a 60s timeout (thanks, RabbitMQ), which might actually lead to cancellation.

Publisher cancellation might not be handled by some parts of our pipelines.
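
To illustrate, here is a minimal sketch (hypothetical helper name and threshold; only FileBackedOutputStream, IOUtils.copy and Reactor are taken from the traces above) of how tying the buffer's lifecycle to the subscription with Mono.using would avoid the leak:

    import java.io.IOException;
    import java.io.InputStream;

    import org.apache.commons.io.IOUtils;

    import com.google.common.io.FileBackedOutputStream;

    import reactor.core.publisher.Mono;

    public class BufferingSketch {
        // Hypothetical helper, not James's actual code.
        public static Mono<byte[]> buffer(InputStream payload) {
            return Mono.using(
                () -> new FileBackedOutputStream(100 * 1024),  // spills to FileBackedOutputStream*.tmp past 100 KiB
                fbos -> Mono.fromCallable(() -> {
                    IOUtils.copy(payload, fbos);               // the copy seen in the first stack trace
                    fbos.close();
                    return fbos.asByteSource().read();
                }),
                fbos -> {
                    try {
                        fbos.reset();                          // reset() deletes the backing temp file
                    } catch (IOException e) {
                        // best-effort cleanup; log in real code
                    }
                });
        }
    }

With a bare Mono.fromCallable, a downstream timeout firing mid-copy cancels the subscription and nothing ever calls reset(), which would match the orphaned FileBackedOutputStream*.tmp files above; Mono.using runs the cleanup callback on complete, error and cancel alike.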

chibenwa commented 6 months ago

I tried a quick diagnostic, but if we do not succeed that way in a timely manner, using LeakAware might be needed!
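
The general idea behind such leak tracking (a hypothetical sketch of the technique, not James's actual LeakAware API) is to register each resource with a java.lang.ref.Cleaner and report any instance that gets garbage-collected without having been closed:

    import java.lang.ref.Cleaner;
    import java.util.concurrent.atomic.AtomicBoolean;

    public class LeakDetectingResource implements AutoCloseable {
        private static final Cleaner CLEANER = Cleaner.create();

        private final AtomicBoolean closed = new AtomicBoolean(false);
        private final Cleaner.Cleanable cleanable;

        public LeakDetectingResource(String label) {
            AtomicBoolean closedRef = closed;
            // The action must not capture 'this', or the object could never be collected.
            // It runs once the wrapper becomes unreachable: if close() was never
            // called by then, the underlying temp file leaked.
            this.cleanable = CLEANER.register(this, () -> {
                if (!closedRef.get()) {
                    System.err.println("LEAK: resource '" + label + "' was never closed");
                }
            });
        }

        @Override
        public void close() {
            closed.set(true);
            cleanable.clean(); // runs the action now; since closed is true, nothing is reported
        }
    }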

chibenwa commented 6 months ago

Spotted again today

-rw------- 1 root root  20M Feb 27 14:45 aesencrypt1090356846759433124.tmp
-rw------- 1 root root 1.1M Feb 27 14:45 aesencrypt12150808671220843742.tmp
-rw------- 1 root root 190K Feb 27 15:13 aesencrypt12270433280743716700.tmp
-rw------- 1 root root 308K Feb 27 15:13 aesencrypt14012496583013195642.tmp
-rw------- 1 root root 1.6M Feb 27 14:45 aesencrypt14207565799113878226.tmp
-rw------- 1 root root 819K Feb 27 15:13 aesencrypt16234583170132255200.tmp
-rw------- 1 root root  29M Feb 27 14:45 aesencrypt17148122387552507888.tmp
-rw------- 1 root root 1.2M Feb 27 14:45 aesencrypt18202933902701442902.tmp
-rw------- 1 root root 447K Feb 27 14:45 aesencrypt3445986306924536171.tmp
-rw------- 1 root root 432K Feb 27 15:13 aesencrypt4454941668781397544.tmp
-rw------- 1 root root 844K Feb 27 15:13 aesencrypt7032727805399581833.tmp
-rw------- 1 root root 1.1M Feb 27 14:45 aesencrypt7142968592193439328.tmp
-rw------- 1 root root 1.5M Feb 27 14:45 aesencrypt729192674429865014.tmp
-rw------- 1 root root 7.9M Feb 27 14:45 FileBufferedBodyFactory12196286621053298905.tmp
-rw------- 1 root root 147K Feb 27 14:45 FileBufferedBodyFactory1286078198665713980.tmp
-rw------- 1 root root 1.1M Feb 27 14:45 FileBufferedBodyFactory14551617268047995378.tmp
-rw------- 1 root root 511K Feb 27 14:45 FileBufferedBodyFactory16413942727464907789.tmp
-rw------- 1 root root  27M Feb 27 14:45 FileBufferedBodyFactory1818233457731727059.tmp
-rw------- 1 root root 1.5M Feb 27 14:45 FileBufferedBodyFactory3831952595233933979.tmp
-rw------- 1 root root 446K Feb 27 14:45 FileBufferedBodyFactory5940323976140619987.tmp
-rw------- 1 root root 146K Feb 27 14:45 FileBufferedBodyFactory6010870541582569665.tmp
-rw------- 1 root root 1.2M Feb 27 14:45 FileBufferedBodyFactory660729112906107165.tmp
-rw------- 1 root root 2.7M Feb 27 14:45 FileBufferedBodyFactory8039137499703867154.tmp
-rw------- 1 root root 2.7M Feb 27 14:45 FileBufferedBodyFactory8138595856361894791.tmp
drwxr-xr-x 2 root root   60 Feb 27 14:45 hsperfdata_root
-rw------- 1 root root 845K Feb 27 15:13 imap-literal10068127184381344361.tmp
-rw------- 1 root root  29M Feb 27 14:45 imap-literal1157096732197034832.tmp
-rw------- 1 root root  15M Feb 27 14:45 imap-literal13899523218875244485.tmp
-rw------- 1 root root 310K Feb 27 15:13 imap-literal14011594112211787081.tmp
-rw------- 1 root root 816K Feb 27 15:13 imap-literal14036192792378682392.tmp
-rw------- 1 root root 7.5M Feb 27 15:13 imap-literal14048239475853386882.tmp
-rw------- 1 root root 6.4M Feb 27 14:45 imap-literal18314332679400082689.tmp
-rw------- 1 root root 7.9M Feb 27 14:45 imap-literal18397798873854138783.tmp
-rw------- 1 root root  55M Feb 27 14:45 imap-literal3133932598444182451.tmp
-rw------- 1 root root 431K Feb 27 15:13 imap-literal4162503995144712492.tmp
-rw------- 1 root root 1.1M Feb 27 14:45 imap-literal5724799145470707674.tmp
-rw------- 1 root root 200K Feb 27 15:13 imap-literal592552489419855185.tmp
-rw------- 1 root root 1.5M Feb 27 14:45 imap-literal6784841676601606534.tmp
-rw------- 1 root root  40K Feb 27 15:13 imap-literal703355375567653488.tmp
-rw------- 1 root root  12M Feb 27 14:45 imap-literal7265937497129667509.tmp

I was able to correlate those timestamps with a pod restart: the pod was killed by the K8s host for excessive memory consumption.

1 GB for Netty off-heap buffers, JVM memory structures, and tmpfs is too little, so I added 500 more MB. That's for the OOM part.

However, even with a pre-stop hook, K8s did not clean up the files in /tmp, which makes the issue worse. I suspect the pre-stop hook did not run because the OOM kill hard-killed the pod.
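
Since a hard kill leaves the orphans behind, one mitigation would be a defensive sweep of /tmp at startup (a sketch: the name prefixes come from the listings above, and the one-hour cutoff is an arbitrary assumption):

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.time.Duration;
    import java.time.Instant;

    public class TmpSweeper {
        public static void sweep(Path tmpDir, Duration maxAge) throws IOException {
            Instant cutoff = Instant.now().minus(maxAge);
            String glob = "{FileBackedOutputStream,aesencrypt,FileBufferedBodyFactory,imap-literal}*.tmp";
            try (DirectoryStream<Path> leftovers = Files.newDirectoryStream(tmpDir, glob)) {
                for (Path file : leftovers) {
                    // Only remove files old enough that no live request can still own them.
                    if (Files.getLastModifiedTime(file).toInstant().isBefore(cutoff)) {
                        Files.deleteIfExists(file);
                    }
                }
            }
        }
    }

Called as, e.g., TmpSweeper.sweep(Path.of("/tmp"), Duration.ofHours(1)) before the server starts accepting traffic.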

chibenwa commented 6 months ago

https://ci.linagora.com/linagora/lrs/saas/deployments/cnb/cnb-prd-apps/-/merge_requests/60

chibenwa commented 6 months ago

[Screenshot from 2024-02-27 16-42-53]

^^