grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0

Many DirectByteBuffers with high capacity when using netty shaded client #11314

Open cudothanh-Nhan opened 1 week ago

cudothanh-Nhan commented 1 week ago

What version of gRPC-Java are you using?

1.60.0

What is your environment?

jdk-18.0.2.1-x64 Linux 3.10.0-1160.76.1.el7.x86_64

Client initialization?

        NettyChannelBuilder.forTarget(target)
            .withOption(ChannelOption.CONNECT_TIMEOUT_MILLIS, timeout)
            .defaultLoadBalancingPolicy("round_robin")
            .keepAliveTime(60, TimeUnit.SECONDS)
            .keepAliveWithoutCalls(true)
            .sslContext(
                GrpcSslContexts.forClient()
                    .trustManager(InsecureTrustManagerFactory.INSTANCE)
                    .build());

JVM properties?

/zserver/java/jdk-18.0.2.1/bin/java --add-opens=java.base/jdk.internal.misc=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED -Dio.netty.tryReflectionSetAccessible=true -Dzappname=kiki-asr-streaming-websocket -Dzappprof=production -Dzconfdir=conf -Dzconffiles=config.ini -Djzcommonx.version=LATEST -Dzicachex.version=LATEST -Dzlogconffile=log4j2.yaml -Dlog4j2.configurationFile=conf/production.log4j2.yaml -Dlog4j2.contextSelector=org.apache.logging.log4j.core.async.AsyncLoggerContextSelector -Dlog4j2.immediateFlush=false -Djava.net.preferIPv4Stack=true -XX:+AlwaysPreTouch -XX:+UseTLAB -XX:+ResizeTLAB -XX:+PerfDisableSharedMem -Xms1G -Xmx2G -XX:+UseG1GC -XX:MaxGCPauseMillis=500 -XX:InitiatingHeapOccupancyPercent=70 -XX:ParallelGCThreads=24 -XX:ConcGCThreads=24 -XX:+ParallelRefProcEnabled -XX:-ResizePLAB -XX:G1RSetUpdatingPauseTimePercent=5 -Dspring.config.location=optional:file:./conf/production.spring.yaml -Dorg.springframework.boot.logging.LoggingSystem=none -jar /zserver/java-projects/kiki-asr-streaming-websocket/dist/kiki-asr-streaming-websocket-1.3.1.jar

What did you expect to see?

Stable number of DirectByteBuffer objects

What did you see instead?

Increasing number of DirectByteBuffer objects.

This is my OQL query to list the capacities of about 1,832 objects (screenshot).

These are the GC root references from a sample FastThreadLocalThread that holds a DirectByteBuffer with a capacity of about 2 MB, and there are a lot of objects like that (screenshot).

Besides, I noticed that there are many DirectByteBuffer objects with a null cleaner. Is that intentional in Netty's implementation?
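
For context, here is a quick check I could run to confirm whether Netty is allocating no-cleaner direct buffers at all (a sketch only; PlatformDependent is Netty-internal API, and with the shaded client the package carries the io.grpc.netty.shaded prefix):

    import io.netty.util.internal.PlatformDependent;

    public class NoCleanerCheck {
        public static void main(String[] args) {
            // With --add-opens java.base/jdk.internal.misc=ALL-UNNAMED and
            // -Dio.netty.tryReflectionSetAccessible=true (both set in the JVM args above),
            // Netty can allocate direct buffers without a Cleaner and free them itself,
            // which would show up as DirectByteBuffer instances with a null cleaner field.
            System.out.println("hasUnsafe:                " + PlatformDependent.hasUnsafe());
            System.out.println("useDirectBufferNoCleaner: " + PlatformDependent.useDirectBufferNoCleaner());
            System.out.println("maxDirectMemory:          " + PlatformDependent.maxDirectMemory());
        }
    }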

Steps to reproduce the bug

ejona86 commented 1 week ago

Increasing number of DirectByteBuffer objects.

That doesn't tell us much. And you only give us one data point.

Does your machine have many cores? #4317 and #5671 are about many threads. The screenshot shows details of an EpollEventLoop; we would expect there to be a cache there.

cudothanh-Nhan commented 1 week ago

Our app runs on a machine with 48 cores. I can give you the full heap dump here: https://drive.google.com/file/d/1ycFKIrlkxqTIVYuciAIw0j2RyZR4pupS/view?usp=sharing

I understand that we expect a cache for each EpollEventLoop, but this seems like too much memory. From the heap dump, you can see that one event loop holds about 16 DirectByteBuffers in the small subpage area, each with a capacity of 2 MB. That means each event loop occupies about 16 x 2 MB = 32 MB, so roughly 32-40 MB per event loop.
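
As a back-of-the-envelope check, here is the same estimate written out (the counts are just what I read from my heap dump, not Netty constants):

    public class PerEventLoopEstimate {
        public static void main(String[] args) {
            // Numbers observed in the heap dump above; an observation, not a Netty constant.
            long eventLoops = 48;                         // one event loop per core
            long buffersPerLoop = 16;                     // small-subpage DirectByteBuffers seen per loop
            long bufferBytes = 2L * 1024 * 1024;          // ~2 MB capacity each

            long perLoop = buffersPerLoop * bufferBytes;  // ~32 MB per event loop
            long total = eventLoops * perLoop;            // ~1.5 GB across all 48 loops
            System.out.printf("per event loop: %d MB, total: %d MB%n", perLoop >> 20, total >> 20);
        }
    }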

Does it sound reasonable? @ejona86

cudothanh-Nhan commented 1 week ago

I also wonder whether there is a limit on the number of DirectByteBuffers inside each subpage area.

ejona86 commented 1 week ago

gRPC reduces the subpage size to 2 MiB to reduce memory. It also reduces the number of threads to the number of cores. I think what's hurting here is the number of threads. If we reduced the number of threads by half, would that get into a reasonable state, or are you hoping for even more memory usage reduction?
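
One way to experiment with that on the client side is to pass the builder a smaller, shared EventLoopGroup instead of relying on the default one-thread-per-core group. A rough sketch, where the thread count and the NIO channel type are only examples (with the shaded artifact the packages carry the io.grpc.netty.shaded prefix):

    import io.grpc.ManagedChannel;
    import io.grpc.netty.NettyChannelBuilder;
    import io.netty.channel.EventLoopGroup;
    import io.netty.channel.nio.NioEventLoopGroup;
    import io.netty.channel.socket.nio.NioSocketChannel;

    public class BoundedEventLoops {
        public static ManagedChannel newChannel(String target) {
            // One bounded, shared event loop group instead of one thread per core.
            EventLoopGroup sharedGroup = new NioEventLoopGroup(8);   // 8 threads is only an example
            return NettyChannelBuilder.forTarget(target)
                .eventLoopGroup(sharedGroup)              // reuse across channels to bound thread count
                .channelType(NioSocketChannel.class)      // channel type must match the group type
                .usePlaintext()                           // or sslContext(...) as in the original report
                .build();
        }
    }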

cudothanh-Nhan commented 1 week ago

I mean that while each subpage is only 2 MB, there is still potential memory pressure when there are many of them. Even if my server had only 1 core, one event loop could contain multiple subpages of 2 MB each. @ejona86

cudothanh-Nhan commented 1 week ago

After diving deep into the Netty implementation: while the number of PoolChunk objects is stable (48 objects for 48 cores), I found many DirectByteBuffer objects referenced by PoolThreadCache (about 1,154 objects, as shown in the image below).

(screenshot)

Note that my gRPC client uses the default gRPC executor, which is, in turn, a cached thread pool executor.

Is the native memory occupied by DirectByteBuffer freed once an executor thread no longer exists? I think not, because I see a lot of DirectByteBuffer objects being held in PoolThreadCache.
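
One knob that might be worth trying here (a sketch, not something verified to fix this) is giving the channel a bounded executor instead of the default cached thread pool, so the set of application threads that ever touch the pooled allocator stays small and long-lived; the pool size below is just an example:

    import io.grpc.ManagedChannel;
    import io.grpc.netty.NettyChannelBuilder;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class BoundedAppExecutor {
        public static ManagedChannel newChannel(String target) {
            // A fixed pool instead of the default cached (unbounded, expiring) thread pool.
            ExecutorService appExecutor = Executors.newFixedThreadPool(16);  // 16 is only an example
            return NettyChannelBuilder.forTarget(target)
                .executor(appExecutor)     // RPC callbacks run on this bounded pool
                .usePlaintext()            // or sslContext(...) as in the original report
                .build();
        }
    }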

cudothanh-Nhan commented 1 week ago

It seems that one PoolThreadCache can contain many DirectByteBuffer objects, so if one PoolThreadCache contains 40 SmallSubPageDirectCaches, it can consume up to 40 x 2 MB = 80 MB of native memory.

(screenshot)

Am I right?
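
To double-check that estimate, the allocator's own metrics could be read instead of counting objects in the dump. A sketch, assuming access to the allocator instance (with the shaded client these classes live under io.grpc.netty.shaded.io.netty.buffer, and gRPC builds its own allocator, so PooledByteBufAllocator.DEFAULT below is only illustrative):

    import io.netty.buffer.PooledByteBufAllocator;
    import io.netty.buffer.PooledByteBufAllocatorMetric;

    public class AllocatorStats {
        // Prints the pooled allocator's own accounting instead of estimating from the heap dump.
        public static void dump(PooledByteBufAllocator alloc) {
            PooledByteBufAllocatorMetric m = alloc.metric();
            System.out.println("chunkSize:            " + m.chunkSize());
            System.out.println("numDirectArenas:      " + m.numDirectArenas());
            System.out.println("numThreadLocalCaches: " + m.numThreadLocalCaches());
            System.out.println("usedDirectMemory:     " + m.usedDirectMemory());
        }

        public static void main(String[] args) {
            // DEFAULT is only for illustration; the shaded gRPC client builds its own
            // allocator instance, so these numbers may not match the buffers in the dump.
            dump(PooledByteBufAllocator.DEFAULT);
        }
    }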

hakusai22 commented 6 hours ago

@ejona86 Hello, is there any progress on this issue?