Closed JonathanShifman closed 9 months ago
Interesting! Few observations:
- are you using the latest gRPC Java release? Which gRPC version are you using?
- although heap memory shows some growth, I don't see a leak there, and the difference is only ~2x
- the main difference is in the number of direct-memory buffers (jvm_buffer_count_buffers). For some reason the buffers are not being released. This also shows up in direct memory bytes
- the leak in direct memory seems to be the most obvious factor, but there may be other factors as well, since the direct-memory growth (600M to 3G, an increase of ~2.5G) does not fully explain the container memory growth (3.5G to 8.5G, an increase of ~5G)
In any case let's just focus on the direct memory growth. gRPC uses Netty which uses direct memory. You can use JVM properties to print out logs related to potential leaks as described in https://netty.io/wiki/reference-counted-objects.html#troubleshooting-buffer-leaks . Could you use these to get some debug output and see if that points to something? That would be the first step.
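For reference, the leak-detection level described on that Netty wiki page can also be set with JVM system properties at launch time (the application jar name below is a placeholder, not the project's actual artifact):

```shell
# Enable Netty's most verbose leak detection. "paranoid" tracks every
# allocated buffer and records access points, so it is noticeably slower;
# use it only while debugging. targetRecords controls how many stack
# traces are kept per tracked buffer.
java -Dio.netty.leakDetection.level=paranoid \
     -Dio.netty.leakDetection.targetRecords=20 \
     -jar app.jar
```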
Hi @sanjaypujare, apologies for taking so long to reply, had to divert attention to other issues for a while.
We are using grpc-spring-boot-starter version 4.7.0, imported using Gradle like so:
implementation('io.github.lognet:grpc-spring-boot-starter:4.7.0') { exclude group: 'io.grpc', module: 'grpc-netty-shaded' }
Looking at the imported external libraries, the grpc version we are using is 1.45.1.
I tried setting the log level to paranoid through the code:
ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
Unfortunately no logs were shown, both when running locally and when deployed on Kubernetes.
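One possible reason no leak logs appear is ordering: ResourceLeakDetector reads its level when the class is initialized, so setting it programmatically after Netty has already started may have no effect, and leak reports are only emitted at ERROR level when a leaked buffer is garbage-collected. A minimal sketch of setting the property before anything Netty-related loads (the application bootstrap call is a hypothetical placeholder):

```java
public class Main {
    public static void main(String[] args) {
        // Set the property before any Netty class is loaded;
        // ResourceLeakDetector reads io.netty.leakDetection.level once,
        // at class-initialization time.
        System.setProperty("io.netty.leakDetection.level", "paranoid");

        // Leak reports are logged at ERROR level under the logger name
        // "io.netty.util.ResourceLeakDetector" -- make sure your logging
        // configuration does not filter that logger out.

        // SpringApplication.run(GrpcServerApplication.class, args); // hypothetical app start
    }
}
```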
Hmmm, could you try using a different type instead of the bytes type for object? Say string (and put your byte array in as a base64-encoded string) or repeated int32? That might indicate an issue with bytes.
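The suggested workaround, switching the field from bytes to string carrying a base64-encoded payload, could look like this on the client side (a sketch using only the JDK's Base64 codec; the helper names are assumptions, not the project's actual code):

```java
import java.util.Base64;

public class Base64PayloadDemo {
    // Encode the raw file bytes as a base64 string so the proto field
    // can be declared as string instead of bytes (hypothetical workaround).
    static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // The server side would decode the string back into the original bytes.
    static byte[] decode(String wire) {
        return Base64.getDecoder().decode(wire);
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3};
        String wire = encode(payload);
        System.out.println(wire);
        assert java.util.Arrays.equals(decode(wire), payload);
    }
}
```

Note the trade-off: base64 inflates the payload by ~33%, which matters for 200MB uploads, so this is only a diagnostic step, not a fix.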
Also you may want to report this issue in the netty repo since this seems to be a Netty's memory management issue.
@JonathanShifman have you tried using the latest versions of grpc-java and Netty to see if the problem went away? Have you contacted the Netty maintainers?
Seems like this is resolved as best we can, as it doesn't seem gRPC-specific. If not and there's something more we can help with, comment, and it can be reopened.
We have a gRPC server written in Java. The application is using SpringBoot version 2.5.7.
In an attempt to isolate the issue, I stripped the project down to contain only one unary gRPC endpoint that handles file uploads. The maximum size of an inbound message was increased to 400MB, and the files are being sent as a ByteString as part of the gRPC request.
Upon receiving the request, the server immediately returns a successful response to the client through the responseObserver object. The ByteString is discarded and is not written anywhere; we just send a response and return.
The application is deployed on Kubernetes (on 1 pod). We were using openjdk:17.0.2-jdk as a Docker image until it got deprecated, then migrated to amazoncorretto:17.0.7-al2023.
Since migrating to the Amazon image, we are observing a memory leak, specifically in the direct buffer (non-heap) memory. To recreate and isolate the issue, I wrote a method that repeatedly invokes the endpoint with a large file (200MB):
Below is a comparison of memory-related metrics between the two images. Other than the Docker image, the source is identical in both cases.
I should mention that we tried numerous other images that are recommended as alternatives to openjdk:17.0.2-jdk, like amazoncorretto:21.0.0-al2023, eclipse-temurin:latest, ibmjava:latest, ibm-semeru-runtimes:open-17.0.8.1_1-jdk. The memory leak was reproduced in the same fashion in all of them. OpenJDK is the only image where we do not observe this issue.
Using openjdk:17.0.2-jdk:
container_memory_usage_bytes
jvm_buffer_memory_used_bytes (direct memory bytes)
jvm_buffer_count_buffers (direct memory num of buffers)
jvm_memory_used_bytes (heap memory bytes)
Using amazoncorretto:17.0.7-al2023:
container_memory_usage_bytes
jvm_buffer_memory_used_bytes (direct memory bytes)
jvm_buffer_count_buffers (direct memory num of buffers)
jvm_memory_used_bytes (heap memory bytes)
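The jvm_buffer_* metrics above are sourced from the JVM's buffer-pool MXBeans, so they can be cross-checked from inside the process without the metrics pipeline. A minimal sketch:

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;
import java.util.List;

public class DirectBufferStats {
    // Returns the "direct" buffer pool bean, which backs the
    // jvm_buffer_count_buffers / jvm_buffer_memory_used_bytes metrics.
    static BufferPoolMXBean directPool() {
        List<BufferPoolMXBean> pools =
                ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class);
        return pools.stream()
                .filter(p -> p.getName().equals("direct"))
                .findFirst()
                .orElseThrow();
    }

    public static void main(String[] args) {
        // Allocate 1 MiB of direct memory so the pool is non-empty.
        ByteBuffer.allocateDirect(1 << 20);
        BufferPoolMXBean direct = directPool();
        System.out.printf("direct buffers: count=%d used=%d bytes%n",
                direct.getCount(), direct.getMemoryUsed());
    }
}
```

Logging these two values periodically alongside container_memory_usage_bytes would show whether the growth is entirely in the direct pool or whether native allocations outside the JVM's accounting are also involved.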
Any insight into what might be causing this would be appreciated, as well as suggestions for how I might further isolate the issue and find the root cause.