grpc / grpc-java

The Java gRPC implementation. HTTP/2 based RPC
https://grpc.io/docs/languages/java/
Apache License 2.0

gRPC Java server leaks memory on large requests since updating JDK Docker image #10589

Closed JonathanShifman closed 9 months ago

JonathanShifman commented 11 months ago

We have a gRPC server written in Java. The application is using SpringBoot version 2.5.7.

In an attempt to isolate the issue, I stripped the project down to contain only one unary gRPC endpoint that handles file uploads. The maximum size of an inbound message was increased to 400MB, and the files are being sent as a ByteString as part of the gRPC request:

// Upload an object with a unique ObjectIdentifier and corresponding object.
rpc UploadObject (UploadObjectRequest) returns (google.protobuf.Empty);

message UploadObjectRequest {
    ObjectIdentifier object_identifier = 1;
    bytes object = 2; // A byte array representing single object data.
}
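
For reference, the 400MB inbound limit mentioned above would be set on a plain grpc-java Netty server roughly as follows. This is a minimal sketch, not the project's actual setup (which is wired through grpc-spring-boot-starter and configures this differently); the port and service wiring here are assumptions:

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

public class UploadServer {
    public static void main(String[] args) throws Exception {
        Server server = NettyServerBuilder.forPort(6565) // port is an assumption
                // Raise the default 4MB limit so a 400MB UploadObjectRequest
                // is accepted in a single unary call.
                .maxInboundMessageSize(400 * 1024 * 1024)
                .addService(new MyEndpoint()) // the service implementation shown below
                .build()
                .start();
        server.awaitTermination();
    }
}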

Upon receiving the request, the server immediately returns a successful response to the client through the responseObserver object. The ByteString is discarded and is not written anywhere; we just send a response and return:

@Slf4j
@GRpcService
public class MyEndpoint extends MyServiceImplBase {
    @Override
    public void uploadObject(UploadObjectRequest request, StreamObserver<Empty> responseObserver) {
        responseObserver.onNext(Empty.getDefaultInstance());
        responseObserver.onCompleted();
    }
}

The application is deployed on Kubernetes (on 1 pod). We were using openjdk:17.0.2-jdk as a Docker image, until it got deprecated, then migrated to amazoncorretto:17.0.7-al2023.

Since migrating to the Amazon image, we are observing a memory leak, specifically in the direct buffer (non-heap) memory. To recreate and isolate the issue, I wrote a method that repeatedly invokes the endpoint with a large file (200MB):

@SneakyThrows
@Test
public void test() {
    while (true) {
        Channel channel = grpcClientChannelFactory.createGrpcChannel(host, port);
        MyServiceBlockingStub blockingStub = newBlockingStub(channel);

        final int numOfBytes = 200_000_000;
        byte[] bytes = new byte[numOfBytes];
        new Random().nextBytes(bytes);
        UploadObjectRequest uploadObjectRequest = UploadObjectRequest.newBuilder()
                .setObjectIdentifier(
                        ObjectIdentifier.newBuilder()
                                .setNamespace("namespace")
                                .setKey("testFile")
                )
                .setObject(ByteString.copyFrom(bytes))
                .build();

        blockingStub
                .withCallCredentials(new JwtCallCredential(JWT))
                .uploadObject(uploadObjectRequest);
        Thread.sleep(5000);
    }
}

Below is a comparison of memory-related metrics between the two images. Other than the Docker image, the source is identical in both cases.

I should mention that we tried numerous other images recommended as alternatives to openjdk:17.0.2-jdk, such as amazoncorretto:21.0.0-al2023, eclipse-temurin:latest, ibmjava:latest, and ibm-semeru-runtimes:open-17.0.8.1_1-jdk. The memory leak was reproduced in the same fashion in all of them. OpenJDK is the only image where we do not observe this issue.

Using openjdk:17.0.2-jdk:

container_memory_usage_bytes [chart]

jvm_buffer_memory_used_bytes (direct memory bytes) [chart]

jvm_buffer_count_buffers (direct memory, number of buffers) [chart]

jvm_memory_used_bytes (heap memory bytes) [chart]

Using amazoncorretto:17.0.7-al2023:

container_memory_usage_bytes [chart]

jvm_buffer_memory_used_bytes (direct memory bytes) [chart]

jvm_buffer_count_buffers (direct memory, number of buffers) [chart]

jvm_memory_used_bytes (heap memory bytes) [chart]
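
For anyone reproducing this without the metric exporter above, the values behind jvm_buffer_memory_used_bytes and jvm_buffer_count_buffers (these look like Micrometer-exported JVM metrics, which read the JVM's buffer pool MXBeans) can also be printed directly from inside the process; a minimal sketch:

import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;

public class DirectBufferStats {
    public static void main(String[] args) {
        // Prints count and bytes used for each buffer pool ("direct" and "mapped"),
        // which are the values the jvm_buffer_* metrics above are typically derived from.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            System.out.printf("%s: count=%d, memoryUsed=%d bytes, capacity=%d bytes%n",
                    pool.getName(), pool.getCount(), pool.getMemoryUsed(), pool.getTotalCapacity());
        }
    }
}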

Any insight into what might be causing this would be appreciated, as well as suggestions for how I might further isolate the issue and find the root cause.

sanjaypujare commented 11 months ago

Interesting! A few observations:

  • Are you using the latest gRPC Java release? Which gRPC version are you using?
  • Although heap memory shows some difference, I don't see leaks, and the difference is only about 2x.
  • The main difference is in the number of direct buffers (jvm_buffer_count_buffers): for some reason the buffers are not getting released. This also shows up in direct memory bytes.
  • The leak in direct memory seems to be the most obvious factor, but there might be other factors as well, since the direct memory growth (600M to 3G, an increase of ~2.5G) does not completely explain the container memory growth (3.5G to 8.5G, an increase of ~5G).

In any case, let's focus on the direct memory growth. gRPC uses Netty, which uses direct memory. You can use JVM properties to print logs about potential leaks, as described in https://netty.io/wiki/reference-counted-objects.html#troubleshooting-buffer-leaks. Could you use these to get some debug output and see if it points to anything? That would be the first step.
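
For concreteness, the leak detector referenced above can be enabled either with a JVM flag (-Dio.netty.leakDetection.level=paranoid) or programmatically; a minimal sketch of the programmatic form:

import io.netty.util.ResourceLeakDetector;

public class EnableLeakDetection {
    public static void main(String[] args) {
        // Equivalent to -Dio.netty.leakDetection.level=paranoid; should be set
        // before Netty allocates buffers, or the setting may not take full effect.
        ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);
    }
}

Note that Netty only reports a leak when a leaked buffer is garbage-collected, so reports can take a while to appear even at the PARANOID level.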

JonathanShifman commented 11 months ago

Hi @sanjaypujare, apologies for taking so long to reply, had to divert attention to other issues for a while.

We are using grpc-spring-boot-starter version 4.7.0, imported with Gradle like so:

implementation('io.github.lognet:grpc-spring-boot-starter:4.7.0') {
    exclude group: 'io.grpc', module: 'grpc-netty-shaded'
}

Looking at the imported external libraries, the grpc version we are using is 1.45.1.

I tried setting the leak detection level to PARANOID through the code: ResourceLeakDetector.setLevel(ResourceLeakDetector.Level.PARANOID);

Unfortunately, no leak logs were shown, either when running locally or when deployed on Kubernetes.

sanjaypujare commented 11 months ago

Hmmm, could you try using a different type instead of bytes for object? Say string (putting your byte array in as a base64-encoded string) or repeated int32? That might indicate an issue with bytes.

Also, you may want to report this issue in the Netty repo, since this seems to be a Netty memory management issue.
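
For illustration of the first suggestion, base64-encoding the 200MB test payload so it could travel in a string field (instead of bytes) might look like the minimal sketch below; the corresponding proto change is not shown, and the class name here is made up:

import java.util.Base64;
import java.util.Random;

public class Base64PayloadSketch {
    public static void main(String[] args) {
        // Same 200MB random payload as in the test above.
        byte[] bytes = new byte[200_000_000];
        new Random().nextBytes(bytes);

        // Base64-encode it so it could be carried in a `string` field instead of
        // `bytes`. Base64 inflates the payload by roughly a third, so the maximum
        // inbound message size would need to grow accordingly.
        String base64Payload = Base64.getEncoder().encodeToString(bytes);
        System.out.println("Encoded length: " + base64Payload.length());
    }
}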

larry-safran commented 10 months ago

@JonathanShifman have you tried using the latest versions of grpc-java and Netty to see if the problem went away? Have you contacted the Netty maintainers?

sergiitk commented 9 months ago

Seems like this is resolved as best we can, as it doesn't seem gRPC-specific. If not and there's something more we can help with, comment, and it can be reopened.