line / armeria

Your go-to microservice framework for any situation, from the creator of Netty et al. You can build any type of microservice leveraging your favorite technologies, including gRPC, Thrift, Kotlin, Retrofit, Reactive Streams, Spring Boot and Dropwizard.
https://armeria.dev
Apache License 2.0

gRPC performance issues arise when the request is large and complex. #5152

Open richieyan opened 1 year ago

richieyan commented 1 year ago

Background

We encountered this problem when switching our gRPC server from the raw (upstream) gRPC server to an Armeria server. The main problem is that the p95 response latency is much higher than with the upstream server.

In our case, when using the raw gRPC server, the request QPS was about 10K-12K and the p95 latency was around 100ms.

However, when we switched to the Armeria server, the p95 latency increased to around 300ms, despite maintaining the same QPS.

Each request is about 500 KB or larger. I tested the deserialization time of such a request and found it to be approximately 7 ms or more.
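
For reference, the deserialization cost can be measured in isolation with something like the sketch below, where MyRequest stands in for our actual, much more complex generated message and sample-request.bin is a captured payload of roughly this size:

    import java.nio.file.Files;
    import java.nio.file.Paths;

    public final class DeserializationTimer {

        // MyRequest is a placeholder for the real (much more complex) generated protobuf type.
        static double averageParseMillis(byte[] payload, int iterations) throws Exception {
            long totalNanos = 0;
            for (int i = 0; i < iterations; i++) {
                final long start = System.nanoTime();
                MyRequest.parseFrom(payload); // roughly the work marshaller.deserializeRequest() ends up doing
                totalNanos += System.nanoTime() - start;
            }
            return totalNanos / (double) iterations / 1_000_000.0;
        }

        public static void main(String[] args) throws Exception {
            // sample-request.bin is a captured ~500 KB request payload (placeholder path).
            final byte[] payload = Files.readAllBytes(Paths.get("sample-request.bin"));
            System.out.println("avg parse time: " + averageParseMillis(payload, 100) + " ms");
        }
    }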

Another noticeable point was that CPU usage was lower than with the raw gRPC server.

Suspected cause

At first, we thought the problem was caused by the large request data, so we tried using unsafeWrapRequestBuffers. However, the p95 latency did not improve.
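
For reference, a minimal sketch of how that option is enabled on the service builder (this assumes a typical GrpcService setup; HelloServiceImpl is a placeholder):

    import com.linecorp.armeria.server.Server;
    import com.linecorp.armeria.server.grpc.GrpcService;

    public final class UnsafeWrapExample {
        public static void main(String[] args) {
            // Sketch: build the gRPC service with unsafeWrapRequestBuffers enabled so that
            // request messages wrap the original request buffers instead of copying them.
            // HelloServiceImpl is a placeholder for the real service implementation.
            final Server server = Server.builder()
                    .http(8080)
                    .service(GrpcService.builder()
                            .addService(new HelloServiceImpl())
                            .unsafeWrapRequestBuffers(true)
                            .build())
                    .build();
            server.start().join();
        }
    }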

After extensively examining the code, I decided to investigate the deserialization process. During my investigation, I discovered differences in how the raw gRPC server and Armeria handle deserialization.

In the raw gRPC server, deserialization is performed on the blocking executor thread.

// io.grpc.internal.ServerImpl.JumpToApplicationThreadServerStreamListener#messagesAvailable
public void messagesAvailable(final MessageProducer producer) {
  final class MessagesAvailable extends ContextRunnable {
    MessagesAvailable() { super(context); }
    public void runInContext() {
      try {
        // deserializes the request in this method
        getListener().messagesAvailable(producer);
      } catch (Throwable t) {
        internalClose(t);
        throw t;
      }
    }
  }
  // blocking executor (not the event loop)
  callExecutor.execute(new MessagesAvailable());
}

In the Armeria gRPC server, deserialization occurs on the IO (event loop) thread.

// com.linecorp.armeria.internal.server.grpc.AbstractServerCall#onRequestMessage

// the current thread is an IO (event loop) thread
final boolean grpcWebText = GrpcSerializationFormats.isGrpcWebText(serializationFormat);
request = marshaller.deserializeRequest(message, grpcWebText);
maybeLogRequestContent(request);

if (unsafeWrapRequestBuffers && buf != null && !grpcWebText) {
    GrpcUnsafeBufferUtil.storeBuffer(buf, request, ctx);
}

// switch to blockingExecutor
if (blockingExecutor != null) {
    blockingExecutor.execute(() -> invokeOnMessage(request, endOfStream));
} else {
    invokeOnMessage(request, endOfStream);
}

I guess the main problem is the time it takes for marshaller.deserializeRequest(message, grpcWebText) on the IO thread. If the deserialization process takes too long, IO tasks may experience significant delays.

Another point is that the Armeria gRPC server switches threads more frequently than raw gRPC. In Armeria, the request process involves IO, blocking, IO, blocking (for request processing), and then IO (for response). However, in the raw gRPC server, the process involves IO, blocking (for request processing), and then IO (for response). I'm not sure how the extra thread switches affect how long the tasks spend pending.
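
For context, the thread hops described above assume the gRPC service dispatches user callbacks to Armeria's blocking task executor; a minimal sketch of that configuration (MyGrpcServiceImpl is a placeholder):

    import com.linecorp.armeria.server.grpc.GrpcService;

    final class BlockingExecutorConfig {
        // Sketch: run user callbacks on the blocking task executor rather than on
        // the event loop, which is the setup the thread hops above assume.
        static GrpcService blockingGrpcService() {
            return GrpcService.builder()
                    .addService(new MyGrpcServiceImpl()) // placeholder implementation
                    .useBlockingTaskExecutor(true)       // dispatch user code to the blocking executor
                    .build();
        }
    }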

trustin commented 1 year ago

Thanks a lot for your valuable input, @richieyan. I agree that more frequent switches between threads can be a problem. Did you have a chance to modify Armeria to do (de)serialization in the same thread that calls the user code?

jrhee17 commented 1 year ago

I guess the main problem is the time it takes for marshaller.deserializeRequest(message, grpcWebText) on the IO thread. If the deserialization process takes too long, IO tasks may experience significant delays.

I think this is a valid hypothesis, although we probably want to run some benchmarks to confirm that this is indeed an issue 👍

In Armeria, the request process involves IO, blocking, IO, blocking (for request processing), and then IO (for response). However, in the raw gRPC server, the process involves IO, blocking (for request processing), and then IO (for response).

I would actually expect Armeria to have the same pattern (i.e. IO -> blocking -> IO); up until right before invokeOnMessage in your code snippet, the request is processed on an event loop. Did you observe otherwise?

ikhoon commented 1 year ago

It would help us investigate the performance issue if you could provide flame graphs for the Armeria and upstream gRPC servers.

richieyan commented 1 year ago

I guess the main problem is the time it takes for marshaller.deserializeRequest(message, grpcWebText) on the IO thread. If the deserialization process takes too long, IO tasks may experience significant delays.

I think this is a valid hypothesis, although we probably want to run some benchmarks to confirm that this is indeed an issue 👍

In Armeria, the request process involves IO, blocking, IO, blocking (for request processing), and then IO (for response). However, in the raw gRPC server, the process involves IO, blocking (for request processing), and then IO (for response).

I would actually expect Armeria to have the same pattern (i.e. IO -> blocking -> IO); up until right before invokeOnMessage in your code snippet, the request is processed on an event loop. Did you observe otherwise?

  1. IO -> blocking

    // com.linecorp.armeria.server.grpc.FramedGrpcService#startCall
    // EventLoop switches to the blockingExecutor
    if (blockingExecutor != null) {
        blockingExecutor.execute(() -> startCall(methodDef, ctx, req, methodDescriptor, call));
    } else {
        try (SafeCloseable ignored = ctx.push()) {
            startCall(methodDef, ctx, req, methodDescriptor, call);
        }
    }

  2. blocking -> IO

    // com.linecorp.armeria.internal.server.grpc.AbstractServerCall#startDeframing
    // blockingExecutor switches back to the EventLoop
    req.aggregate(AggregationOptions.usePooledObjects(ctx.alloc(), ctx.eventLoop()))
            .handle((aggregatedHttpRequest, cause) -> {
                try {
                    onRequestMessage(requestDeframer.deframe(aggregatedHttpRequest.content()), true);
                } catch (Exception ex) {
                    onError(ex);
                }
                return null;
            });

  3. IO -> blocking

    // com.linecorp.armeria.internal.server.grpc.AbstractServerCall#onRequestMessage
    // the current thread is an IO (event loop) thread
    final boolean grpcWebText = GrpcSerializationFormats.isGrpcWebText(serializationFormat);
    request = marshaller.deserializeRequest(message, grpcWebText);
    maybeLogRequestContent(request);

    if (unsafeWrapRequestBuffers && buf != null && !grpcWebText) {
        GrpcUnsafeBufferUtil.storeBuffer(buf, request, ctx);
    }

    // switch to the blockingExecutor
    if (blockingExecutor != null) {
        blockingExecutor.execute(() -> invokeOnMessage(request, endOfStream));
    } else {
        invokeOnMessage(request, endOfStream);
    }

  4. blocking -> IO (send the response); code omitted.

@jrhee17 These are the code snippets showing the thread switches.

richieyan commented 1 year ago

It would help us investigate the performance issue if you could provide flame graphs for the Armeria and upstream gRPC servers.

I couldn't find a better way to do this. In my current work, I have implemented a PerfLog context to record the time taken for each step of the process in order to analyze the performance.

https://gist.github.com/richieyan/4a9381837561f761a017b0a72bf23589

Also, the gRPC proto structure is too complex to be safely extracted and published here.
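
The rough shape of such a per-step timer is something like the sketch below; the names are illustrative and are not taken from the gist above:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    // Minimal sketch of a per-request step timer in the spirit of the PerfLog
    // context described above. Names are illustrative only.
    public final class PerfLog {
        private final long startNanos = System.nanoTime();
        private final List<String> steps = new ArrayList<>();
        private long lastNanos = startNanos;

        /** Records the elapsed time since the previous step under the given label. */
        public synchronized void step(String label) {
            final long now = System.nanoTime();
            steps.add(label + '=' + TimeUnit.NANOSECONDS.toMillis(now - lastNanos) + "ms");
            lastNanos = now;
        }

        /** Returns a single log line with the total and per-step timings. */
        public synchronized String summary() {
            final long totalMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startNanos);
            return "total=" + totalMs + "ms " + String.join(" ", steps);
        }
    }

The idea is to create one instance per request, call step() after each phase (deframing, deserialization, user code, response write) and log summary() when the response completes.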

richieyan commented 1 year ago

Thanks a lot for your valuable input, @richieyan. I agree that more frequent switches between threads can be a problem. Did you have a chance to modify Armeria to do (de)serialization in the same thread that calls the user code?

I tried to fix this by moving the block into a deserializeMessage method and calling it from the blocking executor. I haven't tested it yet.

    // Deserializes the request on the calling thread (the blocking executor below) instead of the event loop.
    private I deserializeMessage(DeframedMessage message) {
        final ByteBuf buf = message.buf();
        final boolean grpcWebText = GrpcSerializationFormats.isGrpcWebText(serializationFormat);
        I request = null;
        try {
            request = marshaller.deserializeRequest(message, grpcWebText);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        maybeLogRequestContent(request);

        if (unsafeWrapRequestBuffers && buf != null && !grpcWebText) {
            GrpcUnsafeBufferUtil.storeBuffer(buf, request, ctx);
        }
        return request;
    }

Call it from the blocking executor:

            if (blockingExecutor != null) {
                blockingExecutor.execute(() -> {
                    invokeOnMessage(deserializeMessage(message), endOfStream);
                });
            } else {
                invokeOnMessage(deserializeMessage(message), endOfStream);
            }

jrhee17 commented 1 year ago

@jrhee17 code snippet about the thread switches

Thanks for sharing 👍 Looks like you're right about the onMessage path.

From a quick look, I think this may also be fixable by reducing the scope of blockingTaskExecutor calls to include only ServerCallListener.Listener callbacks instead of calling startCall() from the start.
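
To illustrate the idea (a sketch, not Armeria's actual code): startCall() would stay on the event loop, and only the ServerCall.Listener callbacks would be dispatched to the blocking executor, e.g. via a delegating listener along these lines:

    import java.util.concurrent.Executor;

    import io.grpc.ServerCall;

    // Illustration only: a listener wrapper that offloads each ServerCall.Listener
    // callback to a blocking executor, so that only user-facing callbacks leave the
    // event loop while startCall() itself does not have to be rescheduled.
    final class BlockingListener<ReqT> extends ServerCall.Listener<ReqT> {
        private final ServerCall.Listener<ReqT> delegate;
        private final Executor blockingExecutor;

        BlockingListener(ServerCall.Listener<ReqT> delegate, Executor blockingExecutor) {
            this.delegate = delegate;
            this.blockingExecutor = blockingExecutor;
        }

        @Override
        public void onMessage(ReqT message) {
            blockingExecutor.execute(() -> delegate.onMessage(message));
        }

        @Override
        public void onHalfClose() {
            blockingExecutor.execute(delegate::onHalfClose);
        }

        @Override
        public void onCancel() {
            blockingExecutor.execute(delegate::onCancel);
        }

        @Override
        public void onComplete() {
            blockingExecutor.execute(delegate::onComplete);
        }

        @Override
        public void onReady() {
            blockingExecutor.execute(delegate::onReady);
        }
    }

Note that a plain executor could reorder or overlap the callbacks, so in practice the dispatching executor would have to serialize them, which is the synchronization concern discussed below.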

jrhee17 commented 1 year ago

Try to fix this by moving the block to a method deserializeMessage then call it in blocking Executor.

Looks like how I imagined it 👍

jrhee17 commented 1 year ago

From a quick look, I think this may also be fixable by reducing the scope of blockingTaskExecutor calls to include only ServerCallListener.Listener callbacks instead of calling startCall() from the start.

Correct me if I'm wrong, but it seems like upstream also reschedules threads this way. I assume this is because some degree of synchronization is required between 1) invoking the service method and 2) requesting more data.

This is from static analysis though, so I may be wrong. I'll also try doing a run this weekend (along with benchmark runs for deserializing from the blocking thread).

  1. Called from a blocking executor
     https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/core/src/main/java/io/grpc/internal/ServerImpl.java#L613
     https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/core/src/main/java/io/grpc/internal/ServerImpl.java#L700-L701

  2. Requests more data from upstream
     https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/stub/src/main/java/io/grpc/stub/ServerCalls.java#L134

  3. Rescheduled to the IO stream
     https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/core/src/main/java/io/grpc/internal/AbstractStream.java#L240

richieyan commented 1 year ago

From a quick look, I think this may also be fixable by reducing the scope of blockingTaskExecutor calls to include only ServerCallListener.Listener callbacks instead of calling startCall() from the start.

Correct me if I'm wrong, but it seems like upstream also reschedules threads this way. I assume this is because some degree of synchronization is required between 1) invoking the service method and 2) requesting more data.

This is from static analysis though, so I may be wrong. I'll also try doing a run this weekend (along with benchmark runs for deserializing from the blocking thread).

  1. Called from a blocking executor
     https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/core/src/main/java/io/grpc/internal/ServerImpl.java#L613
     https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/core/src/main/java/io/grpc/internal/ServerImpl.java#L700-L701

  2. Requests more data from upstream
     https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/stub/src/main/java/io/grpc/stub/ServerCalls.java#L134

  3. Rescheduled to the IO stream
     https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/core/src/main/java/io/grpc/internal/AbstractStream.java#L240

In the upstream gRPC server, request processing is split into HTTP/2 header processing and HTTP/2 body processing. It's true that for the header phase, processing goes from IO to blocking once the HTTP/2 headers have been read: the blocking executor enqueues MethodLookup and HandleServerCall tasks for the subsequent body processing.

// io.grpc.internal.ServerImpl.ServerTransportListenerImpl#streamCreatedInternal
      wrappedExecutor.execute(new MethodLookup());
      wrappedExecutor.execute(new HandleServerCall());
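
If I'm reading the upstream code correctly, wrappedExecutor here is a serializing wrapper around the application executor, so MethodLookup, HandleServerCall and the later listener callbacks still run one at a time in submission order even though they may hop threads. A minimal sketch of that pattern (my own illustration, not grpc-java's actual class):

    import java.util.ArrayDeque;
    import java.util.Queue;
    import java.util.concurrent.Executor;

    // Minimal sketch: runs submitted tasks one at a time, in submission order, on a
    // delegate executor. This is the property that keeps MethodLookup, HandleServerCall
    // and the listener callbacks ordered even when they run on different threads.
    final class SerializingExecutorSketch implements Executor {
        private final Executor delegate;
        private final Queue<Runnable> queue = new ArrayDeque<>();
        private boolean running;

        SerializingExecutorSketch(Executor delegate) {
            this.delegate = delegate;
        }

        @Override
        public synchronized void execute(Runnable task) {
            queue.add(task);
            if (!running) {
                running = true;
                delegate.execute(this::drain);
            }
        }

        private void drain() {
            while (true) {
                final Runnable next;
                synchronized (this) {
                    next = queue.poll();
                    if (next == null) {
                        running = false;
                        return;
                    }
                }
                next.run();
            }
        }
    }
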
richieyan commented 1 year ago

Thanks a lot for your valuable input, @richieyan. I agree that more frequent switches between threads can be a problem. Did you have a chance to modify Armeria to do (de)serialization in the same thread that calls the user code?

I tried to fix this by moving the block into a deserializeMessage method and calling it from the blocking executor. I haven't tested it yet.

    // Deserializes the request on the calling thread (the blocking executor below) instead of the event loop.
    private I deserializeMessage(DeframedMessage message) {
        final ByteBuf buf = message.buf();
        final boolean grpcWebText = GrpcSerializationFormats.isGrpcWebText(serializationFormat);
        I request = null;
        try {
            request = marshaller.deserializeRequest(message, grpcWebText);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        maybeLogRequestContent(request);

        if (unsafeWrapRequestBuffers && buf != null && !grpcWebText) {
            GrpcUnsafeBufferUtil.storeBuffer(buf, request, ctx);
        }
        return request;
    }

Call it from the blocking executor:

            if (blockingExecutor != null) {
                blockingExecutor.execute(() -> {
                    invokeOnMessage(deserializeMessage(message), endOfStream);
                });
            } else {
                invokeOnMessage(deserializeMessage(message), endOfStream);
            }

After applying this patch, the issue was resolved in our online service, and the p95 latency no longer increases abnormally.

jrhee17 commented 1 year ago

After applying this patch, the issue was resolved in our online service, and the p95 latency no longer increases abnormally.

Interesting, I haven't had a chance to run benchmarks yet. Let me do so right now.