richieyan opened this issue 1 year ago
Thanks a lot for your valuable input, @richieyan. I agree that more frequent switches between threads can be a problem. Did you have a chance to modify Armeria to do (de)serialization in the same thread that calls the user code?
I guess the main problem is the time it takes for marshaller.deserializeRequest(message, grpcWebText) in the IO thread. If the deserialization process takes too long, the IO task may experience significant delays.
I think this is a valid hypothesis, although I think we probably want to do some benchmarks to confirm that this is indeed an issue.
In Armeria, the request process involves IO, blocking, IO, blocking (for request processing), and then IO (for response). However, in the raw gRPC server, the process involves IO, blocking (for request processing), and then IO (for response).
I would actually expect Armeria to have the same pattern (i.e. IO -> blocking -> IO); up to right before invokeOnMessage in your code snippet, the request would be run in an event loop. Did you observe otherwise?
It would be helpful to investigate the performance issue if you could provide flame graphs for the Armeria and upstream gRPC servers.
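(To make the hypothesis concrete, here is a minimal standalone sketch, assuming Netty on the classpath; this is not Armeria code. A single ~7 ms task running directly on an event loop, like the deserialization of a 500KB request, delays every task queued behind it, and at thousands of requests per second those delays surface as tail latency.)

import io.netty.channel.DefaultEventLoop;
import io.netty.channel.EventLoop;
import java.util.concurrent.TimeUnit;

public class EventLoopDelayDemo {
    public static void main(String[] args) throws Exception {
        EventLoop eventLoop = new DefaultEventLoop();
        // Simulate a ~7 ms deserialization running directly on the IO thread.
        eventLoop.execute(() -> sleepUninterruptibly(7));
        // Any task queued behind it (e.g. another request's read) must wait.
        long queuedAt = System.nanoTime();
        eventLoop.submit(() -> {
            long delayMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - queuedAt);
            System.out.println("queued IO task was delayed by ~" + delayMs + " ms");
        }).sync();
        eventLoop.shutdownGracefully();
    }

    private static void sleepUninterruptibly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}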
@jrhee17 here is a code snippet showing the thread switches:

1. IO -> blocking: the EventLoop hands the call off to the blockingExecutor.

// com.linecorp.armeria.server.grpc.FramedGrpcService#startCall
// EventLoop switches to blockingExecutor
if (blockingExecutor != null) {
    blockingExecutor.execute(() -> startCall(methodDef, ctx, req, methodDescriptor, call));
} else {
    try (SafeCloseable ignored = ctx.push()) {
        startCall(methodDef, ctx, req, methodDescriptor, call);
    }
}

2. blocking -> IO: request aggregation completes back on the EventLoop.

// com.linecorp.armeria.internal.server.grpc.AbstractServerCall#startDeframing
// blockingExecutor switches back to the EventLoop
req.aggregate(AggregationOptions.usePooledObjects(ctx.alloc(), ctx.eventLoop()))
        .handle((aggregatedHttpRequest, cause) -> {
            try {
                onRequestMessage(requestDeframer.deframe(aggregatedHttpRequest.content()), true);
            } catch (Exception ex) {
                onError(ex);
            }
            return null;
        });

3. IO -> blocking: the request is deserialized on the IO thread, then handed to the blockingExecutor to invoke the user code.

// com.linecorp.armeria.internal.server.grpc.AbstractServerCall#onRequestMessage
// The current thread is an IO thread.
final boolean grpcWebText = GrpcSerializationFormats.isGrpcWebText(serializationFormat);
request = marshaller.deserializeRequest(message, grpcWebText);
maybeLogRequestContent(request);
if (unsafeWrapRequestBuffers && buf != null && !grpcWebText) {
    GrpcUnsafeBufferUtil.storeBuffer(buf, request, ctx);
}
// switch to blockingExecutor
if (blockingExecutor != null) {
    blockingExecutor.execute(() -> invokeOnMessage(request, endOfStream));
} else {
    invokeOnMessage(request, endOfStream);
}

4. blocking -> IO (send the response); code omitted.
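(For scale, here is a toy measurement of what an executor hand-off costs. This is plain JDK code, not Armeria internals, and the absolute numbers depend heavily on the machine and on contention; it only shows that every extra IO -> blocking -> IO round trip adds queueing latency.)

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class HandoffCostDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService io = Executors.newSingleThreadExecutor();
        ExecutorService blocking = Executors.newSingleThreadExecutor();
        int rounds = 100_000;
        CountDownLatch done = new CountDownLatch(1);
        long start = System.nanoTime();
        bounce(io, blocking, rounds, done);
        done.await();
        long perHopNanos = (System.nanoTime() - start) / (rounds * 2L);
        System.out.println("avg cost per executor hop: ~" + perHopNanos + " ns");
        io.shutdown();
        blocking.shutdown();
    }

    // Each round performs two hand-offs: onto executor a, then onto executor b,
    // mimicking the extra switches in the walkthrough above.
    static void bounce(ExecutorService a, ExecutorService b, int remaining, CountDownLatch done) {
        if (remaining == 0) {
            done.countDown();
            return;
        }
        a.execute(() -> b.execute(() -> bounce(a, b, remaining - 1, done)));
    }
}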
It would be helpful to investigate the performance issue if you could provide flame graphs for the Armeria and upstream gRPC servers.
I couldn't find a better way to do this. In my current work, I implemented a PerfLog context to record the time taken by each step of the process in order to analyze the performance:
https://gist.github.com/richieyan/4a9381837561f761a017b0a72bf23589
Also, the gRPC proto structure is too complex to be safely extracted and published here.
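(For readers, a minimal sketch of the idea behind such a PerfLog context; the names are illustrative, and the actual implementation is in the gist above.)

import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; see the gist for the real implementation.
public final class PerfLog {
    private final Map<String, Long> stepNanos = new LinkedHashMap<>();
    private long lastMark = System.nanoTime();

    // Records the time elapsed since the previous mark under the given step name.
    public void mark(String step) {
        long now = System.nanoTime();
        stepNanos.put(step, now - lastMark);
        lastMark = now;
    }

    // Prints the per-step timings, e.g. deframe / deserialize / invoke.
    public void dump() {
        stepNanos.forEach((step, nanos) ->
                System.out.printf("%-16s %8.3f ms%n", step, nanos / 1_000_000.0));
    }
}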
Thanks a lot for your valuable input, @richieyan. I agree that more frequent switches between threads can be a problem. Did you have a chance to modify Armeria to do (de)serialization in the same thread that calls the user code?
I tried to fix this by moving the block into a deserializeMessage method and then calling it in the blocking executor. I haven't tested it yet, though.
private I deserializeMessage(DeframedMessage message) {
    final ByteBuf buf = message.buf();
    final boolean grpcWebText = GrpcSerializationFormats.isGrpcWebText(serializationFormat);
    I request = null;
    try {
        request = marshaller.deserializeRequest(message, grpcWebText);
    } catch (IOException e) {
        throw new RuntimeException(e);
    }
    maybeLogRequestContent(request);
    if (unsafeWrapRequestBuffers && buf != null && !grpcWebText) {
        GrpcUnsafeBufferUtil.storeBuffer(buf, request, ctx);
    }
    return request;
}
Call it in the blocking executor:
if (blockingExecutor != null) {
    blockingExecutor.execute(() -> invokeOnMessage(deserializeMessage(message), endOfStream));
} else {
    invokeOnMessage(deserializeMessage(message), endOfStream);
}
@jrhee17 here is a code snippet showing the thread switches:
Thanks for the share! Looks like you're right for the onMessage path.
From a quick look, I think this may also be fixable by reducing the scope of blockingTaskExecutor calls to include only ServerCall.Listener callbacks instead of calling startCall() from the start.
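(A hypothetical sketch of that idea, not actual Armeria internals: only the ServerCall.Listener callbacks hop to the blocking executor, while deframing and method lookup stay on the event loop. The name is illustrative, and a real version would also have to keep the callbacks ordered, e.g. by running them through a serializing executor.)

import io.grpc.ServerCall;
import java.util.concurrent.Executor;

// Hypothetical forwarding listener: every user-facing callback is rescheduled
// onto the blocking executor instead of wrapping the whole startCall() in it.
// Ordering across callbacks is deliberately not handled here.
final class BlockingCallbackListener<ReqT> extends ServerCall.Listener<ReqT> {
    private final ServerCall.Listener<ReqT> delegate;
    private final Executor blockingExecutor;

    BlockingCallbackListener(ServerCall.Listener<ReqT> delegate, Executor blockingExecutor) {
        this.delegate = delegate;
        this.blockingExecutor = blockingExecutor;
    }

    @Override
    public void onMessage(ReqT message) {
        blockingExecutor.execute(() -> delegate.onMessage(message));
    }

    @Override
    public void onHalfClose() {
        blockingExecutor.execute(delegate::onHalfClose);
    }

    @Override
    public void onCancel() {
        blockingExecutor.execute(delegate::onCancel);
    }

    @Override
    public void onComplete() {
        blockingExecutor.execute(delegate::onComplete);
    }
}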
I tried to fix this by moving the block into a deserializeMessage method and then calling it in the blocking executor.
Looks like how I imagined it!
From a quick look, I think this may also be fixable by reducing the scope of blockingTaskExecutor calls to include only ServerCall.Listener callbacks instead of calling startCall() from the start.
Correct me if I'm wrong, but it seems like upstream also reschedules threads this way. I assume this is because some degree of synchronization is required between 1) invoking the service method and 2) requesting more data (a sketch of this pattern follows the list below). This is from static analysis though, so I may be wrong. I'll also try doing a run this weekend (along with benchmark runs for serializing from the blocking thread).
- Called from a blocking executor: https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/core/src/main/java/io/grpc/internal/ServerImpl.java#L613 https://github.com/grpc/grpc-java/blob/d15daed5e471fd3cfe5157568c88a0c8b8594dc5/core/src/main/java/io/grpc/internal/ServerImpl.java#L700-L701
- Requests for data upstream
- Rescheduled to the IO stream
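(The synchronization upstream relies on is the "serializing executor" pattern; grpc-java has io.grpc.internal.SerializingExecutor for this. Below is a toy re-implementation of the idea, not grpc-java's actual class: tasks may be submitted from any thread, but they run one at a time, in submission order, on a delegate executor.)

import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.Executor;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy serializing executor: at most one drain loop runs at a time, so the
// queued tasks never run concurrently, yet no user-visible lock is held.
public final class SerializingExecutorSketch implements Executor {
    private final Executor delegate;
    private final Queue<Runnable> queue = new ConcurrentLinkedQueue<>();
    private final AtomicBoolean running = new AtomicBoolean();

    public SerializingExecutorSketch(Executor delegate) {
        this.delegate = delegate;
    }

    @Override
    public void execute(Runnable task) {
        queue.add(task);
        schedule();
    }

    private void schedule() {
        if (running.compareAndSet(false, true)) {
            delegate.execute(() -> {
                try {
                    Runnable next;
                    while ((next = queue.poll()) != null) {
                        next.run(); // tasks are assumed not to throw
                    }
                } finally {
                    running.set(false);
                    // A task may have arrived after the last poll() but before
                    // the flag was reset; re-schedule the drain if so.
                    if (!queue.isEmpty()) {
                        schedule();
                    }
                }
            });
        }
    }
}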
In the gRPC server, request processing is split into an HTTP/2 header phase and an HTTP/2 body phase.
It's true that in the header phase the processing goes from IO to blocking once the HTTP/2 headers have been read. The blocking executor then queues MethodLookup and HandleServerCall for the subsequent body phase.
// io.grpc.internal.ServerImpl.ServerTransportListenerImpl#streamCreatedInternal
wrappedExecutor.execute(new MethodLookup());
wrappedExecutor.execute(new HandleServerCall());
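(Illustration only: with a serializing wrapper like the sketch above, the two tasks queued here run strictly one after the other even when the delegate executor is a multi-threaded pool, which is exactly the ordering needed between method lookup and call handling.)

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderingDemo {
    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        // Reuses the SerializingExecutorSketch defined in the earlier sketch.
        Executor wrappedExecutor = new SerializingExecutorSketch(pool);
        CountDownLatch done = new CountDownLatch(1);
        // Queued back to back, but guaranteed to run in order:
        wrappedExecutor.execute(() -> System.out.println("1: method lookup"));
        wrappedExecutor.execute(() -> {
            System.out.println("2: handle server call");
            done.countDown();
        });
        done.await();
        pool.shutdown();
    }
}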
After applying this patch, the issue has been resolved in our online service, and the p95 value is no longer increasing abnormally.
Interesting, I didn't get a chance to run benchmarks yet. Let me do so right now
Background
We encountered this problem when we wanted to switch our gRPC server from a raw gRPC server to an Armeria server. The main problem was that the p95 response latency was much higher than that of the official server.
In our case, with the raw gRPC server, the request QPS was about 10K-12K and the p95 latency was around 100ms.
However, when we switched to the Armeria server, the p95 latency increased to around 300ms at the same QPS.
Requests were about 500KB or larger, and I measured the deserialization time of such a request to be approximately 7ms or more.
Another noticeable point was that CPU usage was lower than with the raw gRPC server.
Guess at the reason
At first, we thought the problem was caused by the large request data, so we tried using unsafeWrapRequestBuffers. However, the p95 was not reduced.
After extensively examining the code, I decided to investigate the deserialization process. During my investigation, I discovered a difference between the deserialization paths of raw gRPC and Armeria:
In the raw gRPC server, deserialization is processed in the blocking executor thread.
In the Armeria gRPC server, deserialization occurs in the IO (EventLoop) thread.
I guess the main problem is the time it takes for marshaller.deserializeRequest(message, grpcWebText) in the IO thread. If the deserialization process takes too long, the IO task may experience significant delays.
Another point is that the Armeria gRPC server switches threads more frequently than raw gRPC. In Armeria, the request process involves IO, blocking, IO, blocking (for request processing), and then IO (for response). However, in the raw gRPC server, the process involves IO, blocking (for request processing), and then IO (for response). I'm not sure how the thread switches will affect the amount of pending time for the task.
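(One way to verify where deserialization actually runs, as a hypothetical sketch built on gRPC's public MethodDescriptor.Marshaller API: wrap the request marshaller so it logs the thread it parses on and how long parsing took. ThreadLoggingMarshaller is an illustrative name, not part of either library.)

import io.grpc.MethodDescriptor;
import java.io.InputStream;

// Hypothetical wrapper: logs which thread runs deserialization and its duration.
final class ThreadLoggingMarshaller<T> implements MethodDescriptor.Marshaller<T> {
    private final MethodDescriptor.Marshaller<T> delegate;

    ThreadLoggingMarshaller(MethodDescriptor.Marshaller<T> delegate) {
        this.delegate = delegate;
    }

    @Override
    public T parse(InputStream stream) {
        long start = System.nanoTime();
        T value = delegate.parse(stream);
        System.out.printf("deserialized on %s in %.2f ms%n",
                Thread.currentThread().getName(),
                (System.nanoTime() - start) / 1_000_000.0);
        return value;
    }

    @Override
    public InputStream stream(T value) {
        return delegate.stream(value);
    }
}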