Hello! Thanks everyone for maintaining such a great project!
We've been using netty (via grpc-java) and have seen multiple service failures. The failure pattern is always the same: under moderate traffic of only 1 TCP connection and 2 streams (via haproxy, but that's another story), the grpc server starts filling up direct memory and old gen, and the service crumbles within a few minutes.
The heap dump showed 20 GB of io.netty.channel.ChannelOutboundBuffer$Entry objects queued up, with a strange pattern: a 0-byte EmptyByteBuf followed by a 9-byte PooledUnsafeDirectByteBuf, repeating over and over.
Looking through the source code, @sergey-ushakov observed that this could only happen if one of the buffers queued up in CoalescingBufferQueue was somehow altered without the change being reflected in its #readableBytes property. Under these circumstances, io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController.FlowState#writeAllocatedBytes can go into an endless loop:
while (!cancelled && (frame = peek()) != null) {
    int maxBytes = min(allocated, writableWindow());
    if (maxBytes <= 0 && frame.size() > 0) {
        break;
    }
    writeOccurred = true;
    int initialFrameSize = frame.size();
    try {
        frame.write(ctx, max(0, maxBytes));
        // -> Here, frame.size() stays stuck at a positive value that never changes.
        if (frame.size() == 0) {
            // -> remove never happens!
            pendingWriteQueue.remove();
            frame.writeComplete();
        }
    } finally {
        // -> Here, allocated is never reduced, because initialFrameSize == frame.size().
        allocated -= initialFrameSize - frame.size();
    }
}
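For context, our reading of the 4.1.x sources (paraphrased from memory, please double-check) is that for DATA frames the FlowControlled in this loop is DefaultHttp2ConnectionEncoder.FlowControlledData, whose size() is derived from the CoalescingBufferQueue counter, roughly:

// Paraphrased, not a verbatim quote of the netty sources: the DATA frame's size()
// tracks the CoalescingBufferQueue's readableBytes counter, so a counter that
// over-reports keeps size() pinned above zero no matter how often write() is called.
@Override
public int size() {
    return dataSize + padding; // dataSize is refreshed from queue.readableBytes()
}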
Expected behavior
The HTTP/2 stream should be closed if the CoalescingBufferQueue is somehow damaged.
Actual behavior
DefaultHttp2RemoteFlowController.FlowState#writeAllocatedBytes gets stuck in an endless loop and consumes all available memory.
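Not a proposal for the actual fix, but to illustrate the kind of guard we would expect somewhere in this loop (a very rough sketch; it probably does not handle every legitimate case, e.g. padding-only DATA frames, and we have not verified how the surrounding error handling would react):

// Hypothetical guard (sketch only, not current netty code), placed right after
// frame.write(ctx, max(0, maxBytes)) in the loop quoted above: if a non-empty
// frame made no progress even though maxBytes allowed it, its size() can no
// longer be trusted, so give up instead of spinning forever. Presumably the
// surrounding error handling would then cancel the stream.
if (maxBytes > 0 && frame.size() > 0 && frame.size() == initialFrameSize) {
    throw new IllegalStateException(
            "flow-controlled frame made no write progress; queue state looks inconsistent");
}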
Steps to reproduce
The original cause is unknown (and I guess haproxy somehow sent something wrong to the netty stack).
The endless-loop issue itself can be reproduced by manually incrementing CoalescingBufferQueue#readableBytes in a debugger.
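For what it's worth, the corrupted state can also be staged outside a debugger. The following is only a rough sketch: it does not drive the HTTP/2 encoder (so it is not a reproducer of the loop itself), and it assumes the counter lives in a private readableBytes field somewhere in CoalescingBufferQueue's class hierarchy. It just shows the mismatch between readableBytes() and the actually queued bytes that, per the analysis above, keeps frame.size() from ever reaching 0:

import java.lang.reflect.Field;

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import io.netty.channel.CoalescingBufferQueue;
import io.netty.channel.embedded.EmbeddedChannel;

public class CorruptedReadableBytesSketch {

    public static void main(String[] args) throws Exception {
        EmbeddedChannel channel = new EmbeddedChannel();
        CoalescingBufferQueue queue = new CoalescingBufferQueue(channel);

        // Queue a small payload, as the HTTP/2 encoder does for a pending DATA frame.
        queue.add(Unpooled.wrappedBuffer(new byte[9]));

        // Simulate the suspected corruption: bump the internal counter without adding
        // a matching buffer (in our case this was done by hand in a debugger).
        // The field name/location is an assumption about the 4.1.x internals.
        Field counter = findField(queue.getClass(), "readableBytes");
        counter.setAccessible(true);
        counter.setInt(queue, counter.getInt(queue) + 1);

        // readableBytes() now over-reports: the flow controller would see a frame
        // size that no amount of writing can ever drain to 0.
        System.out.println("reported readableBytes = " + queue.readableBytes());   // 10
        ByteBuf drained = queue.remove(Integer.MAX_VALUE, channel.newPromise());
        System.out.println("actually queued bytes  = " + drained.readableBytes()); // 9
        drained.release();
        channel.finishAndReleaseAll();
    }

    // The field may be declared in a superclass (AbstractCoalescingBufferQueue), so walk up.
    private static Field findField(Class<?> clazz, String name) throws NoSuchFieldException {
        for (Class<?> c = clazz; c != null; c = c.getSuperclass()) {
            try {
                return c.getDeclaredField(name);
            } catch (NoSuchFieldException ignored) {
                // keep looking in the superclass
            }
        }
        throw new NoSuchFieldException(name);
    }
}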
Minimal yet complete reproducer code (or URL to code)
N/A
Netty version
4.1.x
JVM version (e.g. java -version)
java -version
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment oracle (build 11.0.6)
OpenJDK 64-Bit Server VM oracle (build 11.0.6, mixed mode)
OS version (e.g. uname -a)
uname -a
Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u6 (2018-10-08) x86_64 GNU/Linux