netty / netty

Netty project - an event-driven asynchronous network application framework
http://netty.io
Apache License 2.0

Memory allocation in a loop that leads to a crash #10286

Entea closed 4 years ago

Entea commented 4 years ago

Hello! Thanks everyone for maintaining such a great project!

We've been using netty (via grpc-java) and have seen multiple service failures. The failure pattern is always the same: under moderate traffic over only 1 TCP connection with 2 streams (via haproxy, but that's another story), the gRPC server starts filling up direct memory and old-gen, and the service crumbles within a few minutes.

The heap dump showed 20 GB of io.netty.channel.ChannelOutboundBuffer$Entry objects queued up, with a strange pattern: a 0-byte buffer (EmptyByteBuf) followed by a 9-byte PooledUnsafeDirectByteBuf, repeating over and over.

Looking through the source code, @sergey-ushakov observed that this could only happen if one of the buffers queued up in CoalescingBufferQueue were somehow altered without the change being reflected in the #readableBytes property. Under these circumstances, io.netty.handler.codec.http2.DefaultHttp2RemoteFlowController.FlowState#writeAllocatedBytes can go into an endless loop:

                while (!cancelled && (frame = peek()) != null) {
                    int maxBytes = min(allocated, writableWindow());
                    if (maxBytes <= 0 && frame.size() > 0) {
                        break;
                    }
                    writeOccurred = true;
                    int initialFrameSize = frame.size();
                    try {
                        frame.write(ctx, max(0, maxBytes));
                        // -> Here, frame size gets stuck with a positive number, that is never changed.
                        if (frame.size() == 0) {
                            // -> remove never happens!
                            pendingWriteQueue.remove();
                            frame.writeComplete();
                        }
                    } finally {
                        // -> here, allocated is never altered, because initialFrameSize == frame.size()
                        allocated -= initialFrameSize - frame.size();
                    }
                }
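
To make the failure mode concrete, here is a minimal, self-contained sketch of the same loop shape with one extra check that stops when a write makes no progress. It is a toy model, not Netty code: `Frame`, `HealthyFrame`, `StuckFrame` and the guard itself are hypothetical names invented for illustration, the flow-control window is ignored, and this is not necessarily how the issue was fixed upstream.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy model of the write loop; none of these types are Netty classes. */
final class NoProgressGuardSketch {

    /** Hypothetical stand-in for a flow-controlled frame. */
    interface Frame {
        int size();
        void write(int allowedBytes);
    }

    /** A well-behaved frame: writing shrinks its reported size. */
    static final class HealthyFrame implements Frame {
        private int remaining;
        HealthyFrame(int size) { remaining = size; }
        public int size() { return remaining; }
        public void write(int allowedBytes) { remaining -= Math.min(remaining, allowedBytes); }
    }

    /** A damaged frame: it reports a positive size that never shrinks. */
    static final class StuckFrame implements Frame {
        private final int reported;
        int writes; // counts how often the loop tried to write this frame
        StuckFrame(int size) { reported = size; }
        public int size() { return reported; }
        public void write(int allowedBytes) { writes++; }
    }

    /**
     * Same shape as the loop quoted in the issue, plus a guard: if a write
     * neither emptied the frame nor shrank it, fail instead of spinning.
     * Returns the number of bytes written in this call.
     */
    static int writeAllocatedBytes(Deque<Frame> queue, int allocated) {
        int written = 0;
        Frame frame;
        while ((frame = queue.peek()) != null) {
            int maxBytes = allocated; // assumption: the toy ignores writableWindow()
            if (maxBytes <= 0 && frame.size() > 0) {
                break;
            }
            int initialFrameSize = frame.size();
            frame.write(Math.max(0, maxBytes));
            int progressed = initialFrameSize - frame.size();
            if (frame.size() == 0) {
                queue.remove();
            } else if (progressed == 0) {
                // Without this branch the loop would repeat forever with the same state.
                throw new IllegalStateException(
                        "frame made no progress; size stuck at " + frame.size());
            }
            allocated -= progressed;
            written += progressed;
        }
        return written;
    }

    public static void main(String[] args) {
        Deque<Frame> queue = new ArrayDeque<>();
        queue.add(new StuckFrame(5));
        try {
            writeAllocatedBytes(queue, 100);
        } catch (IllegalStateException e) {
            System.out.println("loop aborted: " + e.getMessage());
        }
    }
}
```

In a real handler the reaction to a stuck frame would more plausibly be to fail the stream than to throw out of the loop; the exception here only keeps the sketch short.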

Expected behavior

The HTTP/2 stream should be closed if CoalescingBufferQueue is somehow damaged.

Actual behavior

DefaultHttp2RemoteFlowController.FlowState#writeAllocatedBytes gets stuck in an endless loop and consumes all available memory.

Steps to reproduce

The original cause is unknown (my guess is that haproxy somehow sent something wrong to the netty stack). The endless loop can be reproduced by manually incrementing CoalescingBufferQueue#readableBytes in a debugger.
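
The debugger experiment can be imitated without Netty. The sketch below is a toy queue that tracks a `readableBytes` counter separately from its contents, as described above; `DesyncedQueueSketch`, `corruptCounter` and `drain` are hypothetical names, and the iteration cap exists only so the demo terminates. It assumes that each write emits a frame header plus a payload; HTTP/2 frame headers are 9 octets long, which is consistent with the 9-byte buffers seen in the heap dump.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

/** Toy byte queue with a separately tracked counter; not Netty's CoalescingBufferQueue. */
final class DesyncedQueueSketch {
    static final int FRAME_HEADER_LENGTH = 9; // HTTP/2 frame header size in octets

    private final ArrayDeque<Byte> content = new ArrayDeque<>();
    private int readableBytes;

    void add(byte[] bytes) {
        for (byte b : bytes) {
            content.add(b);
        }
        readableBytes += bytes.length;
    }

    /** Removes up to maxBytes and decrements the counter by what was actually removed. */
    byte[] remove(int maxBytes) {
        int n = Math.min(Math.max(0, maxBytes), content.size());
        byte[] out = new byte[n];
        for (int i = 0; i < n; i++) {
            out[i] = content.remove();
        }
        readableBytes -= n;
        return out;
    }

    int readableBytes() {
        return readableBytes;
    }

    /** Imitates the debugger step: bump the counter without adding any data. */
    void corruptCounter(int extra) {
        readableBytes += extra;
    }

    /**
     * Writes "frames" while the counter claims data is pending, recording the size
     * of every emitted buffer. maxIterations bounds the otherwise endless loop.
     */
    static List<Integer> drain(DesyncedQueueSketch queue, int maxIterations) {
        List<Integer> emitted = new ArrayList<>();
        for (int i = 0; i < maxIterations && queue.readableBytes() > 0; i++) {
            byte[] payload = queue.remove(Integer.MAX_VALUE);
            emitted.add(FRAME_HEADER_LENGTH); // header buffer
            emitted.add(payload.length);      // payload buffer, empty once content runs out
        }
        return emitted;
    }

    public static void main(String[] args) {
        DesyncedQueueSketch queue = new DesyncedQueueSketch();
        queue.add(new byte[] {1, 2, 3});
        queue.corruptCounter(1);
        System.out.println(drain(queue, 4)); // prints [9, 3, 9, 0, 9, 0, 9, 0]
    }
}
```

After the first iteration the queue holds no data, yet the counter stays at 1, so every further iteration emits another 9-byte header and an empty payload. Without the iteration cap the list would grow until memory runs out, which matches the alternating 0-byte and 9-byte entries in the dump.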

Minimal yet complete reproducer code (or URL to code)

N/A

Netty version

4.1.x

JVM version (e.g. java -version)

java -version
openjdk version "11.0.6" 2020-01-14
OpenJDK Runtime Environment oracle (build 11.0.6)
OpenJDK 64-Bit Server VM oracle (build 11.0.6, mixed mode)

OS version (e.g. uname -a)

uname -a
Linux 4.9.0-8-amd64 #1 SMP Debian 4.9.110-3+deb9u6 (2018-10-08) x86_64 GNU/Linux

normanmaurer commented 4 years ago

@Entea PTAL https://github.com/netty/netty/pull/10294