mp911de / logstash-gelf

Graylog Extended Log Format (GELF) implementation in Java for all major logging frameworks: log4j, log4j2, java.util.logging, logback, JBossAS7 and WildFly 8-12
http://logging.paluch.biz
MIT License

GelfTCPSender#write burns CPU in case of a slow graylog server #248

Closed netudima closed 4 years ago

netudima commented 4 years ago

logstash-gelf version: 1.13.0 OS: Linux, 4.17.2-1.el7.elrepo.x86_64

If the server side of a GELF TCP connection is slow for some reason (for example, because it is overloaded by other applications), the written data fills the TCP buffers on both sides of the connection. TCP back-pressure kicks in and the sender can no longer push data into its socket. With a non-blocking NIO socket, channel().write() then returns 0.

GelfTCPSender#write has the following logic:

protected void write(ByteBuffer buffer) throws IOException {

    while (buffer.hasRemaining()) {
        int written = channel().write(buffer);

        if (written < 0 || !isConnected()) {
            // indicator the socket was closed
            Closer.close(channel());
            throw new SocketException("Cannot write buffer to channel");
        }
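        // written == 0 (the non-blocking write accepted no bytes because the send buffer is full)
        // is not handled, so the loop retries immediately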
    }
}

In the described scenario this logic effectively becomes a spin loop that burns CPU on the sender host until the Graylog server reads data on its side of the TCP connection. Some backoff is needed here when written == 0.
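
For illustration, a sketch of one possible way to back off inside the write loop; the sleep-based approach and the 1 ms / 100 ms bounds are just assumptions for this example, not an actual patch:

protected void write(ByteBuffer buffer) throws IOException {

    long backoffMillis = 1; // illustrative initial delay

    while (buffer.hasRemaining()) {
        int written = channel().write(buffer);

        if (written < 0 || !isConnected()) {
            // indicator the socket was closed
            Closer.close(channel());
            throw new SocketException("Cannot write buffer to channel");
        }

        if (written == 0) {
            // no bytes accepted: the send buffer is full, so back off instead of spinning
            try {
                Thread.sleep(backoffMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new SocketException("Interrupted while waiting to write buffer to channel");
            }
            backoffMillis = Math.min(backoffMillis * 2, 100); // exponential backoff, capped
        } else {
            backoffMillis = 1; // progress was made, reset the delay
        }
    }
}

An alternative to sleeping would be to register the channel for OP_WRITE on a Selector and wait until it becomes writable again.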

mp911de commented 4 years ago

Our implementation is best-effort. Typically, NIO implementations allow for buffering and don't block the calling thread. We didn't want to rebuild what Netty provides. Feel free to come up with a proposal.

We could provide various backoff strategies (fixed, exponential); not sure we need a backoff with jitter.
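
For illustration, fixed and exponential strategies could sit behind a common abstraction like the hypothetical sketch below (the names and API are assumptions, not what logstash-gelf ships):

// Hypothetical strategy abstraction for illustration only.
interface BackoffStrategy {
    long delayMillis(int attempt);
}

// Fixed backoff: the same delay for every failed write attempt.
class FixedBackoff implements BackoffStrategy {
    private final long delay;

    FixedBackoff(long delay) {
        this.delay = delay;
    }

    @Override
    public long delayMillis(int attempt) {
        return delay;
    }
}

// Exponential backoff: the delay doubles per attempt, up to an upper bound.
class ExponentialBackoff implements BackoffStrategy {
    private final long initialMillis;
    private final long maxMillis;

    ExponentialBackoff(long initialMillis, long maxMillis) {
        this.initialMillis = initialMillis;
        this.maxMillis = maxMillis;
    }

    @Override
    public long delayMillis(int attempt) {
        long delay = initialMillis << Math.min(attempt, 20); // cap the shift to avoid overflow
        return Math.min(delay, maxMillis);
    }
}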

netudima commented 4 years ago

Yes, understood, no concerns here. Unfortunately we have faced this issue on some of our environments several times: due to the described reason, several applications start to consume 2-3x more CPU than in normal operation for ~10 minutes. An async-profiler CPU profile captured at that time shows that biz/paluch/logging/gelf/intern/sender/GelfTCPSender.write accounts for 73% of total CPU usage. The apps did not have a high log rate; based on the file logs written in parallel by a file appender, only 48 messages were logged during this period. The "deliveryAttempts" parameter of the GELF appender has its default value of 1.

> Feel free to come up with a proposal.

Yes, I will prepare an MR with a possible way to add a backoff strategy.

The flamegraph from async profiler:

(flamegraph image attachment: async_profile_cpu_gelf_sender)

mp911de commented 4 years ago

That's in place now.