osiegmar / logback-gelf

Logback appender for sending GELF messages with zero additional dependencies.
GNU Lesser General Public License v2.1
218 stars 58 forks source link

Some messages are lost #51

Closed tarasrng closed 4 years ago

tarasrng commented 4 years ago

I'm using the latest logback-gelf appender (3.0.0) and also tried the old one (2.2.0) with Spring Boot 2.2.2.RELEASE (web app)

Not all the messages are reaching Graylog, and when the app is under heavy load almost none of them are logged.

When enabling logback debug I see the following errors very often:

ERROR in de.siegmar.logbackgelf.GelfTcpAppender[GELF] - Error sending message via tcp://my host:12201 java.net.SocketException: Broken pipe (Write failed)
    at java.net.SocketException: Broken pipe (Write failed)
    at  at java.base/java.net.SocketOutputStream.socketWrite0(Native Method)
    at  at java.base/java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:110)
    at  at java.base/java.net.SocketOutputStream.write(SocketOutputStream.java:138)
    at  at de.siegmar.logbackgelf.TcpConnection.write(TcpConnection.java:55)
    at  at de.siegmar.logbackgelf.GelfTcpAppender$2.accept(GelfTcpAppender.java:176)
    at  at de.siegmar.logbackgelf.GelfTcpAppender$2.accept(GelfTcpAppender.java:173)
    at  at de.siegmar.logbackgelf.pool.SimpleObjectPool.execute(SimpleObjectPool.java:63)
    at  at de.siegmar.logbackgelf.GelfTcpAppender.sendMessage(GelfTcpAppender.java:173)
    at  at de.siegmar.logbackgelf.GelfTcpAppender.appendMessage(GelfTcpAppender.java:147)
    at  at de.siegmar.logbackgelf.AbstractGelfAppender.append(AbstractGelfAppender.java:104)
    at  at de.siegmar.logbackgelf.AbstractGelfAppender.append(AbstractGelfAppender.java:27)
    at  at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
    at  at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
    at  at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:290)

Also, there are the following errors:

ERROR in de.siegmar.logbackgelf.GelfTcpAppender[GELF] - Error sending message via tcp://my host:12201 java.net.SocketException: An established connection was aborted by the software in your host machine
    at java.net.SocketException: An established connection was aborted by the software in your host machine
    at  at java.base/sun.nio.ch.NioSocketImpl.implWrite(NioSocketImpl.java:421)
    at  at java.base/sun.nio.ch.NioSocketImpl.write(NioSocketImpl.java:441)
    at  at java.base/sun.nio.ch.NioSocketImpl$2.write(NioSocketImpl.java:825)
    at  at java.base/java.net.Socket$SocketOutputStream.write(Socket.java:1007)
    at  at java.base/java.io.OutputStream.write(OutputStream.java:122)
    at  at de.siegmar.logbackgelf.TcpConnection.write(TcpConnection.java:55)
    at  at de.siegmar.logbackgelf.GelfTcpAppender.lambda$sendMessage$1(GelfTcpAppender.java:161)
    at  at de.siegmar.logbackgelf.pool.SimpleObjectPool.execute(SimpleObjectPool.java:63)
    at  at de.siegmar.logbackgelf.GelfTcpAppender.sendMessage(GelfTcpAppender.java:161)
    at  at de.siegmar.logbackgelf.GelfTcpAppender.appendMessage(GelfTcpAppender.java:135)
    at  at de.siegmar.logbackgelf.AbstractGelfAppender.append(AbstractGelfAppender.java:104)
    at  at de.siegmar.logbackgelf.AbstractGelfAppender.append(AbstractGelfAppender.java:27)
    at  at ch.qos.logback.core.UnsynchronizedAppenderBase.doAppend(UnsynchronizedAppenderBase.java:84)
    at  at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
    at  at ch.qos.logback.core.AsyncAppenderBase$Worker.run(AsyncAppenderBase.java:290)

Here is my config

<appender name="GELF" class="de.siegmar.logbackgelf.GelfTcpAppender">
        <graylogHost>${MY_GRAYLOG_HOST}</graylogHost>
        <graylogPort>12201</graylogPort>
        <connectTimeout>15000</connectTimeout>
        <reconnectInterval>300</reconnectInterval>
        <maxRetries>2</maxRetries>
        <retryDelay>3000</retryDelay>
        <poolSize>2</poolSize>
        <poolMaxWaitTime>5000</poolMaxWaitTime>
        <encoder class="de.siegmar.logbackgelf.GelfEncoder">
            <originHost>localhost</originHost>
            <includeMdcData>true</includeMdcData>
            <shortPatternLayout class="ch.qos.logback.classic.PatternLayout">
                <pattern>%m%nopex</pattern>
            </shortPatternLayout>
            <fullPatternLayout class="ch.qos.logback.classic.PatternLayout">
                <pattern>%m%n</pattern>
            </fullPatternLayout>
            <staticField>application:${application}</staticField>
        </encoder>
    </appender>

    <appender name="ASYNC_GELF" class="ch.qos.logback.classic.AsyncAppender">
        <appender-ref ref="GELF" />
    </appender>

    <root level="info">
        <appender-ref ref="ASYNC_GELF" />
        <appender-ref ref="Console" />
    </root>
osiegmar commented 4 years ago

Hello @tarasrng !

Please check AsyncAppender documentation about how to configure queueSize and discardingThreshold regarding dropped messages in heavy load situations.

As the error message from your stack trace 'An established connection was aborted by the software in your host machine' says, it is because of a closed TCP connection. Maybe the connection has been closed by Graylog or a firewall. Depending on the configuration of this library, a retry operation is being performed afterwards with another connection but that retry could end up with a similar exception if this other connection also has been closed.

You could try the yet unreleased version of this library and configure the newly poolMaxIdleTime to automatically close connections after some time of inactivity. I think, this would probably solve your problem. See #49 for more details about this.

Let me know, if that helped and the value you had to use.

tarasrng commented 4 years ago

@osiegmar thank you for suggestions! I tried configuring queueSizeand discardingThreshold- unfortunately, no luck. Not sure how poolMaxIdleTime can help, it looks more like a cleanup parameter. Furthermore, I see lost messages even when running a short test app that runs only ~10 seconds. Maybe retry configuration has to be tuned. We will also try to check Graylog and monitor traffic using WireShark.

osiegmar commented 4 years ago

Indeed, poolMaxIdleTime probably wouldn't help for that short lifetime. If the problem also appear on an application that runs for ~10 seconds, tcpdump or Wireshark are definitely the tools I'd use for debugging. Just another thought – you're shutting down the application gracefully, right? You may check the documentation about stopping logback-classic for additional (debugging) steps in order to ensure graceful shutdown.

tarasrng commented 4 years ago

Yep, it's a graceful shutdown, and I even do sleep to make sure all the messages have a chance to be logged.

osiegmar commented 4 years ago

I'll close this ticket now as I highly doubt that it is related to this library itself. Please re-open it if you have additional information.