RabbitMQ UnexpectedConnectionDriverException leads to the deadlock

AleAndForCode commented 2 years ago

Then the AMQP Connection thread handles an UnexpectedConnectionDriverException, a MissedHeartbeatException for example, it tries to log the error message. In the process of logging, this thread takes lock on GelfAppender monitor and starts to wait for confirms on the unresponsive rmq connection without timeout. All others' application threads became BLOCKED when they are trying to log something via GelfAppender.

One of my unresponsive spring apps thread dump parts:

"AMQP Connection 10.3.1.123:5672" #16 prio=5 os_prio=0 cpu=9786.67ms elapsed=511530.01s tid=0x00007fde410a3000 nid=0x17 in Object.wait()  [0x00007fde037fa000]
   java.lang.Thread.State: WAITING (on object monitor)
    at java.lang.Object.wait(java.base@11.0.4/Native Method)
    - waiting on <no object reference available>
    at java.lang.Object.wait(java.base@11.0.4/Object.java:328)
    at com.rabbitmq.client.impl.ChannelN.waitForConfirms(ChannelN.java:218)
    - waiting to re-lock in wait() <0x00000000e110d400> (a java.util.Collections$SynchronizedSortedSet)
    at com.rabbitmq.client.impl.ChannelN.waitForConfirms(ChannelN.java:195)
    at com.rabbitmq.client.impl.recovery.AutorecoveringChannel.waitForConfirms(AutorecoveringChannel.java:691)
    at com.github.pukkaone.gelf.protocol.GelfAMQPSender.sendMessage(GelfAMQPSender.java:111)
    at com.github.pukkaone.gelf.logback.GelfAppender.append(GelfAppender.java:298)
    at com.github.pukkaone.gelf.logback.GelfAppender.append(GelfAppender.java:24)
    at ch.qos.logback.core.AppenderBase.doAppend(AppenderBase.java:82)
    - locked <0x00000000e0fc9610> (a com.github.pukkaone.gelf.logback.GelfAppender)
    at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
    at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
    at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
    at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
    at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
    at ch.qos.logback.classic.Logger.error(Logger.java:538)
    at com.rabbitmq.client.impl.ForgivingExceptionHandler.log(ForgivingExceptionHandler.java:119)
    at com.rabbitmq.client.impl.ForgivingExceptionHandler.handleUnexpectedConnectionDriverException(ForgivingExceptionHandler.java:36)
    at com.rabbitmq.client.impl.AMQConnection.handleFailure(AMQConnection.java:777)
    at com.rabbitmq.client.impl.AMQConnection.access$400(AMQConnection.java:47)
    at com.rabbitmq.client.impl.AMQConnection$MainLoop.run(AMQConnection.java:674)
    at java.lang.Thread.run(java.base@11.0.4/Thread.java:834)

"http-nio-8080-exec-11" #134 daemon prio=5 os_prio=0 cpu=10.79ms elapsed=2538.77s tid=0x00007fddd4012000 nid=0x8c waiting for monitor entry  [0x00007fddc57c4000]
   java.lang.Thread.State: BLOCKED (on object monitor)
    at ch.qos.logback.core.AppenderBase.doAppend(AppenderBase.java:63)
    - waiting to lock <0x00000000e0fc9610> (a com.github.pukkaone.gelf.logback.GelfAppender)
    at ch.qos.logback.core.spi.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:51)
    at ch.qos.logback.classic.Logger.appendLoopOnAppenders(Logger.java:270)
    at ch.qos.logback.classic.Logger.callAppenders(Logger.java:257)
    at ch.qos.logback.classic.Logger.buildLoggingEventAndAppend(Logger.java:421)
    at ch.qos.logback.classic.Logger.filterAndLog_0_Or3Plus(Logger.java:383)
    at ch.qos.logback.classic.Logger.log(Logger.java:765)
    at org.apache.logging.slf4j.SLF4JLogger.logMessage(SLF4JLogger.java:232)
    at org.jboss.logging.Log4j2Logger.doLog(Log4j2Logger.java:54)
    at org.jboss.logging.Logger.logv(Logger.java:2226)

(one of 24 similar threads)

appender conf:

    <appender name="GELF-AMQP" class="com.github.pukkaone.gelf.logback.GelfAppender">
        <amqpURI>${AMQP_URI}</amqpURI>
        <amqpExchange>${AMQP_KEY}</amqpExchange>
        <amqpRoutingKey>${AMQP_KEY}</amqpRoutingKey>
        <amqpMaxRetries>5</amqpMaxRetries>
        <amqpAutomaticRecovery>true</amqpAutomaticRecovery>
        <originHost>localhost</originHost>
        <facility>gelf-java</facility>
        <additionalField>application=${APP_NAME}</additionalField>
        <mdcIncluded>true</mdcIncluded>
    </appender>

logback-gelf 1.1.11

pukkaone commented 2 years ago

What version of the com.rabbitmq:amqp-client dependency are you running? Version 4.0.0 introduced a change to ForgivingExceptionHandler.java to log the error. In previous versions, it did nothing with the error. As a work around, can you downgrade the amqp-client version?

AleAndForCode commented 2 years ago

I'm running on amqp-client:5.12.0. As a temporary workaround, I replaced the GelfAMQPSender class with the Channel#waitForConfirms(long timeout) call instead of the Channel#waitForConfirms() one.

...
                BasicProperties properties = propertiesBuilder.build();

                channel.basicPublish(
                        exchangeName,
                        routingKey,
                        properties,
                        toAMQPBuffer(message.toJson()).array());
                channel.waitForConfirms(waitForConfirmsTimeout); // trying to prevent deadlock when the AMQP Connection thread logs the MissedHeartbeatException

                return true;
...

If it's correct solution, I can make PR with the waitForConfirmsTimeout config parameter for the GelfAppender.

pukkaone commented 2 years ago

I'm surprised your workaround mitigates the issue. It's a fundamental design flaw. The logger invokes amqp-client to send messages. The amqp-client invokes the logger to log errors. This circular dependency leads to the deadlock.

Another possible solution is configure logback to not send log events from amqp-client to the GelfAppender.

AleAndForCode commented 2 years ago

I suppose that deadlock can occur when the app tries to log any message via amqp connection with an unresponsive peer. The waitForConfirms() or the waitForConfirms(0) methods call Object#wait() and thread may wait forever (except spurious wakeup). If log events from amqp-client aren't send to the GelfAppender and amqp connection became disrupted, I'm not sure that any next GelfAppender sending will not lead to the same issue.

The amqp-client invokes the logger to log MissedHeartbeatException exceptions from another tread.

If the waitForConfirmsTimeout expires, a TimeoutException will be thrown. That will recreate the channel, connection and increments tries. Finally, the connection will be recovered or we will get the ErrorStatus.

pukkaone / logback-gelf

RabbitMQ UnexpectedConnectionDriverException leads to the deadlock #21