AleAndForCode opened this issue 2 years ago
What version of the com.rabbitmq:amqp-client dependency are you running? Version 4.0.0 introduced a change to ForgivingExceptionHandler.java that logs the error; in previous versions it did nothing with the error. As a workaround, can you downgrade the amqp-client version?
I'm running amqp-client:5.12.0. As a temporary workaround, I changed the GelfAMQPSender class to call Channel#waitForConfirms(long timeout) instead of Channel#waitForConfirms().
...
BasicProperties properties = propertiesBuilder.build();
channel.basicPublish(
        exchangeName,
        routingKey,
        properties,
        toAMQPBuffer(message.toJson()).array());
channel.waitForConfirms(waitForConfirmsTimeout); // trying to prevent deadlock when the AMQP Connection thread logs the MissedHeartbeatException
return true;
...
If this is a correct solution, I can open a PR that adds a waitForConfirmsTimeout config parameter to the GelfAppender.
I'm surprised your workaround mitigates the issue. It's a fundamental design flaw. The logger invokes amqp-client to send messages. The amqp-client invokes the logger to log errors. This circular dependency leads to the deadlock.
Another possible solution is to configure logback so that log events from amqp-client are not sent to the GelfAppender, for example along these lines.
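A minimal logback.xml sketch of that idea; the CONSOLE appender and the WARN level are placeholders for whatever the application already defines:

<configuration>
    <!-- a non-GELF appender for amqp-client's own log events, so the
         "AMQP Connection" thread never re-enters the GelfAppender -->
    <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <pattern>%d %-5level [%thread] %logger - %msg%n</pattern>
        </encoder>
    </appender>

    <!-- additivity="false" keeps com.rabbitmq.* events away from the root
         logger and therefore away from the GelfAppender attached to it -->
    <logger name="com.rabbitmq" level="WARN" additivity="false">
        <appender-ref ref="CONSOLE"/>
    </logger>
</configuration>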
I suppose the deadlock can occur whenever the app tries to log any message over an AMQP connection with an unresponsive peer. The waitForConfirms() and waitForConfirms(0) methods call Object#wait(), so the thread may wait forever (barring a spurious wakeup). Even if log events from amqp-client are not sent to the GelfAppender, once the AMQP connection is disrupted I'm not sure the next GelfAppender send won't run into the same issue.
The amqp-client invokes the logger to log MissedHeartbeatException exceptions from another thread.
If the waitForConfirmsTimeout expires, a TimeoutException is thrown. That recreates the channel and connection and increments the retry counter; eventually the connection recovers or we get an ErrorStatus.
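Roughly like this (a simplified sketch, not the actual GelfAMQPSender code; channelFactory and onError are hypothetical stand-ins for the sender's channel handling and for logback's error reporting):

import com.rabbitmq.client.AMQP.BasicProperties;
import com.rabbitmq.client.Channel;

import java.util.concurrent.Callable;
import java.util.function.Consumer;

class TimedConfirmSketch {

    static boolean sendWithRetries(Callable<Channel> channelFactory,
                                   String exchangeName, String routingKey,
                                   BasicProperties properties, byte[] body,
                                   long waitForConfirmsTimeout,
                                   Consumer<Exception> onError) {
        Exception lastError = null;
        for (int tries = 0; tries < 2; tries++) {
            try {
                // a fresh channel (and, if necessary, connection) on every try
                Channel channel = channelFactory.call();
                channel.basicPublish(exchangeName, routingKey, properties, body);
                // throws TimeoutException instead of blocking forever when the
                // broker stops responding, so the logging thread is released
                channel.waitForConfirms(waitForConfirmsTimeout);
                return true;
            } catch (Exception e) { // TimeoutException, IOException, ...
                lastError = e;
            }
        }
        onError.accept(lastError); // surfaces as an ErrorStatus in logback
        return false;
    }
}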
When the AMQP Connection thread handles an UnexpectedConnectionDriverException (a MissedHeartbeatException, for example), it tries to log the error message. While logging, this thread takes the lock on the GelfAppender monitor and starts waiting for confirms on the unresponsive RabbitMQ connection, with no timeout. All other application threads then become BLOCKED as soon as they try to log anything via the GelfAppender.
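The sequence can be reproduced without amqp-client at all; a minimal sketch (class and thread names are illustrative, the latch stands in for a publisher confirm that never arrives from the unresponsive broker):

import java.util.concurrent.CountDownLatch;

public class AppenderDeadlockSketch {

    // stands in for the GelfAppender monitor (logback's AppenderBase
    // synchronizes doAppend(), so only one thread can be inside it)
    private static final Object APPENDER_LOCK = new Object();
    private static final CountDownLatch CONFIRM = new CountDownLatch(1);

    // every log event ends up here
    static void append(String message) throws InterruptedException {
        synchronized (APPENDER_LOCK) { // (1) take the appender lock
            // basicPublish(...) would happen here
            CONFIRM.await();           // (2) waitForConfirms() with no timeout
        }
    }

    public static void main(String[] args) throws Exception {
        // the "AMQP Connection" thread logs the MissedHeartbeatException and
        // gets stuck at (2), still holding the appender lock
        Thread amqpConnectionThread = new Thread(() -> {
            try { append("MissedHeartbeatException"); } catch (InterruptedException ignored) { }
        }, "AMQP Connection");
        amqpConnectionThread.setDaemon(true);
        amqpConnectionThread.start();
        Thread.sleep(200);

        // any application thread that logs afterwards blocks at (1) forever
        Thread appThread = new Thread(() -> {
            try { append("ordinary application log event"); } catch (InterruptedException ignored) { }
        }, "application-thread");
        appThread.setDaemon(true);
        appThread.start();
        Thread.sleep(200);

        System.out.println(amqpConnectionThread.getName() + " -> " + amqpConnectionThread.getState()); // WAITING
        System.out.println(appThread.getName() + " -> " + appThread.getState());                       // BLOCKED
    }
}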
Part of a thread dump from one of my unresponsive Spring apps:
(one of 24 similar threads)
appender conf:
logback-gelf 1.1.11