speedwing / log4j-cloudwatch-appender


Make sure that we don't use the instance lock as it can cause deadlocks #2

Closed: ConcreteCloudServices closed this issue 7 years ago

ConcreteCloudServices commented 7 years ago

DeadLock

We recently ran into an instance where the CloudWatchAppender was causing a deadlock on our server. This happens because log4j also takes the CloudWatchAppender instance lock (while already holding the logger lock) when it appends messages, so you can end up in the following scenario: an application thread logging through RootLogger grabs the logger lock and then waits for the CloudWatchAppender instance lock, while the appender's background sender thread already holds that instance lock and the AWS client it calls tries to log a message of its own, which needs the logger lock the first thread is holding. Neither thread can make progress (the full jstack traces are in a comment further down).

Infinite Loop

Another possible edge case (although we haven't run into this one) is that we could potentially end up in an infinite loop of trying to send the "unable to send" messages coming from the AWSClient.

If the AWSClient internally logs a message about being unable to send, we would then attempt to re-send that message (causing yet another message). Our solution was to simply ignore all "com.amazonaws" LoggingEvents and let our FileAppender handle those instead (if we ever need to look at them).
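Roughly, the filter looks like this (a simplified sketch rather than our exact appender; only the class-name check is taken from our change, the rest is elided):

import org.apache.log4j.AppenderSkeleton;
import org.apache.log4j.spi.LoggingEvent;

// Sketch: drop AWS SDK events before they are queued, so the CloudWatch
// client can never feed its own log messages back into the appender.
public class CloudWatchAppenderSketch extends AppenderSkeleton {

    @Override
    protected void append(LoggingEvent event) {
        // Ignore anything logged from com.amazonaws.* classes.
        if (event.getLocationInformation().getClassName().startsWith("com.amazonaws")) {
            return; // a FileAppender attached elsewhere can still pick these up
        }
        // ... enqueue the event for the background sender thread here ...
    }

    @Override
    public void close() {
        // nothing to release in this sketch
    }

    @Override
    public boolean requiresLayout() {
        return false;
    }
}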

nemo83 commented 7 years ago

Hi,

thanks for the contribution, and sorry for the trouble the Appender caused you. I had never considered a scenario like this.

While I understand the deadlock issue, I'd like to try to avoid filtering out com.amazonaws logs, as they are log entries users may be interested in exporting to CloudWatch.

I hope to find some time soon to go through the code.

Thanks, Gio

ConcreteCloudServices commented 7 years ago

Hey Gio,

Thanks for the fast response! Also thank you for the contribution — it has helped us quite a bit with our latest AWS project!

Are there any other solutions that could prevent the infinite loop scenario described above (apart from blocking all com.amazonaws events)? Our worry is that the AWSHttpClient's error message might never send successfully no matter how long we wait (for example because of a formatting error), so the batch containing it, and every batch it gets re-added to, would keep failing.

Thanks again!

nemo83 commented 7 years ago

Hi,

I'm taking a look at the PR. Currently I can see two issues:

1) DeadLock: I'm pretty sure, as you correctly pointed out, that on close, if there are messages in the queue, we fail to send them, and the AWS classes are printing error logs, the while loop will always return true and we fall into an infinite loop without ever crashing. So the possible solution there is to replace the while with an if, so that upon closing the Appender we only attempt to send the remainder of the messages once. I would also add a check in append so that messages are discarded if the Appender is already closed.

2) I'm not sure about the very last case you pointed out. If you're using the CW appender, it's because you want these LogEvents to be pushed to CW Logs, so I would not protect against this.

Separately, I've noticed the Appender's daemon only sleeps 20 ms and performs a send regardless of how many messages are in the queue. I would increase the sleep and/or wait for at least 30% to 50% of the buffer to be full before sending to CW. Furthermore, upon an error thrown from CW, I would increase the sleep time exponentially.
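Something along these lines is what I have in mind for the daemon (just a sketch; the field and method names, buffer size and sleep values are illustrative, not the appender's actual code):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the suggested daemon behaviour: flush only once the buffer is
// reasonably full, flush once more on close (an "if", not a "while"), and
// back off exponentially when CloudWatch rejects a batch.
class CloudWatchDaemonSketch {
    private static final int BUFFER_SIZE = 512;
    private static final long BASE_SLEEP_MS = 20;
    private static final long MAX_SLEEP_MS = 30_000;

    private final LinkedBlockingQueue<String> loggingEventsQueue = new LinkedBlockingQueue<>(BUFFER_SIZE);
    private volatile boolean closed = false;

    void daemonLoop() throws InterruptedException {
        long sleepMs = BASE_SLEEP_MS;
        while (!closed) {
            if (loggingEventsQueue.size() >= BUFFER_SIZE * 0.3) {
                try {
                    sendMessages();          // drains the queue and pushes one batch to CloudWatch
                    sleepMs = BASE_SLEEP_MS; // reset the backoff after a successful send
                } catch (RuntimeException cloudWatchError) {
                    sleepMs = Math.min(sleepMs * 2, MAX_SLEEP_MS); // exponential backoff on failure
                }
            }
            Thread.sleep(sleepMs);
        }
        // On close, try the remaining messages once, so a persistent failure
        // cannot loop forever.
        if (!loggingEventsQueue.isEmpty()) {
            sendMessages();
        }
    }

    private void sendMessages() {
        List<String> batch = new ArrayList<>();
        loggingEventsQueue.drainTo(batch);
        // ... map the batch to InputLogEvents and call putLogEvents(...) here ...
    }
}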

Thoughts?

ConcreteCloudServices commented 7 years ago

Hey Gio,

DeadLock

For the deadlock issue, it was actually experienced while our server was running, not during shutdown. Below is the deadlock stack trace from jstack -l <pid> taken during the deadlock (jstack tells the JVM to dump all of its threads).

This is the application-side thread from the scenario described earlier. As you can see, it attempts to log a message, grabs the lock inside RootLogger (0x0000000765f7a030), and then waits to grab the CloudWatchAppender instance lock (0x0000000767aa77d8):

"Timer-0": at org.apache.log4j.AppenderSkeleton.doAppend(AppenderSkeleton.java:231) waiting to lock <0x0000000767aa77d8> (a com.concretesoftware.service.aws.CloudWatchAppender) at org.apache.log4j.helpers.AppenderAttachableImpl.appendLoopOnAppenders(AppenderAttachableImpl.java:66) at org.apache.log4j.Category.callAppenders(Category.java:206) locked <0x0000000765f7a030> (a org.apache.log4j.spi.RootLogger) at org.apache.log4j.Category.forcedLog(Category.java:391) at org.apache.log4j.Category.info(Category.java:666) at com.concretesoftware.sessionController.system.CommonCachingService.run(CommonCachingService.java:184)

This is the appender's sender daemon (the "CWADaemon" described earlier; note that some line numbers may not match because we made additional internal changes to the CloudWatchAppender). As you can see, it holds the CloudWatchAppender instance lock (0x0000000767aa77d8), but while it holds that lock the AmazonHttpClient wants to log a message, which requires the lock inside RootLogger that the other thread already holds (0x0000000765f7a030):

"Thread-17": at org.apache.log4j.Category.callAppenders(Category.java:204) waiting to lock <0x0000000765f7a030> (a org.apache.log4j.spi.RootLogger) at org.apache.log4j.Category.forcedLog(Category.java:391) at org.apache.log4j.Category.log(Category.java:856) at org.apache.commons.logging.impl.Log4JLogger.info(Log4JLogger.java:211) at com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:448) at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:292) at com.amazonaws.services.logs.AWSLogsClient.invoke(AWSLogsClient.java:1117) at com.amazonaws.services.logs.AWSLogsClient.putLogEvents(AWSLogsClient.java:769) at com.concretesoftware.service.aws.CloudWatchAppender.sendToCloudWatch(CloudWatchAppender.java:167) at com.concretesoftware.service.aws.CloudWatchAppender.sendMessages(CloudWatchAppender.java:138) locked <0x0000000767aa77d8> (a com.concretesoftware.service.aws.CloudWatchAppender) at com.concretesoftware.service.aws.CloudWatchAppender.access$100(CloudWatchAppender.java:36) at com.concretesoftware.service.aws.CloudWatchAppender$1.run(CloudWatchAppender.java:221) at java.lang.Thread.run(Thread.java:724)

So to answer your question of "how does this differ at the method synchronization level": a public synchronized sendMessages(...) locks on the CloudWatchAppender instance itself, which is exactly what the RootLogger path also locks on when it calls doAppend.

The change that we submitted instead locks on a different object (sendMessagesLock), which should not conflict with the locking done inside RootLogger.
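In code, the difference is roughly this (simplified; sendMessagesLock is the name used in the PR, and the method bodies are elided):

// Simplified illustration of the locking change.
public class LockingSketch {

    // Before: "synchronized" makes the appender instance itself the monitor --
    // the same monitor log4j takes in AppenderSkeleton.doAppend() while
    // Category.callAppenders() is already holding the logger lock.
    // public synchronized void sendMessages() { ... }

    // After (as in the PR): a dedicated lock object that log4j never touches.
    private final Object sendMessagesLock = new Object();

    public void sendMessages() {
        synchronized (sendMessagesLock) {
            // ... drain the queue and call putLogEvents(...) ...
        }
    }
}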

Infinite Loop

I think our main concern is that we would never be able to send the AWSHttpClient's error log message no matter how long we waited (perhaps due to a formatting error or some other unknown circumstance). So even if we added an exponential wait, it would always cause the entire batch to fail to send (and since we re-add it every time, would cause all future batches to fail too).

This is an unproven concern and, as you pointed out, perhaps isn't appropriate to put into the public repository for everyone. In our use case we just wanted to make sure that this potential risk did not cause the appender to fail to send log messages on our servers.

Thanks Gio for looking over our code and the feedback! It's much appreciated!

nemo83 commented 7 years ago

Hi,

thanks for all the additional details, and for the insight on the synchronized method.

I'm happy to merge the PR with the deadlock fix. I'm not sure about the following if:

if (event.getLocationInformation().getClassName().startsWith("com.amazonaws")) {

You can very elegantly replace it with one line of configuration, and still allow ERROR entries to be sent to CW.

log4j.appender.DW.com.amazonaws=ERROR (or off)
log4j.rootLogger=ERROR, stdout

This could be added to the documentation as a GOTCHA (feel free to add it, or I will add it myself), and it still allows those errors to propagate to CW.
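A sketch of one standard way to express this with plain log4j 1.x properties (not necessarily the exact line above; CLOUDWATCH and FILE are placeholder appender names defined elsewhere in your configuration):

# Root logger as usual; everything else goes to both appenders.
log4j.rootLogger=INFO, CLOUDWATCH, FILE

# Raise the level of the AWS SDK's own loggers so only ERROR entries reach
# the appenders attached to the root logger (use OFF to silence them entirely).
log4j.logger.com.amazonaws=ERROR

# Alternatively, route com.amazonaws only to the file appender:
# log4j.logger.com.amazonaws=ERROR, FILE
# log4j.additivity.com.amazonaws=false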

Is that ok?

ConcreteCloudServices commented 7 years ago

Hey Gio,

Thanks for taking a look! I've pushed a change that removes the following check:

if (event.getLocationInformation().getClassName().startsWith("com.amazonaws")) {

I also pushed one more fix for an issue that we've been seeing when running the appender. The InputLogEvents that we send in CloudWatchClient.putLogEvents(...) must be in chronological order by timestamp. A LinkedBlockingQueue, however, does not guarantee that events are enqueued in timestamp order when multiple threads are logging, so it's possible for the logging events inside our loggingEventsQueue to be out of order, resulting in the following exception:

com.amazonaws.services.logs.model.InvalidParameterException: Log events in a single PutLogEvents request must be in chronological order.

To fix this, we've added a sort on line 114:

.sorted((e1, e2) -> e1.getTimestamp().compareTo(e2.getTimestamp()))
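The flush path now looks roughly like this (a sketch with simplified names and signatures, not the exact appender code):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.LinkedBlockingQueue;

import com.amazonaws.services.logs.AWSLogs;
import com.amazonaws.services.logs.model.InputLogEvent;
import com.amazonaws.services.logs.model.PutLogEventsRequest;

// Sketch: PutLogEvents requires events in chronological order, and the queue
// does not guarantee producers enqueued them that way, so sort by timestamp
// before building the request.
class FlushSketch {
    private final LinkedBlockingQueue<InputLogEvent> loggingEventsQueue = new LinkedBlockingQueue<>();

    void flush(AWSLogs awsLogs, String logGroup, String logStream, String sequenceToken) {
        List<InputLogEvent> batch = new ArrayList<>();
        loggingEventsQueue.drainTo(batch);
        if (batch.isEmpty()) {
            return;
        }
        batch.sort(Comparator.comparing(InputLogEvent::getTimestamp));

        awsLogs.putLogEvents(new PutLogEventsRequest()
                .withLogGroupName(logGroup)
                .withLogStreamName(logStream)
                .withSequenceToken(sequenceToken)
                .withLogEvents(batch));
    }
}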

The documentation change you mentioned sounds good.

Thanks again!