kdgregory / log4j-aws-appenders

Appenders for Log4J 1.2.x, Log4J 2.x, and Logback that write to AWS destinations.
Apache License 2.0
67 stars 18 forks source link

Repeated failures in CloudWatchAppenderIntegrationTest.testMultipleThreadsMultipleAppendersSameDestination() #184

Closed kdgregory closed 10 months ago

kdgregory commented 1 year ago

This issue exists to document what I believe to be a failure of the AWS SDK to properly retrieve messages (will be confirming with another support ticket).

This test will occasionally fail with an assertion error: "number of events in AppenderTest expected:<20000> but was:<19998>" (the actual number may vary).

When this happens, if I use my logs_reader.py utility to read the log entries, I discover that some of the early entries are missing. However, exporting the log group to S3 creates a file with all expected records. And CloudWatch Logs Insights shows the proper count, and will find the messages that logs_reader.py failed to retrieve (although see first comment below).

This does not appear to be an issue with reading the log stream too recently after writing the events (ie, eventual consistency), because the same set of events come back even if I read them days later.

The comments keep track of the things that I've tried.

kdgregory commented 1 year ago

First, I discovered that CloudWatch Logs Insights will not find messages whose timestamp is before the log group creation time. Or, as described in the AWS response to the support ticket that I raised:

You were not able to query these log event due to internally as per the design, its limitation of the CloudWatch Log Insights can not query Log Events that have Timestamps that are before the Timestamp of the Log Group Creation Time. The internal team is working on for a fix in the future. Currently, there is no ETA when it will be fixed. I have added +1 on this issue, and hopefully it will be fixed as soon as possible.

In the case of the integration test, all of the events had timestamps before the log group was created, so Insights said that there were no events in the log group.

After pre-creating the log group, this problem went away. And I was able to successfully retrieve the events via Insights that I couldn't retrieve via GetLogEvents. So at least I could confirm that the appender was doing the right thing.

kdgregory commented 1 year ago

Before I was able to verify that the log events are in fact readable depending on how you do it, I tried the following things to see if it was an issue with my appenders:

I also wrote a program to slam Logs with concurrent requests containing several thousand records. I did not see any message drops (nor did I see any evidence of throttling, even with 32 concurrent threads!).

kdgregory commented 1 year ago

Changes to the integration tests themselves:

kdgregory commented 1 year ago

I was able to replicate with a test program: it happens when concurrent threads attempt to create the same log group, and one (or more) of them get OperationAbortedException. The API doc doesn't give any indication that this would be an issue, and the exception message itself says "A conflicting operation is currently in progress against this resource. Please try again."

I had raised an issue with AWS. They had acknowledged that the retrieval wasn't working, but couldn't come up with a way to re-create, so the issue was in limbo ... hopefully with this example program it will move forward now.

I think this is enough of an edge case that it won't prevent me from releasing the new library update. I'll document the issue and recommend pre-creating the log group if there are concurrent writers.

kdgregory commented 1 year ago

After several twists and turns, in which I discovered that Support had not, actually, verified my statement because they didn't have permission to access the log data and didn't ask me to grant it, the ticket has landed with someone from the CloudWatch team who was able to reproduce from my gist.

From the ticket:

I can acknowledge that I do see an issue such that GetLogEvents API doesn't appear to be returning all the logs available in the Log Group / Stream, as of now I'm still investigating this behaviour and will update you at a later time.

kdgregory commented 1 year ago

Re-opening, because I'm still trying to figure this out, as is AWS.

I not longer believe that OperationAbortedException is the cause. I modified the integration test to pre-create both log group and log stream, and still run into the problem. What follows is the log from running the Logback version of the integration tests, with full debugging enabled:

[INFO] Running com.kdgregory.logback.aws.CloudWatchAppenderIntegrationTest
13:52:03.704 INFO  - testMultipleThreadsMultipleAppendersSameDestination: starting
13:52:03.706 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: deleting log group AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04.213 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: pre-creating log group
13:52:04.213 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: creating log group AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04.312 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: waiting for group AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination to be available
13:52:04.342 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: pre-creating log streams
13:52:04.342 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: creating log stream AppenderTest
13:52:04.375 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: waiting for stream AppenderTest to be available
13:52:04,502 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
13:52:04,502 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [console]
13:52:04,503 |-INFO in ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] property
13:52:04,504 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [com.kdgregory.logback.aws.CloudWatchAppender]
13:52:04,510 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [test1]
13:52:04,521 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test1] - log writer starting (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-1)
13:52:04,522 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test1] - checking for existence of CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04,522 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [com.kdgregory.logback.aws.CloudWatchAppender]
13:52:04,522 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [test2]
13:52:04,527 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [com.kdgregory.logback.aws.CloudWatchAppender]
13:52:04,528 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [test3]
13:52:04,536 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test2] - log writer starting (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-2)
13:52:04,536 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test2] - checking for existence of CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04,537 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [com.kdgregory.logback.aws.CloudWatchAppender]
13:52:04,537 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [test4]
13:52:04,541 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - About to instantiate appender of type [com.kdgregory.logback.aws.CloudWatchAppender]
13:52:04,541 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - Naming appender as [test5]
13:52:04,542 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test4] - log writer starting (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-4)
13:52:04,543 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test4] - checking for existence of CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04,550 |-INFO in ch.qos.logback.classic.joran.action.RootLoggerAction - Setting level of ROOT logger to INFO
13:52:04,551 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [console] to Logger[ROOT]
13:52:04,551 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [com.kdgregory] to DEBUG
13:52:04,551 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [TestLogger1] to DEBUG
13:52:04,551 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting additivity of logger [TestLogger1] to false
13:52:04,551 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [test1] to Logger[TestLogger1]
13:52:04,551 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [TestLogger2] to DEBUG
13:52:04,551 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting additivity of logger [TestLogger2] to false
13:52:04,551 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [test2] to Logger[TestLogger2]
13:52:04,551 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [TestLogger3] to DEBUG
13:52:04,551 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting additivity of logger [TestLogger3] to false
13:52:04,551 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [test3] to Logger[TestLogger3]
13:52:04,552 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [TestLogger4] to DEBUG
13:52:04,552 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting additivity of logger [TestLogger4] to false
13:52:04,552 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [test4] to Logger[TestLogger4]
13:52:04,552 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting level of logger [TestLogger5] to DEBUG
13:52:04,552 |-INFO in ch.qos.logback.classic.joran.action.LoggerAction - Setting additivity of logger [TestLogger5] to false
13:52:04,552 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [test5] to Logger[TestLogger5]
13:52:04,552 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
13:52:04,552 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@31d0e481 - Registering current configuration as safe fallback point
13:52:04,554 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test5] - log writer starting (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-5)
13:52:04,554 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test5] - checking for existence of CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04.565 DEBUG - : writer 1 writing 1000 messages
13:52:04.580 DEBUG - : writer 2 writing 1000 messages
13:52:04,586 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test3] - log writer starting (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-3)
13:52:04,587 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test3] - checking for existence of CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04.590 DEBUG - : writer 3 writing 1000 messages
13:52:04.607 DEBUG - : writer 4 writing 1000 messages
13:52:04.627 DEBUG - : writer 5 writing 1000 messages
13:52:04.632 DEBUG - : writer 6 writing 1000 messages
13:52:04.641 DEBUG - : writer 7 writing 1000 messages
13:52:04.647 DEBUG - : writer 9 writing 1000 messages
13:52:04,656 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test1] - using existing CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04,656 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test1] - checking for existence of CloudWatch log stream: AppenderTest
13:52:04.670 DEBUG - : writer 10 writing 1000 messages
13:52:04,675 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test1] - using existing CloudWatch log stream: AppenderTest
13:52:04.688 DEBUG - : writer 11 writing 1000 messages
13:52:04.696 DEBUG - : writer 12 writing 1000 messages
13:52:04,700 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test4] - using existing CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04,700 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test4] - checking for existence of CloudWatch log stream: AppenderTest
13:52:04,705 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test3] - using existing CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04,706 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test3] - checking for existence of CloudWatch log stream: AppenderTest
13:52:04,707 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test5] - using existing CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04,707 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test5] - checking for existence of CloudWatch log stream: AppenderTest
13:52:04,714 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test4] - using existing CloudWatch log stream: AppenderTest
13:52:04.717 DEBUG - : writer 13 writing 1000 messages
13:52:04.725 DEBUG - : writer 14 writing 1000 messages
13:52:04.726 DEBUG - : writer 15 writing 1000 messages
13:52:04,726 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test2] - using existing CloudWatch log group: AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination
13:52:04,726 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test2] - checking for existence of CloudWatch log stream: AppenderTest
13:52:04.736 DEBUG - : writer 16 writing 1000 messages
13:52:04,743 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test5] - using existing CloudWatch log stream: AppenderTest
13:52:04,743 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test3] - using existing CloudWatch log stream: AppenderTest
13:52:04.751 DEBUG - : writer 17 writing 1000 messages
13:52:04,757 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test2] - using existing CloudWatch log stream: AppenderTest
13:52:04.762 DEBUG - : writer 18 writing 1000 messages
13:52:04.773 DEBUG - : writer 19 writing 1000 messages
13:52:04,807 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test2] - log writer initialization complete (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-2)
13:52:04.807 DEBUG - : writer 20 writing 1000 messages
13:52:04,810 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test3] - log writer initialization complete (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-3)
13:52:04,812 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test5] - log writer initialization complete (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-5)
13:52:04,819 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test4] - log writer initialization complete (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-4)
13:52:04,819 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test1] - log writer initialization complete (thread: com-kdgregory-aws-logwriter-logback-cloudwatch-1)
13:52:04.648 DEBUG - : writer 8 writing 1000 messages
13:52:04.830 INFO  - testMultipleThreadsMultipleAppendersSameDestination: waiting for loggers
13:52:05,307 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test2] - about to write batch of 4000 message(s)
13:52:05,310 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test3] - about to write batch of 4000 message(s)
13:52:05,316 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test5] - about to write batch of 4000 message(s)
13:52:05,319 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test4] - about to write batch of 4000 message(s)
13:52:05,321 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test1] - about to write batch of 4000 message(s)
13:52:05,633 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test3] - wrote batch of 4000 message(s)
13:52:05,652 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test4] - wrote batch of 4000 message(s)
13:52:05,655 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test2] - wrote batch of 4000 message(s)
13:52:05,681 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test1] - wrote batch of 4000 message(s)
13:52:05,685 |-INFO in com.kdgregory.logback.aws.CloudWatchAppender[test5] - wrote batch of 4000 message(s)
13:52:05.733 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: waiting for stream AppenderTest to be available
13:52:05.749 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: retrieving messages from AppenderTest
13:53:10.841 DEBUG - testMultipleThreadsMultipleAppendersSameDestination: retrieved 18594 messages from AppenderTest
13:53:10.842 INFO  - testMultipleThreadsMultipleAppendersSameDestination: finished
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 68.058 s <<< FAILURE! - in com.kdgregory.logback.aws.CloudWatchAppenderIntegrationTest
[ERROR] com.kdgregory.logback.aws.CloudWatchAppenderIntegrationTest.testMultipleThreadsMultipleAppendersSameDestination  Time elapsed: 67.145 s  <<< FAILURE!
java.lang.AssertionError: number of events in AppenderTest expected:<20000> but was:<18594>

When I use the CLI to describe the log stream, this is what I get back:

{
    "logStreams": [
        {
            "logStreamName": "AppenderTest",
            "creationTime": 1687701124368,
            "firstEventTimestamp": 1687701124627,
            "lastEventTimestamp": 1687701124830,
            "lastIngestionTime": 1687701125584,
            "uploadSequenceToken": "49039859551048028111434037619718318688748177139317960595",
            "arn": "arn:aws:logs:us-east-1:REDACTED:log-group:AppenderIntegrationTest-testMultipleThreadsMultipleAppendersSameDestination:log-stream:AppenderTest",
            "storedBytes": 0
        }
    ]
}

Note that the firstEventTimestamp is 1687701124627, but according to the log from the test, it should be 1687701124565.

kdgregory commented 1 year ago

As an FYI, if you are running into issues retrieving messages from CloudWatch Logs, you can tell support to look at case 12882575541. They might not be able to (and might not tell you that), but it's at least worth a shot.

kdgregory commented 10 months ago

AWS said they fixed the problem, and I've run through the integration tests three times without a failure, so I'm going to call this good to go and move forward with releasing 3.2.0.