kdgregory / log4j-aws-appenders

Appenders for Log4J 1.2.x, Log4J 2.x, and Logback that write to AWS destinations.
Apache License 2.0
67 stars 18 forks source link

ERROR writer initialization timed out #167

Closed yagrxu closed 1 year ago

yagrxu commented 1 year ago

I see this error message when application starts, and it hangs for 60 seconds as timeout code indicates. 2022-10-19 16:04:14,305 main ERROR writer initialization timed out However, everything else works without a problem... I would like to ask for some hints where it could be hanging?

My setup:

  1. log4j2.xml configuration
    <?xml version="1.0" encoding="UTF-8"?>
    <Configuration status="INFO" packages="com.kdgregory.log4j2.aws">
    <Appenders>
        <Console name="ConsoleAppender" target="SYSTEM_OUT">
            <PatternLayout pattern="%d{yyy-MM-dd HH:mm:ss.SSS} [%t] %-5level %logger{36} - %X{AWS-XRAY-TRACE-ID} - %msg%n" />
        </Console>
        <CloudWatchAppender name="CLOUDWATCH">
            <logGroup>log4j-appender-demo</logGroup>
            <logStream>demo-{date}-{hostname}-${awslogs:pid}</logStream>
            <dedicatedWriter>true</dedicatedWriter>
            <PatternLayout pattern="%d{yyyy-MM-dd HH:mm:ss.SSS} %-5p [%t] %c - %X{AWS-XRAY-TRACE-ID} - %m%n" />
        </CloudWatchAppender>
    </Appenders>
    <Loggers>
        <Root level="info">
            <AppenderRef ref="ConsoleAppender" />
            <AppenderRef ref="CLOUDWATCH" />
        </Root>
    </Loggers>
    </Configuration>
  2. dependencies:
    <dependencies>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
            <exclusions>
                <exclusion>
                    <groupId>org.springframework.boot</groupId>
                    <artifactId>spring-boot-starter-logging</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-openfeign</artifactId>
        </dependency>
        <dependency>
            <groupId>io.github.openfeign</groupId>
            <artifactId>feign-httpclient</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.cloud</groupId>
            <artifactId>spring-cloud-starter-loadbalancer</artifactId>
        </dependency>
        <!-- log related dependencies-->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-log4j2</artifactId>
        </dependency>
        <dependency>
            <groupId>com.kdgregory.logging</groupId>
            <artifactId>log4j2-aws-appenders</artifactId>
            <version>3.0.1</version>
        </dependency>
        <dependency>
            <groupId>com.kdgregory.logging</groupId>
            <artifactId>aws-facade-v2</artifactId>
            <version>3.0.1</version>
            <exclusions>
                <exclusion>
                    <groupId>software.amazon.awssdk</groupId>
                    <artifactId>sts</artifactId>
                </exclusion>
            </exclusions>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>cloudwatchlogs</artifactId>
            <version>2.17.281</version>
        </dependency>
        <dependency>
            <groupId>software.amazon.awssdk</groupId>
            <artifactId>sts</artifactId>
            <version>2.17.281</version>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>
  3. Scenario: ADOT with Log4j2
    • I would like to leverage open-telemetry java agent to inject %X{AWS-XRAY-TRACE-ID} into cloud watch logs
    • credential chain: role assume is from serviceaccount OIDC provider. and this is why I have to upgrade sts SDK version

So... With this setup, everything works but hanging for 1 minutes when starts - very annoying... Can anyone bing me a silver-lining?

Thanks

kdgregory commented 1 year ago

@yagrxu - are you saying that you see the timeout message but then see messages written to CloudWatch?

This should not be possible: the normal cause of a timeout during appender initialization is networking: either you're running in a private subnet without NAT, or have a security group that restricts outbound access, and in that case the appender should never work.

It is possible but unlikely, for the AWS SDK calls to be internally retrying due to contention. This would require that you have dozens if not hundreds of clients all attempting to perform the same API calls (describing the log group, creating a log group and/or stream, and retrieving sequence token).

It would be helpful if you could change the internal log level to DEBUG as described in the Troubleshooting Guide, and add the output to this issue. That log will give detailed information about the appender's internal initialization.

Regardless, I also need to rethink the initialization timeout. Log4J2 doesn't disable appenders, so the timeout is, as you say, annoying. However, it does provide some value for the case where the appender is blocked by the network.

As for AWS-XRAY-TRACE-ID, that's handled entirely by the layout manager, nothing to do with the appender. Are you adding that key to the mapped diagnostic context?

Finally, you don't need the <exclusion> on the STS JAR, because it's marked as <provided> in the POM. The only "hard" dependencies in this library should be internal ones. If you are seeing that one of the library JARs pulls in a transitive dependency, I'd very much like to see the output of mvn dependency:tree so that I can fix it.

I do, however, recommend using a consistent version for all AWS SDK JARs. While they generally maintain backwards compatibility, in my opinion the v2 SDK is still in flux.

kdgregory commented 1 year ago

After looking at the code, I don't think that wait is required for either Logback or Log4J2.x, so am going to remove it.

I'll probably use a configurable startup timeout for the logwriter itself (which will be running on a thread so not affect overall application start). I also need to think about how the appenders will behave if the timeout expires: do I continue to try to initialize the writer? This seems like it will be wasted in almost all cases, and I should just shut the writer down and discard everything that's on the queue.

Given my current schedule, and the need to review all 9 appenders (3 appenders x 3 logging frameworks), it will be a couple of weeks before this is released.

kdgregory commented 1 year ago

Update: my current estimate is early December.

kdgregory commented 1 year ago

@yagrxu - if you're willing to build it yourself, I've merged this work into the trunk.

It passes all integration tests except for synchonous mode (which hopefully you're not using). I need to make some changes to how that works (it currently has a race condition), along with some other changes. Still looking to make a release either next weekend or the following. Will close the ticket then.

kdgregory commented 1 year ago

This is now on Maven Central.

boroda4436 commented 1 year ago

Thanks for the fix! This is a problem, that I've mentioned in the issue https://github.com/kdgregory/log4j-aws-appenders/issues/159. Version update fixes the issue.