spring-projects / spring-framework

Spring Framework
https://spring.io/projects/spring-framework
Apache License 2.0
56.63k stars 38.13k forks source link

Application hangs on startup after upgrading to Spring Boot 3.2.1 with spring-data-redis and micrometer-prometheus dependencies #32996

Closed Zernov-A closed 5 months ago

Zernov-A commented 9 months ago

After upgrading spring-boot to 3.2.1, I encountered an issue where the application hangs on startup. This occurs specifically when using spring-data-redis with enabled keyspace events on startup, in conjunction with the micrometer-prometheus dependencies.

Steps to Reproduce

I have prepared minimal code to reproduce this issue. Below is a simplified version of the build.gradle and DemoApp.java files:

build.gradle

plugins {
    id 'java'
    id 'org.springframework.boot' version "3.2.1"
}

repositories {
    mavenLocal()
    mavenCentral()
}

apply plugin: 'io.spring.dependency-management'

dependencies {
    implementation 'org.springframework.boot:spring-boot-starter-data-redis'
    runtimeOnly 'org.springframework.boot:spring-boot-starter-actuator'
    runtimeOnly 'io.micrometer:micrometer-registry-prometheus'
    runtimeOnly 'io.micrometer:micrometer-tracing-bridge-brave'
}

wrapper {
    gradleVersion = "8.5"
}

DemoApp.java

@SpringBootApplication
@EnableRedisRepositories(enableKeyspaceEvents = RedisKeyValueAdapter.EnableKeyspaceEvents.ON_STARTUP)
public class DemoApp {

    public static void main(String[] args) {
        new SpringApplication(DemoApp.class).run(args);
        System.out.println("Complete!");
    }
}

Observations

Additional Information

In a typical scenario, the project includes entity and repository classes, along with other dependencies. However, the provided code is a minimal representation.

scottfrederick commented 9 months ago

This appears to be a duplicate of https://github.com/spring-projects/spring-data-redis/issues/2814, which the Micrometer team is trying to get to the bottom of. @Zernov-A can you do a thread dump when the application hangs and confirm that it looks similar to the one in that issue?

Zernov-A commented 9 months ago

Thank you. It's really very similar. Here is the thread dump: threaddump.txt

scottfrederick commented 9 months ago

Thanks @Zernov-A. The thread dumps are very similar, with the same libraries in the call stack. I'll close this issue in favor of the Spring Data Redis issue that is already being investigated. Please follow that issue for updates.

jhengy commented 5 months ago

@scottfrederick My team is facing a similar issue in production. It seems that currently both this issue and https://github.com/spring-projects/spring-data-redis/issues/2814 were closed and labelled as for external-project. May I know if there's any team working on this bug? If so, can you point me to that team please? Thank you

wilkinsona commented 5 months ago

@jhengy, please capture a thread dump from the hung application and we should then be able to point you in the right direction.

jhengy commented 5 months ago

Hi @wilkinsona, thanks for the prompt reply, I've created a simple reproducer and also captured the thread dump.

Note that for our case, unlike the above, we did not enable keyspace events for redis on startup, what triggered this issue happens to be a bean definition for ReactiveRedisMessageListenerContainer.

wilkinsona commented 5 months ago

Thanks, @jhengy.

As you've seen, you can work around this issue with things like @Lazy, however, for a more fundamental fix, I think some changes in Spring Framework will be required. Framework currently calls third-party code while holding a lock and doing so is always at risk of deadlock or livelock in a multi-threaded environment.

We'll transfer this to the Framework team so that they can investigate further.

Tagging @mp911de as well as Lettuce is heavily involved here. I think what's happening is that creating the Redis message listener container on the main thread while Framework's singleton lock is held results in some work being done on lettuce-nioEventLoop-4-1 to establish the connection. The lock will not be released until the listener has been created. Due to instrumentation of Lettuce, establishing the connection requires a bean that's accessed through an ObjectProvider. Providing the object requires the singleton lock which is held by main. We're stuck at this point as the lock won't be released until the connection to Redis has been established and the listener container has been created and the connection won't be fully established until the lock has been released and the bean can be provided by the object provider.

jhoeller commented 5 months ago

Given that we revised the core container towards lenient locking for 6.2 already (#23501), does this scenario work against 6.2.0-M3? I'm afraid there is nothing we can do in 6.1.x as a quick measure, I rather recommend working around this through not waiting for the result of any asynchronous steps in the bean initialization method itself. I recommend doing so wherever possible even when running against 6.2, actually.

wilkinsona commented 5 months ago

does this scenario work against 6.2.0-M3?

I'm not sure. Upgrading the sample to 6.2.0-M3 results in the following failure:

Caused by: org.springframework.context.ApplicationContextException: Component scan could not be used with conditions in REGISTER_BEAN phase: [org.springframework.boot.autoconfigure.condition.OnBeanCondition@629ae7e]
        at org.springframework.context.annotation.ConfigurationClassParser.doProcessConfigurationClass(ConfigurationClassParser.java:325)
        at org.springframework.context.annotation.ConfigurationClassParser.processConfigurationClass(ConfigurationClassParser.java:267)
        at org.springframework.context.annotation.ConfigurationClassParser.processMemberClasses(ConfigurationClassParser.java:409)
        at org.springframework.context.annotation.ConfigurationClassParser.doProcessConfigurationClass(ConfigurationClassParser.java:294)
        at org.springframework.context.annotation.ConfigurationClassParser.processConfigurationClass(ConfigurationClassParser.java:267)
        at org.springframework.context.annotation.ConfigurationClassParser.processImports(ConfigurationClassParser.java:599)
        ... 18 common frames omitted

@snicoll may be able to give us a feel for it with his branch of Spring Boot that contains the necessary changes for Framework 6.2 compatibility. I say "a feel for it", as whether or not the problem occurs is quite delicate. For example, it does not occur in the supplied sample when the seemingly unrelated org.springframework.cloud:spring-cloud-stream-binder-kafka dependency is removed.

not waiting for the result of any asynchronous steps in the bean initialization method itself

This would require a change in Spring Data Redis where creating the instance of ReactiveRedisMessageListenerContainer gets a connection from Lettuce which blocks the main thread until the connection has been established on a Netty-managed thread.

jhoeller commented 5 months ago

This would require a change in Spring Data Redis where creating the instance of ReactiveRedisMessageListenerContainer gets a connection from Lettuce which blocks the main thread until the connection has been established on a Netty-managed thread.

Alternatively, the arrangement may also enforce all of the required singleton beans to be initialized beforehand. Accessing an existing singleton bean does not require the singleton lock, that's just necessary for creating it on demand.

wilkinsona commented 5 months ago

enforce all of the required singleton beans to be initialized beforehand

That would require changes in Micrometer (Tracing). Boot currently defers the creation of the Tracer using ObjectProvider and a singleton supplier as there's a cycle when Prometheus Examplars are involved because the MeterRegistry can depend on the Tracer (Exemplars) and the Tracer can depend on the MeterRegistry (recording metrics).

/cc @jonatan-ivanov

snicoll commented 5 months ago

I can confirm the sample fails as Andy described with Spring Boot 3.3.1-SNAPSHOT and the modifications I've added to make it build cleanly with Spring Framework 6.2.0-SNAPSHOT. The exception is thrown from an area that was changed in 6.2.x so it might not be totally related and perhaps the lock still holds once we're past that.

The problem is in org.springframework.cloud.function.context.config.ContextFunctionCatalogAutoConfiguration.PlainFunctionScanConfiguration that does component scan while also checking if a FunctionCatalog bean is present. I've created https://github.com/spring-cloud/spring-cloud-function/issues/1149

snicoll commented 5 months ago

@jhoeller with ContextFunctionCatalogAutoConfiguration excluded, the application starts cleanly with Spring Framework 6.2.0-SNAPSHOT.

jhoeller commented 5 months ago

Thanks, @snicoll! From a Spring Framework perspective, we seem to have a strategic solution for these kinds of scenarios as of 6.2 then.

Against 6.1.x, it will have to be on the workarounds suggested above, either in Spring Data or in Micrometer. As mentioned, it is generally recommendable to relax such asynchronous init method arrangements and not rely on the container's locking behavior there. Any such change made in Spring Data or Micrometer will make sense against 6.2 as well, even if not strictly necessary there.

wilkinsona commented 5 months ago

@jonatan-ivanov @mp911de, can you please take a look at this? With https://github.com/spring-projects/spring-data-redis/issues/2814 having previously been closed in favor of a Boot issue, we're in danger of going round in circles without identifying a fix.

mp911de commented 5 months ago

Spring Data Redis isn't involved much. Tracing is configured on the Lettuce driver through MicrometerCommandLatencyRecorder that holds an instance of an object that circles back into the Spring context. My envisioned fix would be moving observability bits into an earlier phase and a facility, that isn't related to context bootstrapping, similar to ApplicationStartup.

Any asynchronous database activity can suffer from that arrangement; Database activity during the container bootstrap happens on a regular basis and the number of drivers utilizing netty or any other async approach is growing.

From that perspective, I would welcome anything that removes the need for asynchronously fetching beans from a container that is being started.

wilkinsona commented 5 months ago

Until a fix in Framework is available, I think we may have to make changes in several areas to get things working fully.

Maybe we can move key expiration listener initialization from bean initialization step (in RedisKeyValueAdapter) to phase when application context will be fully configured? This should prevent deadlock. The current scenario, if we use the EnableKeyspaceEvents.ON_STARTUP option, leads to the creation of the key expiration listener and the initiation of requests to Redis during bean initialization, and as a result we have deadlock with lazy Tracer initializing.

Is this suggestion from https://github.com/spring-projects/spring-data-redis/issues/2814 not viable? Similarly, avoiding the connection retrieval in the constructor of ReactiveRedisMessageListenerContainer would also help I think.

wilkinsona commented 5 months ago

From a Framework perspective, I think this issue can be closed. There's a fix coming in 6.2 and nothing, AIUI, that can be done in 6.1.x.

We've got https://github.com/spring-projects/spring-boot/issues/40972 that's tracking the problem on the Boot side (although it may need to move to Micrometer (Tracing) as Boot's code is just trying to work around the MeterRegistry <-> Tracer cycle), and I'd welcome the re-opening of https://github.com/spring-projects/spring-data-redis/issues/2814 to track adoption of the suggestion to "relax such asynchronous init method arrangements and not rely on the container's locking behavior there".

snicoll commented 5 months ago

Thanks for the summary, Andy.

wilkinsona commented 5 months ago

@jhengy, judging by your reproducer, you may be able to work around the problem by defining an eager SpanContextSupplier:

    @Bean
    TracerSpanContextSupplier spanContextSuppler(Tracer tracer) {
        return new TracerSpanContextSupplier(tracer);
    }

    static class TracerSpanContextSupplier implements SpanContextSupplier {

        private final Tracer tracer;

        TracerSpanContextSupplier(Tracer tracer) {
            this.tracer = tracer;
        }

        @Override
        public String getTraceId() {
            Span currentSpan = currentSpan();
            return (currentSpan != null) ? currentSpan.context().traceId() : null;
        }

        @Override
        public String getSpanId() {
            Span currentSpan = currentSpan();
            return (currentSpan != null) ? currentSpan.context().spanId() : null;
        }

        @Override
        public boolean isSampled() {
            Span currentSpan = currentSpan();
            if (currentSpan == null) {
                return false;
            }
            Boolean sampled = currentSpan.context().sampled();
            return sampled != null && sampled;
        }

        private Span currentSpan() {
            return this.tracer.currentSpan();
        }

    }

It starts up successfully with this bean in place.

rethab commented 2 months ago

For spring boot 3.3, since the SpanContextSupplier bean was replaced with a SpanContext bean, it looks like we need to do the same trick, but then with the SpanContext?

This seems to work for me:

    @Bean
    SpanContext spanContext(Tracer tracer) {
        return new TracingSpanContext(tracer);
    }

    static class TracingSpanContext implements SpanContext {

        private final Tracer tracer;

        TracingSpanContext(Tracer tracer) {
            this.tracer = tracer;
        }

        @Override
        public String getCurrentTraceId() {
            Span currentSpan = currentSpan();
            return (currentSpan != null) ? currentSpan.context().traceId() : null;
        }

        @Override
        public String getCurrentSpanId() {
            Span currentSpan = currentSpan();
            return (currentSpan != null) ? currentSpan.context().spanId() : null;
        }

        @Override
        public boolean isCurrentSpanSampled() {
            Span currentSpan = currentSpan();
            if (currentSpan == null) {
                return false;
            }
            Boolean sampled = currentSpan.context().sampled();
            return sampled != null && sampled;
        }

        @Override
        public void markCurrentSpanAsExemplar() {
        }

        private Span currentSpan() {
            return this.tracer.currentSpan();
        }
    }
rethab commented 2 months ago

Would it be possible to allow disabling exemplar support entirely @jonatan-ivanov ?

I see PrometheusExemplarsAutoConfiguration has the @ConditionalOnClass({ Tracer.class, SpanContext.class }), which made me think I could just exclude the dependency io.prometheus:prometheus-metrics-tracer-common, but then PrometheusMetricsExportAutoConfiguration fails.

jonatan-ivanov commented 2 months ago

In Micrometer, if you don't provide SpanContext it will not sample Exemplars. I guess the feature you are asking could be a Boot property to not auto-configure it.

You can also create your own @Bean of PrometheusMeterRegistry where you don't inject any SpanContext which disables Exemplars.

You can also create your own @Bean of SpanContext where you return false in isCurrentSpanSampled and null in the getters.

It's a hack but Micrometer and Boot also supports some of the Prometheus Client Properties, if you set sampleIntervalMilliseconds high enough you effectively disable Exemplars. Also, this will not solve the bean creation issue mentioned above since auto-configuration will do the same.

If you want to exclude dependencies, you need to exclude Micrometer Tracing otherwise auto-configuration will try to create a SpanContext for it.