Closed Zernov-A closed 5 months ago
This appears to be a duplicate of https://github.com/spring-projects/spring-data-redis/issues/2814, which the Micrometer team is trying to get to the bottom of. @Zernov-A can you do a thread dump when the application hangs and confirm that it looks similar to the one in that issue?
Thank you. It's really very similar. Here is the thread dump: threaddump.txt
Thanks @Zernov-A. The thread dumps are very similar, with the same libraries in the call stack. I'll close this issue in favor of the Spring Data Redis issue that is already being investigated. Please follow that issue for updates.
@scottfrederick My team is facing a similar issue in production. It seems that currently both this issue and https://github.com/spring-projects/spring-data-redis/issues/2814 were closed and labelled as for external-project. May I know if there's any team working on this bug? If so, can you point me to that team please? Thank you
@jhengy, please capture a thread dump from the hung application and we should then be able to point you in the right direction.
Hi @wilkinsona, thanks for the prompt reply, I've created a simple reproducer and also captured the thread dump.
Note that for our case, unlike the above, we did not enable keyspace events for redis on startup, what triggered this issue happens to be a bean definition for ReactiveRedisMessageListenerContainer
.
Thanks, @jhengy.
As you've seen, you can work around this issue with things like @Lazy
, however, for a more fundamental fix, I think some changes in Spring Framework will be required. Framework currently calls third-party code while holding a lock and doing so is always at risk of deadlock or livelock in a multi-threaded environment.
We'll transfer this to the Framework team so that they can investigate further.
Tagging @mp911de as well as Lettuce is heavily involved here. I think what's happening is that creating the Redis message listener container on the main
thread while Framework's singleton lock is held results in some work being done on lettuce-nioEventLoop-4-1
to establish the connection. The lock will not be released until the listener has been created. Due to instrumentation of Lettuce, establishing the connection requires a bean that's accessed through an ObjectProvider
. Providing the object requires the singleton lock which is held by main
. We're stuck at this point as the lock won't be released until the connection to Redis has been established and the listener container has been created and the connection won't be fully established until the lock has been released and the bean can be provided by the object provider.
Given that we revised the core container towards lenient locking for 6.2 already (#23501), does this scenario work against 6.2.0-M3? I'm afraid there is nothing we can do in 6.1.x as a quick measure, I rather recommend working around this through not waiting for the result of any asynchronous steps in the bean initialization method itself. I recommend doing so wherever possible even when running against 6.2, actually.
does this scenario work against 6.2.0-M3?
I'm not sure. Upgrading the sample to 6.2.0-M3 results in the following failure:
Caused by: org.springframework.context.ApplicationContextException: Component scan could not be used with conditions in REGISTER_BEAN phase: [org.springframework.boot.autoconfigure.condition.OnBeanCondition@629ae7e]
at org.springframework.context.annotation.ConfigurationClassParser.doProcessConfigurationClass(ConfigurationClassParser.java:325)
at org.springframework.context.annotation.ConfigurationClassParser.processConfigurationClass(ConfigurationClassParser.java:267)
at org.springframework.context.annotation.ConfigurationClassParser.processMemberClasses(ConfigurationClassParser.java:409)
at org.springframework.context.annotation.ConfigurationClassParser.doProcessConfigurationClass(ConfigurationClassParser.java:294)
at org.springframework.context.annotation.ConfigurationClassParser.processConfigurationClass(ConfigurationClassParser.java:267)
at org.springframework.context.annotation.ConfigurationClassParser.processImports(ConfigurationClassParser.java:599)
... 18 common frames omitted
@snicoll may be able to give us a feel for it with his branch of Spring Boot that contains the necessary changes for Framework 6.2 compatibility. I say "a feel for it", as whether or not the problem occurs is quite delicate. For example, it does not occur in the supplied sample when the seemingly unrelated org.springframework.cloud:spring-cloud-stream-binder-kafka
dependency is removed.
not waiting for the result of any asynchronous steps in the bean initialization method itself
This would require a change in Spring Data Redis where creating the instance of ReactiveRedisMessageListenerContainer
gets a connection from Lettuce which blocks the main thread until the connection has been established on a Netty-managed thread.
This would require a change in Spring Data Redis where creating the instance of ReactiveRedisMessageListenerContainer gets a connection from Lettuce which blocks the main thread until the connection has been established on a Netty-managed thread.
Alternatively, the arrangement may also enforce all of the required singleton beans to be initialized beforehand. Accessing an existing singleton bean does not require the singleton lock, that's just necessary for creating it on demand.
enforce all of the required singleton beans to be initialized beforehand
That would require changes in Micrometer (Tracing). Boot currently defers the creation of the Tracer
using ObjectProvider
and a singleton supplier as there's a cycle when Prometheus Examplars are involved because the MeterRegistry can depend on the Tracer
(Exemplars) and the Tracer
can depend on the MeterRegistry
(recording metrics).
/cc @jonatan-ivanov
I can confirm the sample fails as Andy described with Spring Boot 3.3.1-SNAPSHOT
and the modifications I've added to make it build cleanly with Spring Framework 6.2.0-SNAPSHOT
. The exception is thrown from an area that was changed in 6.2.x
so it might not be totally related and perhaps the lock still holds once we're past that.
The problem is in org.springframework.cloud.function.context.config.ContextFunctionCatalogAutoConfiguration.PlainFunctionScanConfiguration
that does component scan while also checking if a FunctionCatalog
bean is present. I've created https://github.com/spring-cloud/spring-cloud-function/issues/1149
@jhoeller with ContextFunctionCatalogAutoConfiguration
excluded, the application starts cleanly with Spring Framework 6.2.0-SNAPSHOT
.
Thanks, @snicoll! From a Spring Framework perspective, we seem to have a strategic solution for these kinds of scenarios as of 6.2 then.
Against 6.1.x, it will have to be on the workarounds suggested above, either in Spring Data or in Micrometer. As mentioned, it is generally recommendable to relax such asynchronous init method arrangements and not rely on the container's locking behavior there. Any such change made in Spring Data or Micrometer will make sense against 6.2 as well, even if not strictly necessary there.
@jonatan-ivanov @mp911de, can you please take a look at this? With https://github.com/spring-projects/spring-data-redis/issues/2814 having previously been closed in favor of a Boot issue, we're in danger of going round in circles without identifying a fix.
Spring Data Redis isn't involved much. Tracing is configured on the Lettuce driver through MicrometerCommandLatencyRecorder
that holds an instance of an object that circles back into the Spring context.
My envisioned fix would be moving observability bits into an earlier phase and a facility, that isn't related to context bootstrapping, similar to ApplicationStartup
.
Any asynchronous database activity can suffer from that arrangement; Database activity during the container bootstrap happens on a regular basis and the number of drivers utilizing netty or any other async approach is growing.
From that perspective, I would welcome anything that removes the need for asynchronously fetching beans from a container that is being started.
Until a fix in Framework is available, I think we may have to make changes in several areas to get things working fully.
Maybe we can move key expiration listener initialization from bean initialization step (in RedisKeyValueAdapter) to phase when application context will be fully configured? This should prevent deadlock. The current scenario, if we use the EnableKeyspaceEvents.ON_STARTUP option, leads to the creation of the key expiration listener and the initiation of requests to Redis during bean initialization, and as a result we have deadlock with lazy Tracer initializing.
Is this suggestion from https://github.com/spring-projects/spring-data-redis/issues/2814 not viable? Similarly, avoiding the connection retrieval in the constructor of ReactiveRedisMessageListenerContainer
would also help I think.
From a Framework perspective, I think this issue can be closed. There's a fix coming in 6.2 and nothing, AIUI, that can be done in 6.1.x.
We've got https://github.com/spring-projects/spring-boot/issues/40972 that's tracking the problem on the Boot side (although it may need to move to Micrometer (Tracing) as Boot's code is just trying to work around the MeterRegistry <-> Tracer
cycle), and I'd welcome the re-opening of https://github.com/spring-projects/spring-data-redis/issues/2814 to track adoption of the suggestion to "relax such asynchronous init method arrangements and not rely on the container's locking behavior there".
Thanks for the summary, Andy.
@jhengy, judging by your reproducer, you may be able to work around the problem by defining an eager SpanContextSupplier
:
@Bean
TracerSpanContextSupplier spanContextSuppler(Tracer tracer) {
return new TracerSpanContextSupplier(tracer);
}
static class TracerSpanContextSupplier implements SpanContextSupplier {
private final Tracer tracer;
TracerSpanContextSupplier(Tracer tracer) {
this.tracer = tracer;
}
@Override
public String getTraceId() {
Span currentSpan = currentSpan();
return (currentSpan != null) ? currentSpan.context().traceId() : null;
}
@Override
public String getSpanId() {
Span currentSpan = currentSpan();
return (currentSpan != null) ? currentSpan.context().spanId() : null;
}
@Override
public boolean isSampled() {
Span currentSpan = currentSpan();
if (currentSpan == null) {
return false;
}
Boolean sampled = currentSpan.context().sampled();
return sampled != null && sampled;
}
private Span currentSpan() {
return this.tracer.currentSpan();
}
}
It starts up successfully with this bean in place.
For spring boot 3.3, since the SpanContextSupplier
bean was replaced with a SpanContext
bean, it looks like we need to do the same trick, but then with the SpanContext
?
This seems to work for me:
@Bean
SpanContext spanContext(Tracer tracer) {
return new TracingSpanContext(tracer);
}
static class TracingSpanContext implements SpanContext {
private final Tracer tracer;
TracingSpanContext(Tracer tracer) {
this.tracer = tracer;
}
@Override
public String getCurrentTraceId() {
Span currentSpan = currentSpan();
return (currentSpan != null) ? currentSpan.context().traceId() : null;
}
@Override
public String getCurrentSpanId() {
Span currentSpan = currentSpan();
return (currentSpan != null) ? currentSpan.context().spanId() : null;
}
@Override
public boolean isCurrentSpanSampled() {
Span currentSpan = currentSpan();
if (currentSpan == null) {
return false;
}
Boolean sampled = currentSpan.context().sampled();
return sampled != null && sampled;
}
@Override
public void markCurrentSpanAsExemplar() {
}
private Span currentSpan() {
return this.tracer.currentSpan();
}
}
Would it be possible to allow disabling exemplar support entirely @jonatan-ivanov ?
I see PrometheusExemplarsAutoConfiguration
has the @ConditionalOnClass({ Tracer.class, SpanContext.class })
, which made me think I could just exclude the dependency io.prometheus:prometheus-metrics-tracer-common
, but then PrometheusMetricsExportAutoConfiguration
fails.
In Micrometer, if you don't provide SpanContext
it will not sample Exemplars. I guess the feature you are asking could be a Boot property to not auto-configure it.
You can also create your own @Bean
of PrometheusMeterRegistry
where you don't inject any SpanContext
which disables Exemplars.
You can also create your own @Bean
of SpanContext
where you return false
in isCurrentSpanSampled
and null
in the getters.
It's a hack but Micrometer and Boot also supports some of the Prometheus Client Properties, if you set sampleIntervalMilliseconds
high enough you effectively disable Exemplars. Also, this will not solve the bean creation issue mentioned above since auto-configuration will do the same.
If you want to exclude dependencies, you need to exclude Micrometer Tracing otherwise auto-configuration will try to create a SpanContext
for it.
After upgrading spring-boot to 3.2.1, I encountered an issue where the application hangs on startup. This occurs specifically when using spring-data-redis with enabled keyspace events on startup, in conjunction with the micrometer-prometheus dependencies.
Steps to Reproduce
I have prepared minimal code to reproduce this issue. Below is a simplified version of the build.gradle and DemoApp.java files:
build.gradle
DemoApp.java
Observations
Additional Information
In a typical scenario, the project includes entity and repository classes, along with other dependencies. However, the provided code is a minimal representation.
Removing
enableKeyspaceEvents = RedisKeyValueAdapter.EnableKeyspaceEvents.ON_STARTUP
or changing its value toON_DEMAND
orOFF
allows the application to start correctly.Removing any of the dependencies results in the application starting correctly.
Disabling Prometheus metrics export by setting
management.prometheus.metrics.export.enabled=false
inapplication.properties
also allows the application to start correctly.