spring-projects / spring-framework

Spring Framework
https://spring.io/projects/spring-framework
Apache License 2.0
56.77k stars 38.16k forks source link

spring-boot-starter-webflux doesn't put traceId in logger MDC #29466

Closed 0x006EA1E5 closed 2 years ago

0x006EA1E5 commented 2 years ago

Affects: 6.0.0-RC4

Expected Behaviour When using spring-boot-starter-webflux I expect log message written in @RequestMappings methods to have the traceId and spanId in the MDC.

When using spring-boot-starter-web (non-reactive) this is indeed the observed behaviour, but it is not when using spring-boot-starter-webflux, as the logger MDC has no traceId nor spanId properties.

Having the traceId and spanId in the MDC is helpful, as it allows correlation of spans to log messages. This is an established use case, see for example the "jump from trace to logs" feature in Grafana Tempo.

Observed Behaviour See this repo for example https://github.com/0x006EA1E5/spring-boot-3-reactive-trace-logging

I can see when using spring-boot-starter-web that this functions as expected, as the non-reactive ServerHttpObservationFilter first create, then starts a new Observation, then wraps the rest of the filter chain in a try-with-resources block that opens an Observation.Scope.

However, using spring-boot-starter-webflux, with the reactive ServerHttpObservationFilter, no Observation.Scope is opened, although an Observation is created and start()ed

It seems odd, that although an Observation is created and started, no Observation.Scope is explicitly opened.

I can see that the Observation.Scope is AutoCloseable. Indeed, it appears that if the AutoCloseable is not closed at the end of a request, then the next request maintains the same traceId, which seems obviously wrong.

Therefore, it would seem that the AutoCloseable should be created as a step in the Mono, and closed in a doFinally() handler, but I don't understand this well.

See this thread for more discussion: https://gitter.im/spring-projects/spring-boot?at=636bb545abfb5860bb0aca48

Alternatively, perhaps when an Observation is started, a default Observation.Scope should be started, which should be closed when the Observation is stop()ed?

bclozel commented 2 years ago

Thanks for creating this issue. I'll try to summarize the current state of things here.

MVC and WebFlux have a different instrumentation and that's intended.

With Spring MVC, we're indeed opening an Observation.Scope so that the current context is put in the ThreadLocal. With Spring WebFlux (and as of #29388) we're not opening a scope, because populating the ThreadLocal doesn't make sense. Instead, we're setting the current observation in the reactor context so that it can be picked up as a parent observation later for tracing.

Micrometer documents the various instrumentation strategies for reactive libraries. If I'm not mistaken, using the .tap()or handle() operators will set things up for you, or you can get the reactor context and set up thread locals yourself, as demonstrated in this micrometer sample.

I think you're conflating observation scopes and traces. The Observation API provides an abstraction, and ObservationHandler implementations can react to observations and create metrics, traces as a result. If you're not convinced, you can check out the tracing handler implementation in micrometer. An Observation.Scope is merely a way to wrap calls and make the tracing context available to that snippet. The Javadoc for openScope() and parentObservation() should heopfully clear that up.

We will not open a scope in the reactive Observation filter as it won't solve this problem. Your sample is using a simple, synchronous variant of a Controller handler, but using the same setup and logging from the pipeline (within a deferContextual for example) will not yield the expected result. The critical difference here is that executing a Controller handler method in WebFlux can be asynchronous - it sets up the reactive pipeline but nothing gets executed until the subscription happens. In the case of a reactive web application, this subscription happens on any or even multiple server worker threads.

I'm closing this issue as a result, pinging @marcingrzejszczak so that he can correct me if I said something wrong! Thanks!

marcingrzejszczak commented 2 years ago

I can add one more thing that setting scope in reactive (so putting in thread local) and closing out on finally can lead to the situation where observation is put in thread local for thread 1 but it's cleared for thread 2 (cause on finally would be called in thread 2). So we would be polluting thread number 1.

0x006EA1E5 commented 2 years ago

I think you're conflating observation scopes and traces.

Well, this is probably true; I am just trying to learn this stuff now, so apologies if I have misunderstood. :)

I think I follow you now, and see the issue.

I have had another look at this. It seems that it can be made to work using a Reactor Hook, to call ContextSnapshot.setThreadLocalsFrom around a Subscriber's onNext(): https://github.com/0x006EA1E5/spring-boot-3-reactive-trace-logging/blob/use-reative-hook/src/main/java/me/geales/spring/trace/logging/ObservationContextSnapshotLifter.java#L54

This seems to work well across multiple execution threads.

Would something like the above be problematic, perhaps from a performance point of view?

The thing with all this is, the initial problem statement remains: users will expect the Observation stuff to "just work". As it does in MVC, people will want it to work in a WebFlux app, e.g., if something writes a log deep inside the work of processing a request, it should have the trace info available.

It seems to me that expecting users to known about ContextSnapshot.setThreadLocalsFrom, and paste that boilerplate all over their code, is an unfortunate barrier to adopting "observability" in the Spring WebFlux eco-system 😿

So getting as much of this to work "out of the box" seems like a reasonable thing to expect, at least in Spring Boot.

bclozel commented 2 years ago

I have had another look at this. It seems that it can be made to work using a Reactor Hook, to call ContextSnapshot.setThreadLocalsFrom around a Subscriber's onNext(): https://github.com/0x006EA1E5/spring-boot-3-reactive-trace-logging/blob/use-reative-hook/src/main/java/me/geales/spring/trace/logging/ObservationContextSnapshotLifter.java#L54

This seems to work well across multiple execution threads.

Would something like the above be problematic, perhaps from a performance point of view?

The MDC topic is extensively documented in Reactor and your performance concerns have been experienced by others.

The thing with all this is, the initial problem statement remains: users will expect the Observation stuff to "just work". As it does in MVC, people will want it to work in a WebFlux app, e.g., if something writes a log deep inside the work of processing a request, it should have the trace info available.

Taking a step back, this is not really about observability but the fact that MDC is based on ThreadLocal. We could discuss here the runtime behavior and programming style of reactive and how ThreadLocal could be replaced in such an environment, but that's partially the goal of the entire context-propagation project. Spring Framework here merely applies the guidance provided by context-propagation, micrometer-observation and reactor-core.

It seems to me that expecting users to known about ContextSnapshot.setThreadLocalsFrom, and paste that boilerplate all over their code, is an unfortunate barrier to adopting "observability" in the Spring WebFlux eco-system 😿

So getting as much of this to work "out of the box" seems like a reasonable thing to expect, at least in Spring Boot.

You can follow the reasoning and multiple solutions considered in reactor/reactor-core#3149.

I understand the appeal of that feature, but I think that making the MDC available everywhere in reactive code is a non-goal in general. This is likely to make the runtime cost higher than just using a Servlet-based approach. Issues related to ThreadLocal usage are not limited to reactive, but also a well-known design problem for Loom.

0x006EA1E5 commented 2 years ago

Thanks for the detailed reply.

It is interesting that since https://github.com/reactor/reactor-core/pull/3180 it is now the default behaviour to transparently restore the ThreadLocals for the handle and tap operators, if the context-propagation library is available at runtime.

https://projectreactor.io/docs/core/release/reference/#_operators_that_transparently_restore_a_snapshot_handle_and_tap

One last question: is it worth considering, in the special case where the handler method returns a non-reactive type, to use the handle operator instead of Mono.just(result)?

Something like:

        if (adapter == null) {
            return Mono.just(args)
                .handle((arguments, synchronousSink) -> synchronousSink.next(method.invoke(getBean(), arguments)))
                .map(value -> new HandlerResult(this, value, returnType, bindingContext));
        }

This then delegates all of the special handling to Reactor.

0x006EA1E5 commented 2 years ago

Sorry, the above suggestion is for org.springframework.web.reactive.result.method.InvocableHandlerMethod#invoke

bclozel commented 2 years ago

I don't think we should apply this suggestion as this would only be applied for synchronous method signatures but not on reactive ones. This would confuse developers even more. We should document this properly.

chemicL commented 1 year ago

Please have a look at the outcome of the related issue: https://github.com/spring-projects/spring-boot/issues/33372#issuecomment-1431142013

joseph-freeman commented 1 year ago

Good morning! I'm running into the similar issue when switching from Spring Cloud Sleuth to Micrometer Tracing. With Sleuth everything just seems to work with Reactor based Spring project. When upgrading our service to Spring Boot 3.1.0 and adding the dependency,

<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-tracing-bridge-brave</artifactId>
</dependency>

the traceId and spanId are no logging present in the logs. I've also tried using Hooks.enableAutomaticContextPropagation(), that didn't work either. So it seems that when updating to Spring Boot 3.x, we lose the ability for distributed tracing. I've read this post and many others that seemingly tracing is not supported using Spring Webflux. If there's an update on when this will be available to Webflux projects that would great to now. Thank you!

bclozel commented 1 year ago

@joseph-freeman I'm afraid this is a bit vague. We do have working applications supporting this case. Please create a new issue with a minimal sample application showing the problem.

gianielsevier commented 1 year ago

I'm having exactly the same issue. Using spring boot 3.1.0 with spring-boot-starter-web and micrometer-tracing-bridge-brave I can see the traceId and spanId from the logs. If I switch to spring-boot-starter-webflux the traceId and spanId are gone.

This is the logging pattern for both versions:

logging.pattern.level=%5p [%X{traceId:-},%X{spanId:-}] 
bclozel commented 1 year ago

Same thing @gianielsevier , this description is not enough for us to help you. Please create a new issue with a sample application showing the problem.