spring-projects / spring-framework

Spring Framework
https://spring.io/projects/spring-framework
Apache License 2.0
56.46k stars 38.09k forks source link

WebClient toBodilessEntity loses tracing context in Spring WebFlux 3.0.5 #30273

Closed davidmelia closed 1 year ago

davidmelia commented 1 year ago

Hi,

I'm diagnosing a production Spring WebFlux 3.0.5 bug where certain requests are losing their tracing context and I've narrowed this down to the following (example https://github.com/davidmelia/spring-cloud-function-zipkin/tree/bodiless_entity_bug_

  @GetMapping("/notWork")
  Mono<String> notWork() {
    log.info("1) this has trace id");
    return webClient.head().uri("https://httpbin.org/status/200").retrieve().toBodilessEntity().map(r -> {
      log.info("2) this does not have a trace id {}",r.getStatusCode());
      return r;
    }).thenReturn("OK");
  }

...

INFO [aid=-,tid=642af7c399283af5b17037a829cb556a,sid=b17037a829cb556a] 56133 --- [ctor-http-nio-3] example.DemoController                   : 1) this has trace id
INFO [aid=-,tid=,sid=] 56133 --- [ctor-http-nio-3] example.DemoController                   : 2) this does not have a trace id 200 OK

where toBodilessEntity loses the tracing context. N.B. I am taking advantage of Hooks.enableAutomaticContextPropagation();

The work around is

  @GetMapping("/works")
  Mono<String> works() {
    log.info("3) this has trace id");
    return webClient.head().uri("https://httpbin.org/status/200").retrieve().toEntity(String.class).map(r -> {
      log.info("4) this has trace id {}",r.getStatusCode());
      return r;
    }).thenReturn("OK");
  }

I assume toBodilessEntity should keep the trace context?

Thanks

bclozel commented 1 year ago

Hello @davidmelia , thanks for raising this issue. The tracing context is made available during the observation, not after. In this case, the toBodilessEntity (and other "final" methods in the fluent API) returns the result of the HTTP client operation, here a Mono<ResponseEntity<Void>>. We assume that when you're given the actual response, the client observation is finished and the metric/trace is recorded. Extending the tracing context while you're map-ing on the result would sound like a bug: the time spent on transforming the response into another object should not count against the HTTP client observation.

The workaround you're sharing is not a workaround in my opinion: this merely enables the context propagation for ThreadLocal - here you should see the tracing context for the HTTP server observation, which is still ongoing at that stage since this is being executed within a web controller.

Am I missing something?

davidmelia commented 1 year ago

I don't quite follow you but we use trace IDs in production to log trace requests. I expect the server trace ID to be propagated through the stack. Using Spring Cloud Sleuth this was the case and the above example preserved the 'tid'. After upgrading to Spring Webflux 3 I can see trace IDs disappearing and I have narrowed down to the use of toBodilessEntity.

This is either a bug OR WebFlux users who use enableAutomaticContextPropagation should not use toBodilessEntity as it wipes out log trace IDs.

bclozel commented 1 year ago

Sorry, I misunderstood the issue - I thought this was about the client trace, not about the server trace. Let me have a look into this.

bclozel commented 1 year ago

I had a deeper look and I did reproduce the problem. I've checked the reactive pipeline for the toBodilessEntity method and it doesn't seem that we're disconnecting the chain in any way. The main difference with other operators could be here.

@davidmelia I've modified your sample application to ensure that the current Observation is in the reactor context right after the toBodilessEntity call and it is the case. At this stage, I was suspecting a subtle bug with the automatic context propagation feature. Upgrading to Spring Boot 3.0.6-SNAPSHOT (and reactor-core 3.5.5-SNAPSHOT) did solve the problem. I'm now seeing:

2023-04-07T11:22:46.381+02:00  INFO [aid=spring-cloud-function-zipkin,tid=642fe0e6e6e02ea934d44e45df0d661e,sid=34d44e45df0d661e] 23300 --- [ctor-http-nio-3] example.DemoController                   : 1) this has trace id
2023-04-07T11:22:47.156+02:00  INFO [aid=spring-cloud-function-zipkin,tid=,sid=d193b749e4691a89] 23300 --- [ctor-http-nio-3] example.DemoController                   : current observation name='http.server.requests', contextualName='null', error='null', lowCardinalityKeyValues=[exception='none', method='GET', outcome='SUCCESS', status='200', uri='UNKNOWN'], highCardinalityKeyValues=[http.url='/notWork'], map=[class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=642fe0e6e6e02ea934d44e45df0d661e/34d44e45df0d661e}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=0.795679716, duration(nanos)=7.95679716E8, startTimeNanos=5854912936959}', class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@646c1e62'], parentObservation=null
2023-04-07T11:22:47.158+02:00  INFO [aid=spring-cloud-function-zipkin,tid=642fe0e6e6e02ea934d44e45df0d661e,sid=34d44e45df0d661e] 23300 --- [ctor-http-nio-3] example.DemoController                   : 2) this does not have a trace id 200 OK

This change is available in my fork. I'm closing this issue as it's probably been fixed in reactor-core directly.

Tagging @chemicL to let him know 👍

chemicL commented 1 year ago

@bclozel thanks for letting me know.

Actually, the latest changes in Reactor don't really fix the reported issue. The fact that you introduced tap operators to the chains make a difference, as now they are considered as Context-altering barriers, in which case the ThreadLocal values do get restored. The same as if a contextWrite(Function.identity()) happened before the map operator.

This helps as the server Context with an Observation is attached to the entire chain by the server. However, the response is delivered from Netty on the event loop. If a contextWrite or tap operator is not along the way, there's nothing to restore in the eyes of reactor-core.

The difference between toEntity and toBodilessEntity is that in case of the latter, the response body Mono is not executed, it is simply ignored:

handleBodyMono(response, response.bodyToMono(bodyClass)

vs

handleBodyMono(response, Mono.<Void>empty())

I'm not yet certain why delivering the entity via the body processing passes the contextWrite operator and in the case of an empty body it does not, but this needs investigating. I would leave the issue open until we understand what is happening. For now, obviously, the workaround is:

toBodilessEntity().contextWrite(Function.identity())
bclozel commented 1 year ago

Reopening so we can investigate. This could be a Framework, Reactor Core or Reactor Netty issue.

chemicL commented 1 year ago

The previous suggestion is incorrect, it has nothing to do with using Mono.empty() in the handleBodyToMono argument.

The reason for the different behaviour is the fact that upon termination, in case of toEntity, the Publisher completion signal travels via responseMono's contextWrite operator defined in the exchange() method, and the completion signal triggers calling the user's code:

    @Override
    public <T> Mono<ResponseEntity<T>> toEntity(Class<T> bodyClass) {
        return this.responseMono.flatMap(response ->
                WebClientUtils.mapToEntity(response,
                        handleBodyMono(response, response.bodyToMono(bodyClass))));
    }

The case in toBodilessEntity is different, because the completion signal doesn't enter the responseMono any more - there is a new Mono left to subscribe to after the response is finished in this case, and it's the Mono.just(entity) created within the last line of the method in thenReturn(entity):

    @Override
    public Mono<ResponseEntity<Void>> toBodilessEntity() {
        return this.responseMono.flatMap(response ->
                WebClientUtils.mapToEntity(response, handleBodyMono(response, Mono.<Void>empty()))
                        .flatMap(entity -> response.releaseBody() // this is the
                                                   // difference
                                .onErrorResume(WebClientUtils.WRAP_EXCEPTION_PREDICATE, exceptionWrappingFunction(response))
                                .thenReturn(entity))
        );
    }

My current understanding is that the signals delivered from reactor-netty need to always pass a context restoring Subscriber. Whether reactor-netty should add such protection is debatable. I'm also unsure whether reactor-core should do it - we recently introduced protection against foreign Publishers that are wrapped using Flux.from(). In case signals are delivered from a different Thread, a dedicated Subscriber restores ThreadLocal values from the Subscriber Context.

The signal to the WebClient's Mono is delivered from reactor.netty.channel.FluxReceive#onInboundComplete. There should be a Subscriber that restores ThreadLocals directly after any signal delivered by reactor-netty. Where it should be handled is a matter for discussion.

chemicL commented 1 year ago

Here's a minimal reproducer for the issue:

@SpringBootApplication
public class WebFluxApplication {

    public static void main(String[] args) {
        SpringApplication.run(WebFluxApplication.class, args);
        Hooks.enableAutomaticContextPropagation();
    }

}

@RestController
class WebFluxController {

    Logger log = LoggerFactory.getLogger(WebFluxController.class);

    final WebClient webClient;

    WebFluxController(WebClient.Builder webClientBuilder) {
        this.webClient =
                webClientBuilder
//                        .clientConnector(new JettyClientHttpConnector())
//                        .clientConnector(new JdkClientHttpConnector())
                        .baseUrl("https://example.com")
                        .build();
    }

    @GetMapping("/webClient")
    public Mono<String> webClient() {
        log.info("Inside controller");
        return this.webClient
                .get()
                .retrieve()
                .toBodilessEntity()
                .doOnNext(r -> log.info("Missing trace-id"))
                .map(r -> r.getStatusCode().toString());
    }
}

With the application.yaml containing:

logging:
    pattern:
        level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"

In case of the commented lines (JdkClientHttpConnector|JettyClientHttpConnector) the problem does not expose itself with reactor-core 3.5.5 thanks to https://github.com/reactor/reactor-core/pull/3418.

For JdkClientHttpConnector, the resulting response is delivered as a CompletionStage wrapped into a Mono: https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/JdkClientHttpConnector.java#L107

In this case, we treat the CompletionStage as a foreign Thread-changing environment and restore the ThreadLocal values.

For JettyClientHttpConnector, we're dealing with a foreign Publisher, which is also wrapped into a Mono and the ThreadLocal values are restored by calling Mono.fromDirect(): https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/JettyClientHttpConnector.java#L129

The default connector for reactor-netty deals with reactor types, which assume to be aware of ThreadLocal restoration, but they are not in case of FluxReceive: https://github.com/reactor/reactor-netty/blob/1.0.x/reactor-netty-core/src/main/java/reactor/netty/channel/FluxReceive.java#L413

When any signal is delivered as a result of Netty handling, it should be handling the ThreadLocal restoration.

One way to do it would be to implement the restoration in reactor-netty. Another would be to wrap the result with a contextWrite(Function.identity()) in spring-framework codebase, e.g. here: https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/ReactorClientHttpConnector.java#L119.

IMO, reactor-netty should implement this restoration. If someone deals with reactor-netty's HttpClient directly, the same issue would need to be handled. If the restoration is performed on reactor-netty side if the context-propagation library is available and automatic context propagation is enabled, any downstream users wouldn't need to worry about this concern.

davidmelia commented 1 year ago

Hi guys, any more movement on this?

Is this something we will just live with by adding contextWrite(Function.identity()) or is it a bug?

Thanks

chemicL commented 1 year ago

@davidmelia we brainstormed on this and the course of action is that we intend do handle this in reactor-core: https://github.com/reactor/reactor-core/issues/3478. Once this work is ready, depending on whether other projects need to do anything special, we'll create more issues. https://github.com/reactor/reactor-core/issues/3468 contains the current roadmap for the context-propagation effort related to the Hooks.enableAutomaticContextPropagation feature.

bclozel commented 1 year ago

I'm closing this issue in favor of the reactor issues listed above. We'll reopen this if something needs to be done here.

agorbachenko commented 1 year ago

So .tap() is a workaround to restore context in the case of .releaseBody():

.exchangeToMono(response -> response.releaseBody()
                .tap(contextView -> new DefaultSignalListener<>() {
                })

I have tried this but with no success: cc @chemicL

One way to do it would be to implement the restoration in reactor-netty. Another would be to wrap the result with a contextWrite(Function.identity()) in spring-framework codebase, e.g. here: https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/ReactorClientHttpConnector.java#L119.
chemicL commented 1 year ago

@agorbachenko I don't have the context about what you are doing in your application around this code. Please create a minimal reproducer and I can try to offer my help via the Micrometer Slack, where this conversation started.

Using tap the way you did is not the same as contextWrite(Function.identity()). However, where I tested it, it does the job for my scenario (tap is also a barrier for Context, as this operator can alter it). The flow can depend on the surrounding operators, so let's have a concrete example to discuss.

chemicL commented 10 months ago

For those coming to this issue, reactor-core 3.6.0 brings a solution to it. The problem is described in our blog post.