Closed davidmelia closed 1 year ago
Hello @davidmelia , thanks for raising this issue.
The tracing context is made available during the observation, not after. In this case, the toBodilessEntity
(and other "final" methods in the fluent API) returns the result of the HTTP client operation, here a Mono<ResponseEntity<Void>>
. We assume that when you're given the actual response, the client observation is finished and the metric/trace is recorded. Extending the tracing context while you're map
-ing on the result would sound like a bug: the time spent on transforming the response into another object should not count against the HTTP client observation.
The workaround you're sharing is not a workaround in my opinion: this merely enables the context propagation for ThreadLocal - here you should see the tracing context for the HTTP server observation, which is still ongoing at that stage since this is being executed within a web controller.
Am I missing something?
I don't quite follow you but we use trace IDs in production to log trace requests. I expect the server trace ID to be propagated through the stack. Using Spring Cloud Sleuth this was the case and the above example preserved the 'tid'. After upgrading to Spring Webflux 3 I can see trace IDs disappearing and I have narrowed down to the use of toBodilessEntity.
This is either a bug OR WebFlux users who use enableAutomaticContextPropagation should not use toBodilessEntity as it wipes out log trace IDs.
Sorry, I misunderstood the issue - I thought this was about the client trace, not about the server trace. Let me have a look into this.
I had a deeper look and I did reproduce the problem.
I've checked the reactive pipeline for the toBodilessEntity
method and it doesn't seem that we're disconnecting the chain in any way. The main difference with other operators could be here.
@davidmelia I've modified your sample application to ensure that the current Observation is in the reactor context right after the toBodilessEntity
call and it is the case. At this stage, I was suspecting a subtle bug with the automatic context propagation feature. Upgrading to Spring Boot 3.0.6-SNAPSHOT (and reactor-core 3.5.5-SNAPSHOT) did solve the problem. I'm now seeing:
2023-04-07T11:22:46.381+02:00 INFO [aid=spring-cloud-function-zipkin,tid=642fe0e6e6e02ea934d44e45df0d661e,sid=34d44e45df0d661e] 23300 --- [ctor-http-nio-3] example.DemoController : 1) this has trace id
2023-04-07T11:22:47.156+02:00 INFO [aid=spring-cloud-function-zipkin,tid=,sid=d193b749e4691a89] 23300 --- [ctor-http-nio-3] example.DemoController : current observation name='http.server.requests', contextualName='null', error='null', lowCardinalityKeyValues=[exception='none', method='GET', outcome='SUCCESS', status='200', uri='UNKNOWN'], highCardinalityKeyValues=[http.url='/notWork'], map=[class io.micrometer.tracing.handler.TracingObservationHandler$TracingContext='TracingContext{span=642fe0e6e6e02ea934d44e45df0d661e/34d44e45df0d661e}', class io.micrometer.core.instrument.LongTaskTimer$Sample='SampleImpl{duration(seconds)=0.795679716, duration(nanos)=7.95679716E8, startTimeNanos=5854912936959}', class io.micrometer.core.instrument.Timer$Sample='io.micrometer.core.instrument.Timer$Sample@646c1e62'], parentObservation=null
2023-04-07T11:22:47.158+02:00 INFO [aid=spring-cloud-function-zipkin,tid=642fe0e6e6e02ea934d44e45df0d661e,sid=34d44e45df0d661e] 23300 --- [ctor-http-nio-3] example.DemoController : 2) this does not have a trace id 200 OK
This change is available in my fork. I'm closing this issue as it's probably been fixed in reactor-core directly.
Tagging @chemicL to let him know 👍
@bclozel thanks for letting me know.
Actually, the latest changes in Reactor don't really fix the reported issue. The fact that you introduced tap
operators to the chains make a difference, as now they are considered as Context
-altering barriers, in which case the ThreadLocal
values do get restored. The same as if a contextWrite(Function.identity())
happened before the map
operator.
This helps as the server Context
with an Observation
is attached to the entire chain by the server. However, the response is delivered from Netty on the event loop. If a contextWrite
or tap operator is not along the way, there's nothing to restore in the eyes of reactor-core.
The difference between toEntity
and toBodilessEntity
is that in case of the latter, the response body Mono
is not executed, it is simply ignored:
handleBodyMono(response, response.bodyToMono(bodyClass)
vs
handleBodyMono(response, Mono.<Void>empty())
I'm not yet certain why delivering the entity via the body processing passes the contextWrite
operator and in the case of an empty body it does not, but this needs investigating. I would leave the issue open until we understand what is happening. For now, obviously, the workaround is:
toBodilessEntity().contextWrite(Function.identity())
Reopening so we can investigate. This could be a Framework, Reactor Core or Reactor Netty issue.
The previous suggestion is incorrect, it has nothing to do with using Mono.empty()
in the handleBodyToMono
argument.
The reason for the different behaviour is the fact that upon termination, in case of toEntity
, the Publisher
completion signal travels via responseMono
's contextWrite
operator defined in the exchange()
method, and the completion signal triggers calling the user's code:
@Override
public <T> Mono<ResponseEntity<T>> toEntity(Class<T> bodyClass) {
return this.responseMono.flatMap(response ->
WebClientUtils.mapToEntity(response,
handleBodyMono(response, response.bodyToMono(bodyClass))));
}
The case in toBodilessEntity
is different, because the completion signal doesn't enter the responseMono
any more - there is a new Mono
left to subscribe to after the response is finished in this case, and it's the Mono.just(entity)
created within the last line of the method in thenReturn(entity)
:
@Override
public Mono<ResponseEntity<Void>> toBodilessEntity() {
return this.responseMono.flatMap(response ->
WebClientUtils.mapToEntity(response, handleBodyMono(response, Mono.<Void>empty()))
.flatMap(entity -> response.releaseBody() // this is the
// difference
.onErrorResume(WebClientUtils.WRAP_EXCEPTION_PREDICATE, exceptionWrappingFunction(response))
.thenReturn(entity))
);
}
My current understanding is that the signals delivered from reactor-netty need to always pass a context restoring Subscriber. Whether reactor-netty should add such protection is debatable. I'm also unsure whether reactor-core should do it - we recently introduced protection against foreign Publisher
s that are wrapped using Flux.from()
. In case signals are delivered from a different Thread
, a dedicated Subscriber
restores ThreadLocal
values from the Subscriber
Context
.
The signal to the WebClient's Mono
is delivered from reactor.netty.channel.FluxReceive#onInboundComplete
. There should be a Subscriber
that restores ThreadLocal
s directly after any signal delivered by reactor-netty. Where it should be handled is a matter for discussion.
Here's a minimal reproducer for the issue:
@SpringBootApplication
public class WebFluxApplication {
public static void main(String[] args) {
SpringApplication.run(WebFluxApplication.class, args);
Hooks.enableAutomaticContextPropagation();
}
}
@RestController
class WebFluxController {
Logger log = LoggerFactory.getLogger(WebFluxController.class);
final WebClient webClient;
WebFluxController(WebClient.Builder webClientBuilder) {
this.webClient =
webClientBuilder
// .clientConnector(new JettyClientHttpConnector())
// .clientConnector(new JdkClientHttpConnector())
.baseUrl("https://example.com")
.build();
}
@GetMapping("/webClient")
public Mono<String> webClient() {
log.info("Inside controller");
return this.webClient
.get()
.retrieve()
.toBodilessEntity()
.doOnNext(r -> log.info("Missing trace-id"))
.map(r -> r.getStatusCode().toString());
}
}
With the application.yaml
containing:
logging:
pattern:
level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"
In case of the commented lines (JdkClientHttpConnector
|JettyClientHttpConnector
) the problem does not expose itself with reactor-core 3.5.5 thanks to https://github.com/reactor/reactor-core/pull/3418.
For JdkClientHttpConnector
, the resulting response is delivered as a CompletionStage
wrapped into a Mono
:
https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/JdkClientHttpConnector.java#L107
In this case, we treat the CompletionStage
as a foreign Thread-changing environment and restore the ThreadLocal
values.
For JettyClientHttpConnector
, we're dealing with a foreign Publisher
, which is also wrapped into a Mono
and the ThreadLocal
values are restored by calling Mono.fromDirect()
:
https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/JettyClientHttpConnector.java#L129
The default connector for reactor-netty deals with reactor types, which assume to be aware of ThreadLocal
restoration, but they are not in case of FluxReceive
: https://github.com/reactor/reactor-netty/blob/1.0.x/reactor-netty-core/src/main/java/reactor/netty/channel/FluxReceive.java#L413
When any signal is delivered as a result of Netty handling, it should be handling the ThreadLocal
restoration.
One way to do it would be to implement the restoration in reactor-netty. Another would be to wrap the result with a contextWrite(Function.identity())
in spring-framework codebase, e.g. here: https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/ReactorClientHttpConnector.java#L119.
IMO, reactor-netty should implement this restoration. If someone deals with reactor-netty's HttpClient
directly, the same issue would need to be handled. If the restoration is performed on reactor-netty side if the context-propagation library is available and automatic context propagation is enabled, any downstream users wouldn't need to worry about this concern.
Hi guys, any more movement on this?
Is this something we will just live with by adding contextWrite(Function.identity())
or is it a bug?
Thanks
@davidmelia we brainstormed on this and the course of action is that we intend do handle this in reactor-core: https://github.com/reactor/reactor-core/issues/3478. Once this work is ready, depending on whether other projects need to do anything special, we'll create more issues. https://github.com/reactor/reactor-core/issues/3468 contains the current roadmap for the context-propagation effort related to the Hooks.enableAutomaticContextPropagation
feature.
I'm closing this issue in favor of the reactor issues listed above. We'll reopen this if something needs to be done here.
So .tap()
is a workaround to restore context in the case of .releaseBody()
:
.exchangeToMono(response -> response.releaseBody()
.tap(contextView -> new DefaultSignalListener<>() {
})
I have tried this but with no success: cc @chemicL
One way to do it would be to implement the restoration in reactor-netty. Another would be to wrap the result with a contextWrite(Function.identity()) in spring-framework codebase, e.g. here: https://github.com/spring-projects/spring-framework/blob/main/spring-web/src/main/java/org/springframework/http/client/reactive/ReactorClientHttpConnector.java#L119.
@agorbachenko I don't have the context about what you are doing in your application around this code. Please create a minimal reproducer and I can try to offer my help via the Micrometer Slack, where this conversation started.
Using tap
the way you did is not the same as contextWrite(Function.identity())
. However, where I tested it, it does the job for my scenario (tap
is also a barrier for Context
, as this operator can alter it). The flow can depend on the surrounding operators, so let's have a concrete example to discuss.
For those coming to this issue, reactor-core 3.6.0 brings a solution to it. The problem is described in our blog post.
Hi,
I'm diagnosing a production Spring WebFlux 3.0.5 bug where certain requests are losing their tracing context and I've narrowed this down to the following (example https://github.com/davidmelia/spring-cloud-function-zipkin/tree/bodiless_entity_bug_
where toBodilessEntity loses the tracing context. N.B. I am taking advantage of Hooks.enableAutomaticContextPropagation();
The work around is
I assume toBodilessEntity should keep the trace context?
Thanks