Micronaut's ServerRequestContext may get corrupted

stepanv commented 1 year ago

Hi team,

we stumbled on an intermittent bug that ServerRequestContext may get corrupted. By corrupted I mean that it is still available but not in the expected state (e.g., the object I store there is later found missing).

It happens only under given conditions (see also a reproducer bellow):

@Controller has a method returning a Mono<HttpResponse<Flux<?>>>
A thread switch (e.g., due to an async call) occurs during the execution of the enclosing Mono
A thread switch (e.g., due to an async call) occurs in the execution of the inner Flux
A higher number of parallel request is being served

Why do we do the 2 layer reactive types controller?

We need the enclosing Mono to reactively decide (because it depends on a result from another backend) whether we want to return 200 and continue processing the request OR to return 4xy error.
When we return 200 we still need to reactively return a series of objects, some get returned sooner other later (e.g., after another backend call)
After all, it seems to be supported by Micronaut as it works under a lower load :)

Steps to Reproduce

Following @MicronautTest annotated test demonstrates the issue:

import io.micronaut.http.HttpRequest;
import io.micronaut.http.HttpResponse;
import io.micronaut.http.annotation.Controller;
import io.micronaut.http.annotation.Get;
import io.micronaut.http.client.HttpClient;
import io.micronaut.http.client.StreamingHttpClient;
import io.micronaut.http.client.annotation.Client;
import io.micronaut.runtime.http.scope.RequestScope;
import io.micronaut.test.extensions.junit5.annotation.MicronautTest;
import jakarta.inject.Inject;
import lombok.Data;
import lombok.NonNull;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.ValueSource;
import reactor.core.publisher.Flux;
import reactor.core.publisher.Mono;
import reactor.core.scheduler.Schedulers;
import reactor.test.StepVerifier;

import java.util.concurrent.Executors;

@MicronautTest
class DisappearingContextTest {

    @RequestScope
    @Data
    static class ContextBean {
        private volatile Boolean flag;
    }

    @Controller("/reproducer")
    static class EntryController {

        @Inject
        @Client("/control")
        HttpClient client;

        @Inject
        private ContextBean contextBean;

        @Get("/mono-http-response-flux")
        public Mono<HttpResponse<Flux<String>>> monoHttpResponseFlux() {
            return Mono.from(client.exchange("/first", String.class))
                    .map(first -> {
                        contextBean.setFlag(true);

                        var flux = Mono.from(client.exchange("/second", String.class))
                                .flatMapMany(second -> {
                                    if (contextBean.getFlag() == null) {
                                        throw new IllegalStateException("!!CONTEXT IS CORRUPTED!!");
                                    }

                                    return Flux.just("ok", first.body(), second.body());
                                })
                                .onErrorResume(Exception.class,
                                        e -> Mono.just("error: " + e.getMessage()));

                        return HttpResponse.ok(flux);
                    });
        }
    }

    @Controller("/control")
    static class DummyController {

        @Get("/first")
        Mono<String> first() {
            return Mono.just("first");
        }

        @Get("/second")
        Mono<String> second() {
            return Mono.just("second");
        }
    }

    @Inject
    @Client("/reproducer/mono-http-response-flux")
    StreamingHttpClient client;

    @NonNull
    private Mono<String> callEndpoint() {
        return Mono.from(client.retrieve(HttpRequest.GET("/"), String.class));
    }

    @Test
    void single() {
        var mono = callEndpoint();

        StepVerifier.create(mono)
                .expectNext("[ok,first,second]")
                .verifyComplete();
    }

    @ParameterizedTest
    @ValueSource(ints = {1, 2, 4, 6, 8, 10, 15, 20, 30, 60})
    void parallel(int parallel) {
        var executor = Executors.newFixedThreadPool(parallel);

        var strings = Flux.range(0, parallel)
                .publishOn(Schedulers.fromExecutor(executor))
                .flatMap(num -> callEndpoint());

        StepVerifier.create(strings)
                .thenConsumeWhile(v -> v.startsWith("[ok"))
                .verifyComplete();
    }
}

Expected Behaviour

The tests should all suceed, especially the parallel ones (see the parallel() test with higher number denoting the number of parallel executions).

Actual Behaviour

For higher volume of parallel requests, Micronaut's ServerRequestContext (which is what @RequestScope is founded on) may get corrupted. In the above described reproducer, the corruption is detected when a set field of a @RequestScope bean is later found as still null. The same problem can be demonstrated by setting an attribute ServerRequestContext.currentRequest().get().setAttribute() which is later intermittently found as not present.

Environment Information

Operating System: Darwin Kernel Version 22.1.0
Micronaut Version: 3.5.1
JDK Version: openjdk 17.0.3 2022-04-19 LTS

Example Application

If needed, I can create a test in groovy+spock in micronaut-core codebase as a full reproducer.

graemerocher commented 1 year ago

a full reproducer would be good. Thanks

stepanv commented 1 year ago

ok, I'll prepare some in next hours, thx @graemerocher for such prompt response!

stepanv commented 1 year ago

@graemerocher I added 2 PRs:

for 3.8.x where the test is indeed failing: https://github.com/micronaut-projects/micronaut-core/pull/8343
surprise, for 4.0.x it seems to be running ok: https://github.com/micronaut-projects/micronaut-core/pull/8344

So now the obvious question:

when can we expect the release of Micronaut 4?
do we need to anything about the 3.x versions?

Thx!

stepanv commented 1 year ago

I do confirm that when using Flux.deferContextual() and Mono.deferContextual() to access the HttpRequest and its attributes, nothing gets wrong in my reproducer even with Micronaut 3.8.

So the question is what to do with @RequestScope on Micronaut 3.x which is using unsupported (buggy) ServerRequestContext.currentRequest()?

graemerocher commented 1 year ago

I would say we should improve the documentation state that @RequestScope can only be used when using imperative / blocking code since its reliance on threadlocals mean it will never work well with reactive. It may "work" by chance in 4.0.x but I have no doubt there is another combination of reactive operators that will break it since reactive is fundamentally incompatible with thread local usage.

stepanv commented 1 year ago

By any chance, isn't it possible to use Reactor's Hooks to always align the ThreadLocal value with what is in the Reactor's context? This way the injection framework should be able to pick up correct instance from the ThreadLocal. Just brainstorming...

stepanv commented 1 year ago

@graemerocher I found out your commit https://github.com/micronaut-projects/micronaut-core/commit/79c56c65b40ca361904f2a723041379b15c7f26f on 4.0.x where you removed the ReactorInstrumentation actually fixes the problem. I'm unable to run the tests locally so I'm not sure if something else gets broken. See my PR https://github.com/micronaut-projects/micronaut-core/pull/8413 for branch 3.7.x.

Anyway, it would be great if Micronaut was capable (or at least tried) of maintaining the correct state of the ThreadLocals. So far it worked quite well. Thanks to that the application code is much cleaner and much better testable. Additionally there are things that are impossible to achieve without ThreadLocals (e.g., logging with MDC contextual info).

stepanv commented 1 year ago

nevermind, without the ReactorInstrumentation many other things stop working

yawkat commented 1 year ago

TL simply cannot work well in pooled, asynchronous situations (like reactive). It's a losing battle. It will be better with virtual threads.

stepanv commented 1 year ago

In the same time, you guys were giving a lot of effort to make Micronaut's TL based RequestScope work even in the reactive world. Frankly, this is one of the reasons to use Micronaut. It is hacky and possibly not working under every scenario, but it's worth the effort. Look at vmware, they hacked the x86 arch to make virtualization work and it's a billion $ worth company.

Not sure what is the plan for the Micronaut 4.0.x but we need this feature to be working in 3.x. Our production is intermittently failing when under a load.

To mitigate the problem, I found out my reproducer gets fixed when the TL is set through the Reactor Context. It even seems like you guys intended that with the ReactiveInvocationInstrumenterFactory but didn't finish it. Please take a look at https://github.com/micronaut-projects/micronaut-core/pull/8421 PR and let me know what you think. cc @graemerocher , @yawkat , @dstepanov

micronaut-projects / micronaut-core