reactor / reactor-core

Non-Blocking Reactive Foundation for the JVM
http://projectreactor.io
Apache License 2.0
4.91k stars 1.19k forks source link

Uncontrolled Recursion causing StackOverflowError #3212

Closed sfc-gh-aramarathinam closed 1 month ago

sfc-gh-aramarathinam commented 1 year ago

StackOverflowError. Uncontrolled recursion in reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves()

Error Stack: [java.lang.Iterable.forEach():74 reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves():433 reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1():433 java.lang.Iterable.forEach():75 reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves():433 reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1():433 [...]

Expected Behavior

Stop the recursion after a certain depth and bubble the original error.

Actual Behavior

Stack overflow.

chemicL commented 1 year ago

Thank you for the report, can you provide a reproducer to help better understand the circumstances when this happens?

sfc-gh-aramarathinam commented 1 year ago

It's not clear how we repro this yet. Saw this error in the logs.

chemicL commented 1 year ago

Please reopen if you manage to create a reproducer.

lucamarchi commented 1 year ago

Hi everyone, the same error has been observed today. Unluckly, we are not able to reproduce it.

Stacktrace:

2023-02-23 15:18:38.369 ERROR 1 --- [undedElastic-12] Schedulers () : Scheduler worker in group main failed with an uncaught exception

java.lang.StackOverflowError: null
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.lambda$findPathToLeaves$1(FluxOnAssembly.java:433)
    at java.base/java.lang.Iterable.forEach(Iterable.java:75)
    at reactor.core.publisher.FluxOnAssembly$OnAssemblyException.findPathToLeaves(FluxOnAssembly.java:433)

Environment:

rohitdumbre86 commented 5 months ago

We also see the same error in our framework that uses reactive core 3.4.34. Can we get more insight on this?

chemicL commented 5 months ago

If you are able to identify the pipeline that causes this, please share. The stackoverflow might be just a side effect of an incorrectly crafted pipeline, e.g. assemblying a reactive chain using a loop and stacking operators instead of using operators that perform looping. Without a reproducer there's no point in investigating. The code could potentially avoid recursion but that would turn into using a queue-based looping approach, which can cause other issues such as heap pollution, so it won't necessarily resolve the original problem.

rohitdumbre86 commented 5 months ago

@chemicL A little more debugging found that in reactor core. The FluxOnAssembly#findPathToLeaves is a recursive function and is invoked by getMessage(). We see these exceptions thrown only on error i.e. whenever i believe getMessage() is invoked. We were not able to recreate it.

alex-suchman commented 5 months ago

@chemicL I was able to reproduce the issue .

Seems that each time we attempt to print an error form mono.error , reactor core tries to build stacktrace string , by recursively walking up mono chain . in case mono chain is very long , it would end up with stackoverflow error.

the following code snipped can quickly reproduce the error

Mono<Object> mono = Mono.just(new Object());
long i=0;
while(i<10000) {
    mono = mono.flatMap(o -> Mono.just(new Object()));
    i++;
}

return mono.flatMap(o -> Mono.error(new Exception()));

(NOTE - exact iterations count might which trigger stackoverflow error depend on jvm stack size )

adding a configurable limit to the recursion level should provide a quick protection against such issue .

chemicL commented 5 months ago

I ran the above and I get a java.lang.StackOverflowError (SOE) with repeated:

at reactor.core.publisher.MonoFlatMap$FlatMapMain.request(MonoFlatMap.java:194)

or

at reactor.core.publisher.MonoFlatMap$FlatMapMain.cancel(MonoFlatMap.java:199)

Depending on the loop size chosen (5000, 10_000). It's not coming from FluxOnAssembly#findPathToLeaves in my case.

In order to get that I used -Dreactor.trace.operatorStacktrace=true and reworked the example:

    public static void main(String[] args) {
        Mono<Object> mono = Mono.error(new RuntimeException("oops"));
        long i = 0;
        int loops = 2850;
        while (i < loops) {
            mono = mono.flatMap(o -> Mono.just(new Object()));
            i++;
        }

        mono.flatMap(o -> Mono.error(new Exception()))
            .doOnError(e -> System.out.println(e.getMessage()))
            .subscribe();
    }

For loops = 2850 sometimes there's no SOE and sometimes it triggers so it seems to be around the threshold.

As I said before, it looks like the FluxOnAssembly#findPathToLeaves SOE is just manifesting itself as a result of something else that is broken and it's a pipeline that is too long. Assembling a pipeline using a loop like the above might not be the best choice. Perhaps it is justified in certain scenarios, but the one presented is not a justification for making adjustments in my view.

adding a configurable limit to the recursion level should provide a quick protection against such issue

Where would this be configurable? I can't imagine anyone actually using that configuration - if they had to, they'd already know what they are dealing with. java.lang.StackOverflowError fires exactly at the right limit -> so should we aim at just one method evaluation before it gets triggered? I think it's unfeasible.

Assuming there is a valid, reasonable reactive pipeline which causes this to be triggered, what would be the expected behaviour?

I reworked the recurrent findPathToLeaves method to use a loop:

    void findPathToLeaves(ObservedAtInformationNode node, List<List<ObservedAtInformationNode>> rootPaths) {
            Queue<ObservedAtInformationNode> items = new LinkedList<>();

            items.add(node);

            do {
                if (node.children.isEmpty()) {
                    List<ObservedAtInformationNode> pathForLeaf = new LinkedList<>();
                    ObservedAtInformationNode traversed = node;
                    while (traversed != null && traversed != root) {
                        pathForLeaf.add(0, traversed);
                        traversed = traversed.parent;
                    }
                    rootPaths.add(pathForLeaf);
                }
                items.addAll(node.children);
            } while ((node = items.poll()) != null);
        }

Now the threshold for SOE is around 4190 and comes from the regular reactive operations (request).

Even though it looks like a solution I am not convinced that it makes sense to change the implementation. For this case it resolves the problem but might introduces a cost of allocating a queue backed by heap memory instead of using the stack which could outperform potential optimizations that we could introduce. Although, it might be completely the opposite :)

As this is of low priority, I'd consider introducing this change if a JMH benchmark was provided. Is anyone willing to validate whether this implementation is acceptable for regular pipelines?

rohitdumbre86 commented 5 months ago

@chemicL Just for clarification, what exactly are you referring to long pipeline? So something like this would qualify as a long pipeline? E.g.

Flux.just(someString)
       .map(s->doSomething1())
      .map(s->doSomething2())
      .map(s->doSomething3())
      .map(s->doSomething4())
      .map(s->doSomething5())
  and doSomething1() is a `Mono.just("string").map(s->callSomething()).map(s->callSomething())`
alex-suchman commented 5 months ago

@chemicL thanks for looking into it .

the current state is in case mono chain is too long , the system behaves unexpectedly , as I assume that stackoverflow error would not be a desired behaviour in such case.

As I said before, it looks like the FluxOnAssembly#findPathToLeaves SOE is just manifesting itself as a result of something else that is broken and it's a pipeline that is too long. Assembling a pipeline using a loop like the above might not be the best choice. Perhaps it is justified in certain scenarios, but the one presented is not a justification for making adjustments in my view.

I assume that in real world scenario we can reach such cases due to regular long call graph , common to enterprise apps :

libA -> calls lib B -> calls lib C -> and so on .. e.g. in java spring itis not uncommon to have pretty long call graphs.

Where would this be configurable? I can't imagine anyone actually using that configuration - if they had to, they'd already know what they are dealing with. java.lang.StackOverflowError fires exactly at the right limit -> so should we aim at just one method evaluation before it gets triggered? I think it's unfeasible.

I am guessing some reasonable default value with an optional escape hatch for a user to change it via optional config if someone really knows what he's doing .

most users probably have no need for a 10K lines long stack trace.

Even though it looks like a solution I am not convinced that it makes sense to change the implementation. For this case it resolves the problem but might introduces a cost of allocating a queue backed by heap memory instead of using the stack which could outperform potential optimizations that we could introduce. Although, it might be completely the opposite :)

what would be expected behaviour in such cases. I assume that currently there are no limits enforced on depth of mono chains , so in theory during stack trace creation , we could deal with any length of data .

there should probably be some kind of graceful degradation instead of current unhandled exception . e.g. retuning truncated data in case of getMesssage .

again , thank you for looking into it !

chemicL commented 1 month ago

@rohitdumbre86 I assume you mean .flatMap(s -> doSomething1()). That doesn't like like a "long pipeline" - it's 6 operators in the main chain and 3 in the internal one of flatMap. Assuming all the map operators were to be flatMap we have 6*3 = 18 operators, which is reasonable and won't call this issue to surface.

@alex-suchman and @rohitdumbre86

I mean artificially attaching operators in a loop-like construct can be an issue if the number of operators in a chain grows beyond 1000 (depending on the available resources of course it will be a different number, so it's just an approximation).

I'm closing the issue as I don't feel it's justified to make any adjustments. Were you to provide a non-contrived reproducer where the logic is a real-world scenario that works perfectly fine on a particular machine but FluxOnAssembly#findPathToLeaves causes OOM, I think we could consider a contribution to address this issue. However, it is not a priority for the team right now and we won't address it at this time. The contribution should also incorporate a JMH benchmark with results and memory characteristics highlighting the chosen approach is not worse performance-wise from the existing solution.