
Deeply nested capturing lambda sets #3582

Closed: ayazhafiz closed this issue 2 years ago

ayazhafiz commented 2 years ago

Over the past couple of weeks, we've increasingly observed pathological compiler performance in the presence of deeply nested, capturing lambda sets. See #3449 for one report, and this Zulip thread for a longer discussion.

The pathological performance occurs when there is a long enough chain of lambdas called in sequence, where each lambda captures the next lambda in the chain (and, by extension, its transitive closure). This issue is opened to form a technical plan for addressing the problem, and to coordinate our work on tackling it.

As an example, take the lazy definition of Effect.after:

Effect.after = \@Effect effect, toEffect ->
  @Effect \{} ->
    when toEffect (effect {}) is
      @Effect thunk -> thunk {}

Notice that Effect.after returns an arrow that captures both of its parameters, effect and toEffect. Thus the following program

Effect.after (getLine) \line ->
    Effect.after (putLine line) \{} ->
        Effect.always {}

has the elaboration

Effect.after (getLine) \line -[f2]->          # Effect {} [thunk2 e2 f2] := {} -[thunk2 e2 f2]-> {}
    Effect.after (putLine line) \{} -[f1]->   # e2 = Effect {} [thunk1 e1 f1] := {} -[thunk1 e1 f1]-> {}
        Effect.always {}                      # e1 = Effect {} [always] := {} -[always]-> {}

Notice that e2 contains everything e1 contains, and the top-level lambda set contains everything e2 contains. So, the total size of the lambda set types (across the whole program) is quadratic in the depth of the longest chain of captures that capture other capturing lambda sets. That's too many "capture"s in a row, so let's just call this behavior a "lambda capture-chain".
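
To make the quadratic count concrete (a rough estimate, assuming, as in the elaboration above, that the lambda set at depth i embeds every set below it): the set at depth i has size on the order of d - i + 1, so the total size across a chain of depth d is about

1 + 2 + ... + d = d(d + 1) / 2

which is quadratic in d.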

You can see that the total size of the capture-chains will grow to be exponential in the presence of branches: one branch would make the total size 4^d, two branches would make the total size 8^d, and so on, where d is the depth of the longest capture-chain.
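
For instance, here is a hypothetical variation of the earlier program (the if and Str.isEmpty are illustrative additions, not from the original report) in which a single branch point gives the outer closure two capture-chains to account for, one per arm:

Effect.after getLine \line ->
    if Str.isEmpty line then
        Effect.after (putLine "line was empty") \{} ->
            Effect.always {}
    else
        Effect.after (putLine line) \{} ->
            Effect.always {}

Each arm contributes a full capture-chain to the enclosing lambda set, so every additional branch point along the chain multiplies the number of chains the outer lambda sets have to carry.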

The performance here really is quite poor. As of #2226, the False interpreter now takes 2 seconds to compile on my 2021 M1. @Qqwy has observed much worse performance in #3449, where a parser combinator takes hours to compile, if it compiles at all.

Observations

I'll enumerate some observations I've previously made in investigating this problem in the context of #2226.

Solutions

Here are some options I think will mitigate or eliminate this problem, in what I consider to be decreasing order of effectiveness.

  1. Layout caching - properly cache layouts based on the root pointer of a type variable in the type forest, so that layout generation reflects the structure of the unification forest (a minimal sketch of this appears after this list). A natural question is when to invalidate the cache. The simple answer is "whenever we finish specializing a particular type"; while this is correct, it may reduce the effectiveness of caching in the current model of specialization - see point (3) below. There was also a prior attempt to cache layouts based on type roots that was unsuccessful, but I do not have context on why; @folkertdev and @rtfeldman may have more insight.
    • Note that caching might break down due to let-generalization, but let-generalization (or rather let-specialization) also causes disjoint trees in the unification forest, and that so far has not appeared to be a problem for the type solver. In general my hope is that if we can make layout generation look just like type inference in terms of data structures, this problem will dissipate.
  2. Variable reuse in specialization storage - as mentioned above, we don't reuse variables when they are exported into Storage Subs (the storage for external specializations), nor do we reuse variables when they are imported from Storage Subs for specialization into a module. We should do this, as it will avoid fragmenting the unification forest. Combined with point (3), I think this will be especially powerful.
  3. Sequence specializations - figure out a way to specialize functions with related types in sequence. In our original example, we'd like the Effect module to specialize the outermost Effect.after call, then the inner Effect.after call, then the last Effect.always call, in sequence and without any layout cache invalidation, in order to maximize reuse of cached layouts and avoid rewalking type trees. This is tricky to get right, but the fact that the type of Effect.always is nested in the type of the inner Effect.after call, which is nested in the type of the outermost Effect.after call, gives us a starting point - at the very least you can walk type trees to determine the desired specialization order. How to do this more efficiently remains unclear in my head, but there is probably a good way, especially since the syntax tells us how we should expect the types to be nested.
  4. Layout struct-of-arrays - I don't think this will affect the exponential layout-generation problems as much as the other suggested options, but it is an important thing to have regardless. By making the data representation more dense we can reduce cache misses due to distant pointer chases, even in a dense arena.
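
To make option (1) more concrete, below is a minimal Rust sketch of root-keyed layout caching with the simple invalidation policy described in that item. The names here (Subs, Variable, Layout, LayoutCache) are stand-ins for illustration, not the compiler's actual API:

use std::collections::HashMap;

type Variable = usize;

#[derive(Clone)]
struct Layout; // stand-in for the real layout representation

struct Subs {
    // parent pointers of the union-find forest over type variables
    parents: Vec<Variable>,
}

impl Subs {
    // Find the representative (root) of `var`, with path compression,
    // so that all variables unified together share a single cache key.
    fn root(&mut self, var: Variable) -> Variable {
        let parent = self.parents[var];
        if parent == var {
            var
        } else {
            let root = self.root(parent);
            self.parents[var] = root;
            root
        }
    }
}

struct LayoutCache {
    cache: HashMap<Variable, Layout>,
}

impl LayoutCache {
    // Key the cache on the root, not the raw variable, so that layout
    // generation sees exactly the sharing the unification forest has.
    fn get_or_compute(
        &mut self,
        subs: &mut Subs,
        var: Variable,
        compute: impl FnOnce(&mut Subs, Variable) -> Layout,
    ) -> Layout {
        let root = subs.root(var);
        if let Some(hit) = self.cache.get(&root) {
            return hit.clone();
        }
        let layout = compute(subs, root);
        self.cache.insert(root, layout.clone());
        layout
    }

    // The simple, correct-but-conservative policy from the text: drop
    // everything whenever we finish specializing a particular type.
    fn invalidate_all(&mut self) {
        self.cache.clear();
    }
}

The intent is the one stated in point (1): if layout generation keys on the same roots the type solver uses, two occurrences of the same capture-chain are walked once rather than once per occurrence.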

Finally, I'd like to note two things:

ayazhafiz commented 2 years ago

Kicking this off today. I'm going to try to track work and various investigations for this in https://github.com/orgs/roc-lang/projects/1. Please let me know if you're interested in helping out with the investigation and resolution of this!

ayazhafiz commented 2 years ago

I've archived https://github.com/orgs/roc-lang/projects/1/views/1. After https://github.com/roc-lang/roc/pull/3981 lands, I would like to mark this project complete.