
Deeply nested capturing lambda sets #3582

Closed: ayazhafiz closed this issue 2 years ago

ayazhafiz commented 2 years ago

Over the past couple of weeks, we've increasingly observed pathological compiler performance in the presence of deeply nested, capturing lambda sets. See #3449 for one report, and this Zulip thread for a longer discussion.

The pathological performance occurs when there is a long enough chain of lambdas called in sequence, where each lambda captures the next lambda in the chain (and, by extension, its transitive closure). This issue is opened to form a technical plan for addressing the problem, and to coordinate our work on tackling it.

As an example, take the lazy definition of Effect.after:

Effect.after = \@Effect effect, toEffect ->
  @Effect \{} ->
    when toEffect (effect {}) is
      @Effect thunk -> thunk {}

Notice that Effect.after returns an arrow that captures both of its parameters, effect and toEffect. Thus the following program

Effect.after (getLine) \line ->
    Effect.after (putLine line) \{} ->
        Effect.always {}

has the elaboration

Effect.after (getLine) \line -[f2]->          # Effect {} [thunk2 e2 f2] := {} -[thunk2 e2 f2]-> {}
    Effect.after (putLine line) \{} -[f1]->   # e2 = Effect {} [thunk1 e1 f1] := {} -[thunk1 e1 f1]-> {}
        Effect.always {}                      # e1 = Effect {} [always] := {} -[always]-> {}

Notice that e2 contains everything e1 contains, and the top-level lambda set contains everything e2 contains. So, the total size of the lambda set types (across the whole program) is quadratic in the depth of the longest chain of captures that capture other capturing lambda sets. That's too many "capture"s in a row, so let's just call this behavior a "lambda capture-chain".
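
To make the quadratic count concrete (a rough estimate, assuming, as in the elaboration above, that the lambda set at depth i embeds every set below it): the set at depth i has size on the order of d - i + 1, so the total size across a chain of depth d is about

1 + 2 + ... + d = d(d + 1) / 2

which is quadratic in d.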

You can see that the total size of the capture-chains will grow to be exponential in the presence of branches: one branch would make the total size 4^d, two branches would make the total size 8^d, and so on, where d is the depth of the longest capture-chain.
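
For instance, here is a hypothetical variation of the earlier program (the if and Str.isEmpty are illustrative additions, not from the original report) in which a single branch point gives the outer closure two capture-chains to account for, one per arm:

Effect.after getLine \line ->
    if Str.isEmpty line then
        Effect.after (putLine "line was empty") \{} ->
            Effect.always {}
    else
        Effect.after (putLine line) \{} ->
            Effect.always {}

Each arm contributes a full capture-chain to the enclosing lambda set, so every additional branch point along the chain multiplies the number of chains the outer lambda sets have to carry.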

The performance here really is quite poor. As of #2226, the False interpreter now takes 2 seconds to compile on my 2021 M1. @Qqwy has observed much worse performance in #3449, where a parser combinator takes hours to compile, if it compiles at all.

Observations

I'll enumerate some observations I've previously made in investigating this problem in the context of #2226.

Solutions

Here are some options I think will mitigate or eliminate this problem, in what I consider to be decreasing order of effectiveness.

  1. Layout caching - properly cache layouts based on the root pointer of a type variable in the type forest, so that layout generation reflects the structure of the unification forest (a minimal sketch of this appears after this list). A natural question is when to invalidate the cache. The simple answer is "whenever we finish specializing a particular type"; while this is correct, it may reduce the effectiveness of caching in the current model of specialization - see point (3) below. There was also a prior attempt to cache layouts based on type roots that was unsuccessful, but I do not have context on why; @folkertdev and @rtfeldman may have more insight.
    • Note that caching might break down due to let-generalization, but let-generalization (or rather let-specialization) also causes disjoint trees in the unification forest, and that so far has not appeared to be a problem for the type solver. In general my hope is that if we can make layout generation look just like type inference in terms of data structures, this problem will dissipate.
  2. Variable reuse in specialization storage - as mentioned above, we don't reuse variables when they are exported into Storage Subs (the storage for external specializations), nor do we reuse variables when they are imported from Storage Subs for specialization into a module. We should do this, as it will avoid fragmenting the unification forest. Combined with point (3), I think this will be especially powerful.
  3. Sequence specializations - figure out a way to specialize functions with related types in sequence. In our original example, we'd like the Effect module to specialize the outermost Effect.after call, then the inner Effect.after call, then the last Effect.always call, in sequence and without any layout cache invalidation, in order to maximize reuse of cached layouts and avoid rewalking type trees. This is tricky to get right, but the fact that the type of Effect.always is nested in the type of the inner Effect.after call, which is nested in the type of the outermost Effect.after call, gives us a starting point - at the very least you can walk type trees to determine the desired specialization order. How to do this more efficiently remains unclear in my head, but there is probably a good way, especially since the syntax tells us how we should expect the types to be nested.
  4. Layout struct-of-arrays - I don't think this will affect the exponential layout-generation problems as much as the other suggested options, but it is an important thing to have regardless. By making the data representation more dense we can reduce cache misses due to distant pointer chases, even in a dense arena.
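
To make option (1) more concrete, below is a minimal Rust sketch of root-keyed layout caching with the simple invalidation policy described in that item. The names here (Subs, Variable, Layout, LayoutCache) are stand-ins for illustration, not the compiler's actual API:

use std::collections::HashMap;

type Variable = usize;

#[derive(Clone)]
struct Layout; // stand-in for the real layout representation

struct Subs {
    // parent pointers of the union-find forest over type variables
    parents: Vec<Variable>,
}

impl Subs {
    // Find the representative (root) of `var`, with path compression,
    // so that all variables unified together share a single cache key.
    fn root(&mut self, var: Variable) -> Variable {
        let parent = self.parents[var];
        if parent == var {
            var
        } else {
            let root = self.root(parent);
            self.parents[var] = root;
            root
        }
    }
}

struct LayoutCache {
    cache: HashMap<Variable, Layout>,
}

impl LayoutCache {
    // Key the cache on the root, not the raw variable, so that layout
    // generation sees exactly the sharing the unification forest has.
    fn get_or_compute(
        &mut self,
        subs: &mut Subs,
        var: Variable,
        compute: impl FnOnce(&mut Subs, Variable) -> Layout,
    ) -> Layout {
        let root = subs.root(var);
        if let Some(hit) = self.cache.get(&root) {
            return hit.clone();
        }
        let layout = compute(subs, root);
        self.cache.insert(root, layout.clone());
        layout
    }

    // The simple, correct-but-conservative policy from the text: drop
    // everything whenever we finish specializing a particular type.
    fn invalidate_all(&mut self) {
        self.cache.clear();
    }
}

The intent is the one stated in point (1): if layout generation keys on the same roots the type solver uses, two occurrences of the same capture-chain are walked once rather than once per occurrence.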

Finally, I'd like to note two things:

ayazhafiz commented 2 years ago

Kicking this off today. I'm going to try to track work and various investigations for this in https://github.com/orgs/roc-lang/projects/1. Please let me know if you're interested in helping out with the investigation and resolution of this!

ayazhafiz commented 2 years ago

I've archived https://github.com/orgs/roc-lang/projects/1/views/1. After https://github.com/roc-lang/roc/pull/3981 lands, I would like to mark this project complete.