New treatment of relational information in Flambda 2 types

In light of the discussions related to the relations and the realization that this could also be useful to make CSE more robust, I have done some more reseach on the topic and I think that I have a more promising approach than my prototype. This approach takes inspiration from egglog, which combines the best of congruence closure and deductive databases.

This has somewhat grown in scope compared to the initial goal we set with @lthls to extract Is_int and Get_tag relations from the typing env. However, I believe the approach presented below (already discussed with @lthls) would provide a principled and generic solution to many of the issues around Flambda2 losing informations during joins or otherwise due to the aliasing stars not being properly aligned. It also provides a generic interface to introduce additional abstract domains and provides control on how to schedule inter-reductions between domains.

In addition, the approach and underlying data structures can be extended (in a second step) to also provide a principled, and generic implementation of env extensions that would remove the need of threading support for extensions at specific places as was done in #1324.

Details

Roughly, the idea would be to implement a proper deductive database in the typing environment, that can be extended and queried as needed.

We want to store a combination of:

An union-find data structure that associates identifiers to a canonical representative (adapted into the existing Aliases module), and
A set of equations of the form f(x1, ..., xn) = y, where the xi are identifiers (Simples) and y is a value in an arbitrary abstract domain, and f is a function name such as Is_int or Get_tag. Equations are kept in canonical form wrt the union-find, and stored in a logic database (which I will not describe the data-structures here, but it is essentially a lot of tries).

I mention y being an arbitrary abstract domain above because that is one of the powerful aspects of the approach; I will note that one particular domain of interest is the alias domain, in which case y is a (canonical) identifier.

The database is kept canonical using a rebuilding procedure similar to what is used in egglog, wherein we use (database) indices to update the corresponding entries to canonize them. This can cause us to have duplicate entries for a given set of arguments, in which case we perform a meet of the underlying abstract domain (for the alias domain, we record equalities in the union-find and perform a second rebuild to obtain a congruence closure algorithm).

New facts (such as x = Is_int(y)) are added to the database by canonicalizing them in the current environment, then rebuilding the database, which will take care of discovering new aliases/CSE opportunities (e.g. if there is already an entry z = Is_int(y) in the database, we will discover x = z).

Queries, such as finding the set of names X such that y = Is_int(X) for some fixed y, can be made on a rebuilt database by canonicalizing the query first[^0].

Inter-reductions (e.g. when we learn Is_int(x) = 0, we want to update the type of x) are dealt with by registering rules in the database, such as Is_int(x) = 0 ⇒ blocks(x) = Bottom[^1]. These rules are similar to automatic queries and can be run automatically on the new entries in the database only (see the semi-naïve algorithm in the egglog paper).

The elephant in the room is: how do we join two such environments? Logic databases (or databases in general) are not known to deal with disjunction very well. It turns out that we can compute the join of two environments fairly efficiently by leveraging the same operation that is used in queries, i.e. joins (how appropriate).

Ignore for a moment the issue of equality and the union find, i.e. assume that we are joining environments where we only added new facts, but no new equalities. We can use the same technique as with scopes/levels in the existing typing env to only construct "differential databases" for each environment at use that only contain the new facts added since the fork. Once given these differential database, we just have to compute the intersection of facts between all the databases, which can be done by using worst-case optimal join algorithms such as the Leapfrog Triejoin.

Very roughly, Leapfrog Triejoin looks at variables sequentially and computes the intersection per-variable: if the first argument of a function cannot match, it will never look at the second argument.

When joining in this way, we only consider the left-hand side of rules; if we have to join f(a, b) = x with f(a, b) = y, then we will record the entry f(a, b) = z where z is obtained by joining the abstract domains x and y.

If we take into account the union-find and the fact that each environment can add new equalities, things get a bit more complicated. Consider a fact f(a, b, c) = d in a environment at use where a is also equal to a', b to b', and c to c'. At first glance, it looks like we would need to expand this equality to a quickly growing database:

f(a, b, c) = d
f(a, b, c') = d
f(a, b', c) = d
f(a, b', c') = d
f(a', b, c) = d
f(a', b, c') = d
f(a', b', c) = d
f(a', b', c') = d

However, we can leverage sharing between equivalence classes to build compact iterators, and then rely on the leapfrog triejoin to only look at the iterators that are needed. More concretely, the above database, with size O(n^m) where n is the number of variables and m the number of aliases per variable, can be represented in size O(n * m) instead as a trie with sharing:

 a         a'
 |         |
 +----+----+
      |
      v
 b ---+--- b'
 |         |
 +----+----+
      |
      v
 c ---+--- c'
 |         |
 +----+----+
      |
      v
      d

Iterating over all the entries in the database still has complexity O(n^m), but we can rely on the leapfrog triejoin algorithm to eliminate large parts of the iteration by performing the per-argument intersection at each level when computing the join with another database.

The practical complexity is further driven down by the fact that the leapfrog triejoin is a n-way join algorithm[^2], so if an alias does not exist in any of the joined environments it will be immediately eliminated, and the fact that we can compute aliases that are shared between all joined environments before performing the database join (so that some shared equivalence classes need only to be considered once).

Advantages

This approach can be seen as an extension of the current representation, where we have a single unary table type mapping simples to their flambda2 type. A lot of what happens in the typing env currently sort of matches the approach above, but is implemented in an ad-hoc way; notably, aliases are interleaved with regular typing information in a way that makes it hard to perform the efficient join operations described above.

The database approach thus streamlines the existing implementation, and makes it easy to extend it to new domains (whether that is the relational domains for Is_int and Get_tag or other abstract domains such as intervals or bitlists) in a generic way. By providing a clear separation between alias information and other abstract information, and by incorporating alias information in a principled way in the join algorithm, it also solves once and for all ambiguities related to canonicals not being what we expect in one branch or another -- because the result of the join no longer depend on the chosen canonicals (the chosen canonicals might impact the performance of the join, but not the result). In particular, issues such as #3181 should never occur anymore.

This approach also opens up exciting (to me) possibilities by simply encoding appropriate information in tables and registering triggers when queries match; for instance, we discussed with @lthls the possibility of encoding extensions in the following way:

Is_int(x) = 1 ⇒ ext1
Is_int(x) = 0 ⇒ { ext2, Get_tag(x) = 3 ⇒ ext4 }

where the extensions stored on the right-hand side of the ⇒ can be introduced automatically by the database when the query/trigger on the left of the ⇒ become true (and eliminated when it becomes false).

Drawbacks

We need a logic database in the compiler. This is not as big a drawback as it seems at first glance; the underlying data structures are fairly simple (tries) and the relevant code in egglog is on the order of 1k lines of code. We also don't need to implement any sort of parsing or serialization.

Due to the genericity of the approach, there will probably be a slightly higher constant overhead compared to the existing type environment, although the underlying data structures are still similar so that overhead might not even be noticeable. It will be harder to correctly implement shortcuts for specific situations (on the other hand this is good for code quality).

We do need big-endian patricia trees to implement the leapfrog triejoin efficiently, but I think this is something we wanted to do at some point anyways.

Implementation plan

I plan to implement this in a gradual way, in order to minimize the possibility of breakage at each step.

Implement big-endian patricia trees. This is critical for the approach to have good performance, and if for some reason changing the iteration order/compression properties of the patricia trees causes issues we need to re-evaluate.
Implement a proper pre-joining of aliases environment in join_levels. Currently, we do a join of the aliases, but only at the end of the join; the approach outlined above requires this join to be done first. I already have code for this, but there are some performance issues due to the way existential variables are handled in the join (by name, which means that they all must be kept in their equivalence class), so this is interdependent with the next point.
Implement the join procedure described above for the existing type environment, using a single-variable leapfrog join (not a leapfrog triejoin). In particular, this will ensure that the result of joining two existential variables does not depend on the name of the joined variables (currently, we cannot join existential variables with different names). This will allow to ignore existential variables in the join of aliases, because we will now dynamically introduce new names for the existential variables that need to be kept.

At this point, we would have a more robust version of the existing system where the join no longer depends on binding order or on the name of existential variables, but would otherwise behave in an identical way (notably w.r.t relations such as Is_int and Get_tag).

The last two steps are:

Generalize the above system to treat the types environment as a specific table, and introduce new (unary) tables for Is_int and Get_tag that work in a similar way. Remove Is_int and Get_tag information from the types, and look up that information from the separate Is_int and Get_tag tables. I already have a prototype implementation for this step (which was the initial scope of the PR), although it does not use any of the implementation techniques described above. This is probably the step that is most likely to introduce bugs (lose optimizations) and needs to be tested with extra care -- probably with temporary assertions that the new way of getting the information always returns at least as precise information as the old way. At this point, we can also move more unary primitives from CSE to the types.
Generalize the approach by using tries and triejoin to be able to handle functions with more than one argument. At this point, we can migrate more CSE primitives to the types.

[^0]: It is possible to make queries on a non-canonical database without rebuilding the database first, although I am not sure it is worth implementing support for it; see the discussion by Kapur of the trade-offs. In a persistent setting, we probably want to rebuild aggressively to make sure we offset the costs of rebuilding across many queries.

[^1]: In practice and in a first time, the right-hand side of the rule is an arbitrary piece of code that will perform the appropriate reduction.

[^2]: Which on its own is good for complexity -- see the leapfrog triejoin paper for an example where 3 sets of arbitrary size n are joined in constant time to an empty set even though all pairwise joins have a linear size.

@bclement-ocp That may be a dumb question but with this new database, including the rules, would it be possible to implement some kind of peephole optimizations (see the examples in #2188 ), or does the canonicalization/rebuilding of the database prevents that in some way (or would that have prohibitive cost) ?

@Gbury there are no dumb questions. I did not think about this use case too much, but one cool thing about a database is indeed that you can do queries! The short answer is yes, and from my understanding the database would have most of the pieces needed to build the "efficient and incremental pattern matching engine" mentioned in #2188.

The longer answer is that there are two ways you could do peephole optimizations with this:

One is simply to make the query as you go. So for instance to implement (Boolean_not (tag (op x y))) --> tag (negated_op x y) from #2188, you could make a query when simplifying w = Boolean_not (z): ask the database for z = tag (O), O = op(X, Y)) (following Datalog conventions I use lowercase for constants and uppercase for variables bound in the query, so this looks for the solutions of ∃x, y, o. z = tag(o) ∧ o = op(x, y)). This query is going to be relatively fast thanks to the database maintaining indices, I think it is (amortized) O(log n) where n is the number of op and tag relations if the result is empty, and O(k log n) where k is the number of solutions otherwise. Then you can introduce the equality w = tag (negated_op x y) into the database instead of w = Boolean_not (z).
Another is to use the trigger/deduction rules mechanism I mentioned (I currently only plan on implementing a slightly simpler version of this with a single hypothesis, but if we have use cases for the full version, it is easy to extend). You can have a rule (I use Datalog syntax, x :- a, b, c is read as a /\ b /\ c => x):
```
Boolean_not(A) = tag(negated_op(X, Y)) :- A = tag(T), T = op(X, Y).
```
This rule would automatically add the additional equality w = tag (negated_op (x, y)) whenever w = Boolean_not(z) is added to the database with z = tag(negated_op(X, Y)) for some X and Y. It would also automatically add that equality if we already have w = Boolean_not(z) in the database and later learn that z = tag(negated_op(x, y)). You could also make this rule remove the w = Boolean_not(z) binding.

The second approach is closer to equality saturation and what egglog does[^1], it is more complete but also obviously more expensive if applied aggressively. My intuition is that if we are reasonable with the rewrite rules we want to consider, this should scale reasonably well by exploiting some of the incremental tricks from egglog.

[^1]: A major difference is that egglog performs such rule rewrites in a loop until saturation, and checks all rules; I am instead planning to keep enough bookkeeping to know which rules should be checked, which seems more appropriate for our use case.

Some questions that I think are worth considering before embarking on what sounds like a decent amount of work, and could feasibly result in making flambda2's compile times worse:

Do we have any measure of how much the relations that are tracked in the typing env actually matter for optimizations of real programs?
For Is_int and Get_tag: would it be simpler to replace our current representation of sum types and matching by a more traditional case statement? Would that remove the need for these relations in the first place?
How important is it to maintain any information at all across join points for optimizing actual programs? How much of our compile times are caused by attempting to do so?

To be clear, those questions are really for the flambda 2 team as a whole, not @bclement-ocp specifically.

How important is it to maintain any information at all across join points for optimizing actual programs? How much of our compile times are caused by attempting to do so?

I'd say maintaining information across join points is quite important for at least two optimizations:

for unboxing, we rely on type information, so for any unboxing to take place after a join point (or rather at the join point), we need the type information; I'd expect that to happen quite regularly, especially in the case of variants
for continuation lifting, we need at least aliases to be correctly computed after join points; else the continuation lifting process alone might make the code much worse (by losing any information about the variables that were in scope before lifting and not after lifting), and the continuation lifting process is a pre-requisite of match-in-match

As for the time spent in the join algorithm, I don't know, maybe @lthls or @chambart might have better insights. That being said, I think it would be pertinent to see if we could measure that on practical experiments.

* Do we have any measure of how much the relations that are tracked in the typing env actually matter for optimizations of real programs?

There are several kinds of relations stored in the typing env. Aliases are crucial, but I assume that you were not thinking about that. Projection relations are fairly important, but we don't plan to change how they're represented at the moment. Is_int and Get_tag are important too, but I'll discuss more in the answer to your second point. For now, this is all the relations we handle in the typing env. But we are considering adding more:

Comparisons, to be able to propagate conditions down branches
Additional pure primitives, to improve CSE (make it work better modulo aliases)

We don't have numbers for how much the new relations would help. For conditions, the main use would be to remove array bound checks, but I assume that all performance-critical code were the bound checks could be removed by the compiler already use the unsafe primitives; for CSE, it's hard to know how many CSE opportunities we miss without implementing the better version.

* For `Is_int` and `Get_tag`: would it be simpler to replace our current representation of sum types and matching by a more traditional case statement? Would that remove the need for these relations in the first place?

If we had a traditional case instead of our current version, we probably wouldn't need the Is_int and Get_tag relations. But the current version allows a few things that would be more complicated to do with a traditional case statement. For instance, match x with A _ -> 0 | B _ -> 1 | ... can be compiled without branching with our current approach (at least in theory), while with traditional case statements you need a dedicated optimisation for that.

* How important is it to maintain any information at all across join points for optimizing actual programs? How much of our compile times are caused by attempting to do so?

@Gbury has more data than me about the need for join, but I will note that our initial design did not activate join points by default, only starting at -O2; from my discussions with Mark I understood that we're moving towards -O3 being the default, which I expected to mean that compilation times were not a big problem, but maybe we need to rethink this.

First off, I have done some quick and unscientific experiments to see how much time we spend joining environments. We don't really have a good infrastructure to answer this question, but looking at a few arbitrary files from the stdlib and compiler, it looks to be somewhere between 5% and 30% of compile times. This is again an unscientific experiment done on a laptop and on a small random sample of files, but 30% is too high, and higher than we collectively expected. I have opened a PR ( #3298 ) to track join times with -dprofile, and two issues (#3299 and #3300) to set up an infrastructure so we can answer this type of questions, and to address join times specifically.

Regarding @lpw25's other points, in addition to what @Gbury and @lthls said, I want to clarify a couple of things, notably regarding performance.

The implementation plan suggested above is in two parts.

The first part is to rework the join algorithm using the egglog-inspired approach, but without (significantly) changing the way information is currently stored and accessed. This should have no impact outside of join, would make the join of environments significantly more robust (I expect "unnecessary control flow obscures a CSE opportunity" to become a thing of the past) and I think it is likely to actually make flambda2's compile times better, not worse. In fact, I think this is a promising way of tackling #3300.

The second part is to allow more flexibility in the way that information is stored in the typing env. I have thought quite a bit about the performance implications and, for information that we currently store elsewhere and would move to the database (makes the code simpler, makes the information preservation more robust to join), I believe this should have minimal impact on compile times. On the other hand, this would provide a lot of flexibility and grounds for experimenting with new optimizations at a low implementation cost (see e.g. discussion about peephole optimizations). A more relational representation of information would also make it easier to tune the information that we actually want to preserve during joins, which opens up some possibilities to tune the compile times.

Why I think the egglog-inspired join would be faster

The current join algorithm is a bit of a mess, and was not written with aliases in mind. It tries to maintain aliases in a sort of best-effort way, which involve casting a wide net of variables that could be useful in the resulting environment, computing join on these variables, then filtering out the result with the variables that are actually useful. It depends on a consistent order on variables, which in practice does not exist, and also depends on the name of existential variables. This is not very efficient, fragile, and in practice also can miss some important information that we would want in the resulting environment, prompting more fragile hacks.

On the other hand, the egglog-inspired join algorithm deals with aliases and typing information separately, and directly tracks which variables are relevant. The output does not depend on variable definition order nor on the names of existential variables, and there is no need for a post-filtering pass. For these reasons alone I think that this would improve join performance, but the egglog-inspired join algorithm is also able to use an actual n-way join on variables (rather than a sequence of pairwise joins), which should also be faster (potentially much faster) for large matches.

for continuation lifting, we need at least aliases to be correctly computed after join points; else the continuation lifting process alone might make the code much worse (by losing any information about the variables that were in scope before lifting and not after lifting), and the continuation lifting process is a pre-requisite of match-in-match

FWIW I don't think the introduction of artificial join points is fundamentally required for implementing match-in-match, but I see how you've ended up in that situation. Have we attempted to measure the cost of continuation lifting on compilation times.

Aliases are crucial, but I assume that you were not thinking about that.

It depends which aliases you mean. Aliases to in-scope variables are the mechanism by which we do beta-reduction, but aliases via existentially bound logical variables seems much more expensive to manage, and the benefit is much less clear.

Projection relations are fairly important, but we don't plan to change how they're represented at the moment.

Could you say more about why they are important? Is that just for CSE or also for something else?

For instance, match x with A -> 0 | B -> 1 | ... can be compiled without branching with our current approach (at least in theory), while with traditional case statements you need a dedicated optimisation for that.

Whilst it is neat that you get that optimisation for free, I don't think I'd trade it for a more complex and probably noticeably slower optimizer. It appears to me that the current approach to sums and match statements has consumed a very large fraction of the effort that has gone into flambda 2, and my expectation is that it will continue to do so, so I do think it is valuable to spend time considering if it is worth it.

I think it is likely to actually make flambda2's compile times better, not worse

I think you make a compelling case for why -- if you take the information we currently track as fixed -- your approach is a good one and will improve the status quo. I just want to encourage people to think more carefully about whether we really should be tracking the information we currently track, and in particular to do more to measure the cost of tracking that information and to do more to measure the benefit of tracking that information in practice.

FWIW I don't think the introduction of artificial join points is fundamentally required for implementing match-in-match, but I see how you've ended up in that situation. Have we attempted to measure the cost of continuation lifting on compilation times.

Yes we did measure the cost of lifting. While on most files the cost was negligible (I suspect partly because there was not that much situations where it triggered), one file from the compiler testsuite (testsuite/test/basic/patmatch.ml) was able to trigger seemingly infinite opportunities for lifting, allowing to measure the impact it had. On that file, the increase in time was directly correlated with the number of variables introduced by the lifting process. That's why the current way to control lifting is via a "budget" which is the number of variables that lifting is allowed to create.

On patmatch.ml (so in the worst case), there was approximately a 18% increase in compilation time per 1000 lifting budget/variables created, with a pretty good correlation (see this prompt, where x is the budget/1000 and y is the time in seconds).

Finally, note that we could indeed perform math-in-match without lifting, but it's not that easy to estimate beforehand the cost of the potentially duplicated handlers; for that we would likely need some kind of speculative continuation specialization (akin to speculative function inlining), but from discussions I had, people did not seem keen on having that, thus why we've tried to have a more deterministic approach to estimate when to do match-in-match, with lifting.

ocaml-flambda / flambda-backend