Introduce a Union-Find structure to make constraints' code cleaner

This PR builds on top of #146. It comes from the remark that the code for constraints unification basically inlines the implementation of a union-find data structure, which makes it harder to read, harder to optimise (with eg. the typical union-find optimisations) and more error-prone. The current PR implements a tiny union-find module and

What is a Union-Find data structure?

Not going to dig into history here, but a union-find data structure basically allows to represent efficiently equivalence classes over values as long as there exists a total comparison function. A Union-Find has two main functions:

union merges two equivalence classes together. This typically has the type key -> key -> UnionFind key -> UnionFind key.
find returns a representative of the given equivalence class. This typically has the type key -> UnionFind key with the invariant that for all uf, k1, k2, find k1 uf == find k2 uf if and only if k1 and k2 are in the same equivalence class. This function is typically not exposed to the user to which we rather expose an equivalence test of type key -> key -> UnionFind key -> Bool.

Union-Find can very easily be extended to represent a map from equivalence classes to values. Such a structure has three main functions:

union merges together two equivalence classes. Because these equivalence classes are associated to values, one must specify how to merge the values as well. Typically, the type would be union :: (value -> value -> value) -> key -> key -> UnionFind key value -> UnionFind key value.
find returns a representative of the given equivalence class as well as the associated value. This typically has the type key -> UnionFind key value -> (key, Maybe value). Once again, this function is typically not exposed and we will prefer to expose an equivalence test (key -> key -> UnionFind key value -> Bool) and a lookup (key -> UnionFind key value -> Maybe value).
insert (or add or whatever, I don't think there is a well-defined name for that) adds a mapping from key to value. Because there might already be a mapping from this equivalence class to a value, one must specify how to merge the values. Typically, the type would be insert :: (value -> value -> value) -> key -> value -> UnionFind key value -> UnionFind key value.

Some optimisations can make such structures very efficient with quasi-linear time for both operations. One of them requires that the find operation itself updates the structure, giving the types:

union :: (value -> value -> value) -> key -> key -> UnionFind key value -> UnionFind key value
--       ^^^ unchanged

find :: key -> UnionFind key value -> (UnionFind key value, key, Maybe value)
areEquivalent :: key -> key -> UnionFind key value -> (UnionFind key value, Bool)
lookup :: key -> UnionFind key value -> (UnionFind key value, Maybe value)

insert :: (value -> value -> value) -> key -> value -> UnionFind key value -> UnionFind key value
--        ^^^ unchanged

(Note: a state monad would make those types much nicer.)

Where is this inlined Union-Find?

As of f12b951 on #146, a set of constraints is represented as follows:

https://github.com/tweag/pirouette/blob/f12b95104b8b17996228587867c85f193944b7ab/src/Pirouette/SMT/Constraints.hs#L29-L41

Note in particular the fields csAssignments :: M.Map meta (TermMeta lang meta) and csMetaEq :: M.Map meta meta that only make sense when taken together:

csMetaEq maps meta-variables to other meta-variables with the invariant that if v maps to v', then v' is smaller than v for some comparison (here, the lexicographic order, but it does not matter). csMetaEq therefore represents an equivalence relation of meta-variables as well as a way to pick their representative: the smallest element according to the comparison.
csAssignments maps meta-variables to terms with the invariant (not mentioned in the comments) that a key in csAssignments is always a representative in csMetaEq. It is therefore intended as a map from equivalence classes to terms.

Keeping those invariants brings complexity to the unification functions. See for instance unifyMetaWith:

https://github.com/tweag/pirouette/blob/f12b95104b8b17996228587867c85f193944b7ab/src/Pirouette/SMT/Constraints.hs#L135-L147

and in particular line 142 that resolves the representative of the equivalence class of the given meta-variable. In this example, abstracting and using a union-find library would bring two advantages. First, it would make the code simpler by removing some cases. Second, it would make clearer what happens in some corner cases. For instance, in unifyMetaWith, what happens if there is an equivalence class in csMetaEq but no binding for it in csAssignments? The variable is considered to be a new meta-variable and is sent to unifyNewMetaWith. Is this a bug? Is this intended? Because it sure seem to be violating unifyNewMetaWith's invariant.

How would it look with a Union-Find?

Look at the code! In 178003f21fbbb5d5aeb7a9318a88bcdc9af81d2f, there is in particular:

https://github.com/tweag/pirouette/blob/178003f21fbbb5d5aeb7a9318a88bcdc9af81d2f/src/Pirouette/SMT/Constraints.hs#L30-L40

and

https://github.com/tweag/pirouette/blob/178003f21fbbb5d5aeb7a9318a88bcdc9af81d2f/src/Pirouette/SMT/Constraints.hs#L130-L144

Additional Notes and Questions

Is it possible for meta-variables to be equivalent (and mapped in csMetaEq as such) without having any binding in csAssignments? I tried to assume that it wasn't but the “OHearn” test (and only that one) started to fail. It goes against my initial intuition but I see now why it would make sense. I would therefore need someone's opinion on this.
Some previous functionalities are missing (eg. the Monoid instance for ConstraintSet; actually, that one seems buggy to me in the previous implementation). They do not seem used anywhere, though?
At this point, I did not implement any of the optimisations of the union-find data structure. The union-find's functions' interface already prepares for this, though.
Maybe we want to give a better interface to this union-find module by using a state monad somewhere?
A persistent union-find is a peculiar data structure in that the function find modifies the structure without changing its semantics. In this particular case, it means that it is possible to make find actually mutate the memory without violating the interesting properties of persistence. This gives an extra tiny speedup in case the union-find is used in several places without much differences. This makes things a bit dirty though.

For the record (understanding that it's easy to change in the future), in my experience, a mutable implementation (based on ST) is quite a bit faster. So if performance is an issue it's something to think about.

Note at this point that the performance seem to be basically the same as before, if not a bit slower (because more indirections?). Implementing the optimisations should improve this drastically, if it ever becomes relevant.

The OHearn test has caused issues before, but it makes sense that we could have metavariables potentially equal that are unassigned. But maybe we need some more complex tests if there is generally only this one test where these edge cases come into play.

It is definitely weird that only one test changes behaviour depending on whether or not we allow meta-variables to be equal without assignment. We really need more tests that expose that kind of edge cases, indeed.

On a similar note, I mentioned before but we probably would want some tests just testing the union-find implementation on its own, especially once the optimizations are implemented.

Definitely! It will also be good practice for me to have a look at how tests are done. I will try adding tests for the union-find implementation and also maybe some extra tests for the symbolic engine in general (which then would probably escape the scope of this PR).

The optimizations would be nice, and it would also be nice to see how much the optimizations affect things. For example, I don't really have any intuition as to how often we do nontrivial unions of metavariables or anything

Well, the weird thing is that, by mimicking the previous implementation, I only ended up using lookup, trivialUnion and trivialInsert, while the whole purpose of the union-find is to have this “fancy” union which takes care of things for you. Basically, I think it shows how deeply-inlined the union-find was, and I think it calls for a full rewrite of unifyMetaWith and unifyNewMetaWith into (almost only) one single call to union with the right merge function.

To be honest, I am not even sure at this point that I understand how this algorithm can possibly work without ever triggering a non-trivial union. I suppose there is an important property of meta-variables that makes this happen (or rather not happen), but I don't see it yet.

It seems like using a State monad would make things a bit nicer, and apparently maybe even faster!

It would definitely make the interface of the UnionFind module look nicer. And also maybe it would make easier, when using UnionFind, to propagate the modifications (and therefore the optimisations, when we implement them)?

I don't think I understand what persistent union-find means. Is the idea that we would no longer use a tree-like structure to keep track of the equivalences?

This question was the occasion for me to read on the difference between “persistent” and “immutable”. In short:

Persistence is a property of the interface: updates (eg. with Data.Map.insert) give you a new pointer to an updated structure but old pointers are still valid. Observations (eg. with Data.Map.lookup) cannot distinguish the state of an old pointer before or after updates.
Immutability is an implementation technique where you don't allow rewriting pointers. In particular, it trivially gives you persistence of your data structures.

(If you already knew all that, sorry, then it was only for me :p) My point was therefore that, as of now, UnionFind provides an immutable data-structure. But union-find is a funny case where it is easy to make find have a side-effect instead of returning an updated structure while keeping the persistent interface. You get the nice properties of persistence but the interface becomes slightly simpler (because find doesn't need to return a union-find anymore). Additionally, you get the speed-up of the optimisation and this speed-up is shared between diverging versions of the union-find, ie. if you start from a union-find u and update it to v, then start from u again and update it to w, and then you call find on v, you gain speed-ups for further calls to find on v but also on w. In our case where we keep branching at each if-then-else, it might actually give us an extra speed-up. This is a consideration for muuuuch later, and it is also out of my Haskell skills because I basically have no idea how to use/implement mutable structures :p

Overall, this seems very nice, good work! I suppose if this ultimately gets merged (or rather if Victor's PR gets merged), the most pressing issues would be adding optimizations and adding tests.

I'll start by adding tests because it is also very good practice for me. This will allow us to measure the optimisations if we ever decide to implement them.

Thanks for the response, I understand the persistence property now. As for tests, if you would like to pair program some at some point this week or next, let me know. It would be a good learning experience for me as well.

As for tests, if you would like to pair program some at some point this week or next, let me know. It would be a good learning experience for me as well.

That sounds great; let's do that!

May I inquire why this PR is still ongoing? It's been open for a month, it should have been merged by now.

@0xd34df00d

That's a really good separation of responsibilities, thanks! In the long run, it'd be cool to have that as a separate library on Hackage, as I can see it useful elsewhere as well, and other libraries have some downsides (at least from a quick survey).

I had the same impression about other libraries which is why I decided to go for a home implementation. I suppose it could be separated from Pirouette indeed.

Having said that, I only have minor concerns. In particular, I'm not sure that Ancestor should really contain Maybe — if the UF user has a use case with partiality, they can just have UnionFind key (Maybe val). At least, I don't feel like Nothing is intrinsic to the UF itself. I see there was some discussion about that already, but I'm not sure how relevant it is now, so we can catch up on that separately.

It is a good question and I'm not sure what the right answer is. For now, Nothing is used to convey the fact that we know of an equivalence class but we haven't bound a value to it. It seems sometimes important, for instance for code that learns equivalences and assignments one by one and not necessarily in an order that would work without a Maybe.

The code would probably be nicer without a Maybe under Ancestor. At least, it would avoid merging two different pieces of information into one, and it would avoid potential bugs like the one discussed in https://github.com/tweag/pirouette/pull/150#discussion_r991250090.

Also, ideally it'd be great to have more tests, especially property-based ones, but that can wait until when (and if) the UF implementation will be factored out to a separate library. Passing the usual pirouette tests is good enough for now, since they exercise enough of UF machinery.

I would be all for property-based tests for this little piece of code. I hope that the usual Pirouette tests are good enough but I am not sure. Do we have property-based testing machinery in the Pirouette project or would that be adding big dependencies?

@aspiwack

May I inquire why this PR is still ongoing? It's been open for a month, it should have been merged by now.

You are completely right, this PR should be merged by now (or closed; but not stalled). I left it hanging because I wasn't sure that it was the right way to go. I will make sure that we review and merge it quickly.

This PR seems ready to be merged. It passes all the tests that were previously there as well as the newly-introduced ones. It now implements the classic optimisations of the Union-Find structure. It could be further optimised by using a structure different from a Map (but still persistent), but I don't think it makes much sense to go through the trouble right now. We can always iterate later on if we feel the need.

tweag / pirouette