tweag / pirouette

Language-generic workbench for building static analysis
MIT License
47 stars 2 forks source link

Introduce a Union-Find structure to make constraints' code cleaner #150

Closed Niols closed 1 year ago

Niols commented 2 years ago

This PR builds on top of #146. It comes from the remark that the code for constraints unification basically inlines the implementation of a union-find data structure, which makes it harder to read, harder to optimise (with eg. the typical union-find optimisations) and more error-prone. The current PR implements a tiny union-find module and

What is a Union-Find data structure?

Not going to dig into history here, but a union-find data structure basically allows to represent efficiently equivalence classes over values as long as there exists a total comparison function. A Union-Find has two main functions:

Union-Find can very easily be extended to represent a map from equivalence classes to values. Such a structure has three main functions:

Some optimisations can make such structures very efficient with quasi-linear time for both operations. One of them requires that the find operation itself updates the structure, giving the types:

union :: (value -> value -> value) -> key -> key -> UnionFind key value -> UnionFind key value
--       ^^^ unchanged

find :: key -> UnionFind key value -> (UnionFind key value, key, Maybe value)
areEquivalent :: key -> key -> UnionFind key value -> (UnionFind key value, Bool)
lookup :: key -> UnionFind key value -> (UnionFind key value, Maybe value)

insert :: (value -> value -> value) -> key -> value -> UnionFind key value -> UnionFind key value
--        ^^^ unchanged

(Note: a state monad would make those types much nicer.)

Where is this inlined Union-Find?

As of f12b951 on #146, a set of constraints is represented as follows:

https://github.com/tweag/pirouette/blob/f12b95104b8b17996228587867c85f193944b7ab/src/Pirouette/SMT/Constraints.hs#L29-L41

Note in particular the fields csAssignments :: M.Map meta (TermMeta lang meta) and csMetaEq :: M.Map meta meta that only make sense when taken together:

Keeping those invariants brings complexity to the unification functions. See for instance unifyMetaWith:

https://github.com/tweag/pirouette/blob/f12b95104b8b17996228587867c85f193944b7ab/src/Pirouette/SMT/Constraints.hs#L135-L147

and in particular line 142 that resolves the representative of the equivalence class of the given meta-variable. In this example, abstracting and using a union-find library would bring two advantages. First, it would make the code simpler by removing some cases. Second, it would make clearer what happens in some corner cases. For instance, in unifyMetaWith, what happens if there is an equivalence class in csMetaEq but no binding for it in csAssignments? The variable is considered to be a new meta-variable and is sent to unifyNewMetaWith. Is this a bug? Is this intended? Because it sure seem to be violating unifyNewMetaWith's invariant.

How would it look with a Union-Find?

Look at the code! In 178003f21fbbb5d5aeb7a9318a88bcdc9af81d2f, there is in particular:

https://github.com/tweag/pirouette/blob/178003f21fbbb5d5aeb7a9318a88bcdc9af81d2f/src/Pirouette/SMT/Constraints.hs#L30-L40

and

https://github.com/tweag/pirouette/blob/178003f21fbbb5d5aeb7a9318a88bcdc9af81d2f/src/Pirouette/SMT/Constraints.hs#L130-L144

Additional Notes and Questions

  1. Is it possible for meta-variables to be equivalent (and mapped in csMetaEq as such) without having any binding in csAssignments? I tried to assume that it wasn't but the “OHearn” test (and only that one) started to fail. It goes against my initial intuition but I see now why it would make sense. I would therefore need someone's opinion on this.

  2. Some previous functionalities are missing (eg. the Monoid instance for ConstraintSet; actually, that one seems buggy to me in the previous implementation). They do not seem used anywhere, though?

  3. At this point, I did not implement any of the optimisations of the union-find data structure. The union-find's functions' interface already prepares for this, though.

  4. Maybe we want to give a better interface to this union-find module by using a state monad somewhere?

  5. A persistent union-find is a peculiar data structure in that the function find modifies the structure without changing its semantics. In this particular case, it means that it is possible to make find actually mutate the memory without violating the interesting properties of persistence. This gives an extra tiny speedup in case the union-find is used in several places without much differences. This makes things a bit dirty though.

aspiwack commented 2 years ago

For the record (understanding that it's easy to change in the future), in my experience, a mutable implementation (based on ST) is quite a bit faster. So if performance is an issue it's something to think about.

Niols commented 2 years ago

Note at this point that the performance seem to be basically the same as before, if not a bit slower (because more indirections?). Implementing the optimisations should improve this drastically, if it ever becomes relevant.

Niols commented 2 years ago
  • The OHearn test has caused issues before, but it makes sense that we could have metavariables potentially equal that are unassigned. But maybe we need some more complex tests if there is generally only this one test where these edge cases come into play.

It is definitely weird that only one test changes behaviour depending on whether or not we allow meta-variables to be equal without assignment. We really need more tests that expose that kind of edge cases, indeed.

  • On a similar note, I mentioned before but we probably would want some tests just testing the union-find implementation on its own, especially once the optimizations are implemented.

Definitely! It will also be good practice for me to have a look at how tests are done. I will try adding tests for the union-find implementation and also maybe some extra tests for the symbolic engine in general (which then would probably escape the scope of this PR).

  • The optimizations would be nice, and it would also be nice to see how much the optimizations affect things. For example, I don't really have any intuition as to how often we do nontrivial unions of metavariables or anything

Well, the weird thing is that, by mimicking the previous implementation, I only ended up using lookup, trivialUnion and trivialInsert, while the whole purpose of the union-find is to have this “fancy” union which takes care of things for you. Basically, I think it shows how deeply-inlined the union-find was, and I think it calls for a full rewrite of unifyMetaWith and unifyNewMetaWith into (almost only) one single call to union with the right merge function.

To be honest, I am not even sure at this point that I understand how this algorithm can possibly work without ever triggering a non-trivial union. I suppose there is an important property of meta-variables that makes this happen (or rather not happen), but I don't see it yet.

  • It seems like using a State monad would make things a bit nicer, and apparently maybe even faster!

It would definitely make the interface of the UnionFind module look nicer. And also maybe it would make easier, when using UnionFind, to propagate the modifications (and therefore the optimisations, when we implement them)?

  • I don't think I understand what persistent union-find means. Is the idea that we would no longer use a tree-like structure to keep track of the equivalences?

This question was the occasion for me to read on the difference between “persistent” and “immutable”. In short:

(If you already knew all that, sorry, then it was only for me :p) My point was therefore that, as of now, UnionFind provides an immutable data-structure. But union-find is a funny case where it is easy to make find have a side-effect instead of returning an updated structure while keeping the persistent interface. You get the nice properties of persistence but the interface becomes slightly simpler (because find doesn't need to return a union-find anymore). Additionally, you get the speed-up of the optimisation and this speed-up is shared between diverging versions of the union-find, ie. if you start from a union-find u and update it to v, then start from u again and update it to w, and then you call find on v, you gain speed-ups for further calls to find on v but also on w. In our case where we keep branching at each if-then-else, it might actually give us an extra speed-up. This is a consideration for muuuuch later, and it is also out of my Haskell skills because I basically have no idea how to use/implement mutable structures :p

Overall, this seems very nice, good work! I suppose if this ultimately gets merged (or rather if Victor's PR gets merged), the most pressing issues would be adding optimizations and adding tests.

I'll start by adding tests because it is also very good practice for me. This will allow us to measure the optimisations if we ever decide to implement them.

lucaspena commented 2 years ago

Thanks for the response, I understand the persistence property now. As for tests, if you would like to pair program some at some point this week or next, let me know. It would be a good learning experience for me as well.

Niols commented 1 year ago

As for tests, if you would like to pair program some at some point this week or next, let me know. It would be a good learning experience for me as well.

That sounds great; let's do that!

aspiwack commented 1 year ago

May I inquire why this PR is still ongoing? It's been open for a month, it should have been merged by now.

Niols commented 1 year ago

@0xd34df00d

That's a really good separation of responsibilities, thanks! In the long run, it'd be cool to have that as a separate library on Hackage, as I can see it useful elsewhere as well, and other libraries have some downsides (at least from a quick survey).

I had the same impression about other libraries which is why I decided to go for a home implementation. I suppose it could be separated from Pirouette indeed.

Having said that, I only have minor concerns. In particular, I'm not sure that Ancestor should really contain Maybe — if the UF user has a use case with partiality, they can just have UnionFind key (Maybe val). At least, I don't feel like Nothing is intrinsic to the UF itself. I see there was some discussion about that already, but I'm not sure how relevant it is now, so we can catch up on that separately.

It is a good question and I'm not sure what the right answer is. For now, Nothing is used to convey the fact that we know of an equivalence class but we haven't bound a value to it. It seems sometimes important, for instance for code that learns equivalences and assignments one by one and not necessarily in an order that would work without a Maybe.

The code would probably be nicer without a Maybe under Ancestor. At least, it would avoid merging two different pieces of information into one, and it would avoid potential bugs like the one discussed in https://github.com/tweag/pirouette/pull/150#discussion_r991250090.

Also, ideally it'd be great to have more tests, especially property-based ones, but that can wait until when (and if) the UF implementation will be factored out to a separate library. Passing the usual pirouette tests is good enough for now, since they exercise enough of UF machinery.

I would be all for property-based tests for this little piece of code. I hope that the usual Pirouette tests are good enough but I am not sure. Do we have property-based testing machinery in the Pirouette project or would that be adding big dependencies?

Niols commented 1 year ago

@aspiwack

May I inquire why this PR is still ongoing? It's been open for a month, it should have been merged by now.

You are completely right, this PR should be merged by now (or closed; but not stalled). I left it hanging because I wasn't sure that it was the right way to go. I will make sure that we review and merge it quickly.

Niols commented 1 year ago

This PR seems ready to be merged. It passes all the tests that were previously there as well as the newly-introduced ones. It now implements the classic optimisations of the Union-Find structure. It could be further optimised by using a structure different from a Map (but still persistent), but I don't think it makes much sense to go through the trouble right now. We can always iterate later on if we feel the need.