Migrate from django-mptt to django-tree queries

rebeccacremona commented 1 month ago

See ENG-863.

This PR swaps in Django Tree Queries in place of Django MPTT, for keeping track of how Perma's folders are nested into trees.

Reviewing

I apologize for this beast of a PR.

I am pretty confident in the code, and folder-related logic is well-covered by our test suite. I've fully run, and fully reverted, the entire migration twice locally on a production-like database, and have verified that both packages produce an identical representation of the folder tree.

So, if there is a humongous problem... it's probably something about performance that I didn't think to test.

Even just a check for typos would be super helpful..... and I would love to know if you see anything that you think is too unreadable, or needs a comment.

In case it makes it EASIER to read, rather than just....... meaning there is now a book to read in addition to a PR..... I wrote a book of context. Please don't even glance at the below unless you find it helpful.

Abashedly, b

Motivation

To build a tree, all you actually need is for each node to know who its parent is. But in practice, to keep a tree in a database and ask questions about it efficiently, you need some cleverness.

Django MPTT indeed relies primarily on a parent field, but also stores a tree_id integer intended to uniquely identify all members of a given subtree, and lft and rght integer fields to describe the shape of the tree (that's how Modified Preorder Tree Traversal works). It uses lft, rght, and tree_id to efficiently look things up, instead of making more expensive queries that only look at parent.

But, for that to work out, your application has to carefully handle simultaneous operations (like, two users moving two folders at once). Otherwise, tree_id, lft, rght and level can get messed up, and lookups and tree descriptions will return incorrect results.

And that is the situation we found ourselves in: many of our folders' tree_id, lft, rght and level fields were incorrect... and certain operations, like inserting new top-level nodes into the tree... which requires updating the lft and rght of lots of siblings... is now pretty slow.

Django Tree Queries uses a different strategy for all this: it only relies on a parent field, and it uses fancy recursive SQL to build a representation of the tree in the database on the fly, any time you need to look up anything about tree structure. Because it does it on the fly... there are no extra fields in the database that could get out of sync... and so it is much harder for your trees to become corrupted. It also makes inserting new nodes faster: changing something about one node doesn't mean all the other nodes lft/rght/level potentially need to be updated!

So, we decided to switch packages.

Caveat

You might imagine that fancy recursive SQL that builds a representation of the tree in the database on the fly could be slow, if your tree is big enough. You would be right. For a table the size of Perma's folder table, on my laptop, it takes between 1 and 2 seconds... which is way too slow.

So, I'm introducing a performance hack that adds back in a little bit of the unreliability of the other approach: in addition to knowing its parent, every folder needs to know which folder is the apex of its little subtree, its tree's root_folder. Redundantly caching that value in the database allows us to then tell Django Tree Queries NOT to try and build a representation of the entire folder tree on the fly.... but instead, to restrict its understanding to the particular subtree in question..... which is lightning fast.

I think the tradeoff is more than worth it.

Tour of small stuff

New fields, old fields

This PR adds two new fields to the Folder model. Both are de-normalized fields, added for performance only:

tree_root, which keeps track of the id of the "apex" folder in any folder's subtree: the first of its ancestors
cached_has_children, which keeps track of whether any given folder is a "leaf" or a "branch"

It populates these fields during the migrations (1, 2).

It also drops the fields that came with Django MPTT: tree_id, lft, rght, and level.

On my machine, run on a production-like database, the migrations only took a total of 7s, so I am not worried that they will take overly long during deployment, even if they take somewhat longer than 7s.

New methods

Django MPTT included some methods that are not included with Django Tree queries: get_ancestors, get_descendants, and is_leaf_node. I decided to add methods with those names replicating their functionality, rather than otherwise refactoring our code.

Because of having adding cached_has_children, I also customized Folder's delete method: now, when you delete a folder (only allowed if a folder is entirely empty), we update the cached_has_children attribute of its parent in a transaction.

Old methods

It removes two unused/unnecessary model methods, and updates get_path to work with the new package.

get_path now looks pretty complicated.

Here is what's going on there. Since Django Tree Queries uses fancy recursive SQL to build a representation of the tree on the fly, every time you ask it a question about the tree's structure.... it doesn't turn itself on by default. If you just do a normal Django lookup, like Folder.objects.get(id=1), the folder object you get back won't know anything about the structure of the tree. In order to get Django Tree Queries to run its fancy SQL, you have to explicitly ask, by specifying with_tree_fields as a custom manager, like this: Folder.objects.with_tree_fields().get(id=1). Then, it will include its extra SQL in the query, and your object will come back annotated with tree_depth, tree_path and tree_ordering.

I can't know whether, when you call folder.get_path, you loaded the folder with_tree_fields() or not... so the method now accounts for both possibilities. If you loaded it with tree fields, great, we just format the path. If you didn't, then an Attribute Error will be raised... and we'll re-load the folder from the database with_tree_fields.

I mentioned adding tree_root as a performance hack: you can see it being used here in get_path. By adding tree_filter(tree_root_id=self.tree_root_id) to the query, we tell Django Tree queries NOT to bother trying to understand the tree structure of the entire Folder table, before answering this question: we tell it that it only needs to bother to understand the subtree with tree_root_id=self.tree_root_id... way faster.

Tweak to Celery task

I updated the task that the refreshes the value of folder.cached_path so that it reflects whatever the structure of the tree is at the moment. Now it works with the new package and updates the new field tree_root_id while it's in there.

Unimportant changes to the tests

The changes to the tests are insignificant.

I had to re-order a couple lines subsequent to adding the cached_has_children field, so that the tests were run against the "fresh" copies of these objects, where that value was accurate.

I had to remove the old fields from our old-fashioned Django test fixtures.

I also had to add some code to conftest.py to make sure that those JSON-mediated fixtures have all their tree-related fields set correctly. I decided that was MUCH easier than individually fussing with each of the fixtures, manually updating foreign keys, etc.

We are in the process of migration away from these fixtures anyway... Once we've switched entirely modern pytest fixtures that code can go.

Unimportant change to signals

You may or may not have worked with Django signals before... basically, they are little hooks into an object's life-cycle. "When someone calls save on an object, please run my pre_save signal receiver first. And please run my post_save signal receiver afterwards." Things like that. You don't really need to know.

The only thing to know here is, we keep track of changes to Link using "Django Simple History", and Django simple history uses a post-save signal to do its thing. Don't worry about why, but I had to use an F expression to fix a rare race condition in saving links, after you move the folder they are in. And.... Django Simple History breaks when you use F expressions.

So, I had to add special handing to signals.py, that swaps out the F expression for the correct value, when saving the History object. I got the code from an issue linked in their repo: I think it's not worth spending any time studying.

Removing old invoke tasks

I removed the now-obsolete invoke tasks for validating and rebuilding the old Django MPTT trees.

Django admin

I updated the Django admin to work with the new package, and added a filter so that it is easy to just one subtree of folders at a time.

Tour of big stuff

Validation

Django MPTT used to do validation, before letting you save a tree.

Django Tree Queries doesn't so I had to add it to our other validation code.

Because I added it to the serializer, that DOES allow you to make invalid moves using the save method. But... that is pretty normal: if you are using bare save, outside of the context of the serializer, you are probably a dev in the Django shell doing surgery, and we don't really need to protect against that.

`folder.save()`

The Folder's save method... was unreadable before this PR, and is worse now.

BUT.... my very next job in this project is, to refactor it, to solve concurrency issues that are not solved by this PR. So, I haven't tried to completely change it here.

Instead, I tried to keep the structure as similar as possible to how it was before, making any changes side-by-side in earlier commits. It may be easiest to read looking at https://github.com/harvard-lil/perma/pull/3527/commits/a57fcabfb3634b6f06bafbf55fb52ec7afdd46d7#diff-0950eb17b0561e9ae6b5a6cad00edcff38c9db5d41112e45427d90b73fa7fbf7R1429-R1449, and observing how the code in the if settings.TREE_PACKAGE == 'mptt': sections is different than the code in the elif settings.TREE_PACKAGE == 'tree_queries'.

But... could be this is just plain inscrutable to anybody not me.

And if that's the case, just say so, and I'll... make it better 😀.

TLDR: thanks for even a minute or two of your time.

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 87.32394% with 18 lines in your changes are missing coverage. Please review.

Project coverage is 69.53%. Comparing base (42c3d5d) to head (20fff67). Report is 149 commits behind head on develop.

:exclamation: Current head 20fff67 differs from pull request most recent head 0cae005

Please upload reports for the commit 0cae005 to get more accurate results.

Files	Patch %	Lines
perma_web/perma/celery_tasks.py	0.00%	9 Missing :warning:
perma_web/perma/models.py	91.30%	8 Missing :warning:
perma_web/perma/admin.py	92.85%	1 Missing :warning:

Additional details and impacted files

```diff @@ Coverage Diff @@ ## develop #3527 +/- ## =========================================== + Coverage 68.60% 69.53% +0.92% =========================================== Files 48 48 Lines 6794 6801 +7 =========================================== + Hits 4661 4729 +68 + Misses 2133 2072 -61 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

rebeccacremona commented 1 month ago

Performance comparisons:

Admin folder list load -> 53ms to 54 ms Moving a folder -> 0.2 seconds to 0.09 seconds New org folder -> 2.55 seconds to 0.08 seconds

Screenshots documenting the above, before/Django MPTT

An admin loads the dashboard, with all of Perma's org folders

A folder with lots of nested subfolders is moved from Personal Links to an org, and then back again

A new organization (and its folder) are created

Screenshots documenting the above, after/Django Tree Queries

An admin loads the dashboard, with all of Perma's org folders

That same folder with lots of nested subfolders is moved from Personal Links to an org, and then back again

A new organization (and its folder) are created

rebeccacremona commented 1 month ago

Performance comparison:

Loading a folder with_tree_fields with and without the tree_root hack: 1183 ms versus 2 ms

rebeccacremona commented 1 month ago

On the performance front: I would imagine that a solution that involves a lot of recursion (and, specifically, recursive SQL queries) would not necessarily be significantly slower, but use much more memory. It might be worth having a look at memory usage with this new solution, potentially on staging 😄 .

Ah! I did not look at memory usage at all, good thinking, thanks!!

sentry-io[bot] commented 1 month ago

Suspect Issues

This pull request was deployed and Sentry observed the following issues:

‼️ ProgrammingError: column "lft" of relation "perma_folder" does not exist /api/v1/{parent_type}/{parent_id}/folders/ View Issue

_{Did you find this useful? React with a 👍 or 👎}

harvard-lil / perma