Closed rebeccacremona closed 1 month ago
Attention: Patch coverage is 87.32394%
with 18 lines
in your changes are missing coverage. Please review.
Project coverage is 69.53%. Comparing base (
42c3d5d
) to head (20fff67
). Report is 149 commits behind head on develop.:exclamation: Current head 20fff67 differs from pull request most recent head 0cae005
Please upload reports for the commit 0cae005 to get more accurate results.
Files | Patch % | Lines |
---|---|---|
perma_web/perma/celery_tasks.py | 0.00% | 9 Missing :warning: |
perma_web/perma/models.py | 91.30% | 8 Missing :warning: |
perma_web/perma/admin.py | 92.85% | 1 Missing :warning: |
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Performance comparisons:
Admin folder list load -> 53ms to 54 ms Moving a folder -> 0.2 seconds to 0.09 seconds New org folder -> 2.55 seconds to 0.08 seconds
An admin loads the dashboard, with all of Perma's org folders
A folder with lots of nested subfolders is moved from Personal Links to an org, and then back again
A new organization (and its folder) are created
An admin loads the dashboard, with all of Perma's org folders
That same folder with lots of nested subfolders is moved from Personal Links to an org, and then back again
A new organization (and its folder) are created
Performance comparison:
Loading a folder with_tree_fields
with and without the tree_root
hack: 1183 ms versus 2 ms
On the performance front: I would imagine that a solution that involves a lot of recursion (and, specifically, recursive SQL queries) would not necessarily be significantly slower, but use much more memory. It might be worth having a look at memory usage with this new solution, potentially on staging π .
Ah! I did not look at memory usage at all, good thinking, thanks!!
This pull request was deployed and Sentry observed the following issues:
/api/v1/{parent_type}/{parent_id}/folders/
View IssueDid you find this useful? React with a π or π
See ENG-863.
This PR swaps in Django Tree Queries in place of Django MPTT, for keeping track of how Perma's folders are nested into trees.
Reviewing
I apologize for this beast of a PR.
I am pretty confident in the code, and folder-related logic is well-covered by our test suite. I've fully run, and fully reverted, the entire migration twice locally on a production-like database, and have verified that both packages produce an identical representation of the folder tree.
So, if there is a humongous problem... it's probably something about performance that I didn't think to test.
Even just a check for typos would be super helpful..... and I would love to know if you see anything that you think is too unreadable, or needs a comment.
In case it makes it EASIER to read, rather than just....... meaning there is now a book to read in addition to a PR..... I wrote a book of context. Please don't even glance at the below unless you find it helpful.
Abashedly, b
Motivation
To build a tree, all you actually need is for each node to know who its parent is. But in practice, to keep a tree in a database and ask questions about it efficiently, you need some cleverness.
Django MPTT indeed relies primarily on a
parent
field, but also stores atree_id
integer intended to uniquely identify all members of a given subtree, andlft
andrght
integer fields to describe the shape of the tree (that's how Modified Preorder Tree Traversal works). It useslft
,rght
, andtree_id
to efficiently look things up, instead of making more expensive queries that only look atparent
.But, for that to work out, your application has to carefully handle simultaneous operations (like, two users moving two folders at once). Otherwise,
tree_id
,lft
,rght
andlevel
can get messed up, and lookups and tree descriptions will return incorrect results.And that is the situation we found ourselves in: many of our folders'
tree_id
,lft
,rght
andlevel
fields were incorrect... and certain operations, like inserting new top-level nodes into the tree... which requires updating thelft
andrght
of lots of siblings... is now pretty slow.Django Tree Queries uses a different strategy for all this: it only relies on a
parent
field, and it uses fancy recursive SQL to build a representation of the tree in the database on the fly, any time you need to look up anything about tree structure. Because it does it on the fly... there are no extra fields in the database that could get out of sync... and so it is much harder for your trees to become corrupted. It also makes inserting new nodes faster: changing something about one node doesn't mean all the other nodes lft/rght/level potentially need to be updated!So, we decided to switch packages.
Caveat
You might imagine that fancy recursive SQL that builds a representation of the tree in the database on the fly could be slow, if your tree is big enough. You would be right. For a table the size of Perma's folder table, on my laptop, it takes between 1 and 2 seconds... which is way too slow.
So, I'm introducing a performance hack that adds back in a little bit of the unreliability of the other approach: in addition to knowing its
parent
, every folder needs to know which folder is the apex of its little subtree, its tree'sroot_folder
. Redundantly caching that value in the database allows us to then tell Django Tree Queries NOT to try and build a representation of the entire folder tree on the fly.... but instead, to restrict its understanding to the particular subtree in question..... which is lightning fast.I think the tradeoff is more than worth it.
Tour of small stuff
New fields, old fields
This PR adds two new fields to the Folder model. Both are de-normalized fields, added for performance only:
tree_root, which keeps track of the id of the "apex" folder in any folder's subtree: the first of its ancestors
cached_has_children, which keeps track of whether any given folder is a "leaf" or a "branch"
It populates these fields during the migrations (1, 2).
It also drops the fields that came with Django MPTT:
tree_id
,lft
,rght
, andlevel
.On my machine, run on a production-like database, the migrations only took a total of 7s, so I am not worried that they will take overly long during deployment, even if they take somewhat longer than 7s.
New methods
Django MPTT included some methods that are not included with Django Tree queries:
get_ancestors
,get_descendants
, andis_leaf_node
. I decided to add methods with those names replicating their functionality, rather than otherwise refactoring our code.Because of having adding
cached_has_children
, I also customized Folder's delete method: now, when you delete a folder (only allowed if a folder is entirely empty), we update thecached_has_children
attribute of its parent in a transaction.Old methods
It removes two unused/unnecessary model methods, and updates
get_path
to work with the new package.get_path
now looks pretty complicated.Here is what's going on there. Since Django Tree Queries uses fancy recursive SQL to build a representation of the tree on the fly, every time you ask it a question about the tree's structure.... it doesn't turn itself on by default. If you just do a normal Django lookup, like
Folder.objects.get(id=1)
, the folder object you get back won't know anything about the structure of the tree. In order to get Django Tree Queries to run its fancy SQL, you have to explicitly ask, by specifyingwith_tree_fields
as a custom manager, like this:Folder.objects.with_tree_fields().get(id=1)
. Then, it will include its extra SQL in the query, and your object will come back annotated withtree_depth
,tree_path
andtree_ordering
.I can't know whether, when you call
folder.get_path
, you loaded the folderwith_tree_fields()
or not... so the method now accounts for both possibilities. If you loaded it with tree fields, great, we just format the path. If you didn't, then anAttribute Error
will be raised... and we'll re-load the folder from the databasewith_tree_fields
.I mentioned adding
tree_root
as a performance hack: you can see it being used here inget_path
. By addingtree_filter(tree_root_id=self.tree_root_id)
to the query, we tell Django Tree queries NOT to bother trying to understand the tree structure of the entire Folder table, before answering this question: we tell it that it only needs to bother to understand the subtree withtree_root_id=self.tree_root_id
... way faster.Tweak to Celery task
I updated the task that the refreshes the value of folder.cached_path so that it reflects whatever the structure of the tree is at the moment. Now it works with the new package and updates the new field
tree_root_id
while it's in there.Unimportant changes to the tests
The changes to the tests are insignificant.
I had to re-order a couple lines subsequent to adding the
cached_has_children
field, so that the tests were run against the "fresh" copies of these objects, where that value was accurate.I had to remove the old fields from our old-fashioned Django test fixtures.
I also had to add some code to conftest.py to make sure that those JSON-mediated fixtures have all their tree-related fields set correctly. I decided that was MUCH easier than individually fussing with each of the fixtures, manually updating foreign keys, etc.
We are in the process of migration away from these fixtures anyway... Once we've switched entirely modern pytest fixtures that code can go.
Unimportant change to signals
You may or may not have worked with Django
signals
before... basically, they are little hooks into an object's life-cycle. "When someone callssave
on an object, please run mypre_save
signal receiver first. And please run mypost_save
signal receiver afterwards." Things like that. You don't really need to know.The only thing to know here is, we keep track of changes to
Link
using "Django Simple History", and Django simple history uses a post-save signal to do its thing. Don't worry about why, but I had to use anF
expression to fix a rare race condition in saving links, after you move the folder they are in. And.... Django Simple History breaks when you use F expressions.So, I had to add special handing to
signals.py
, that swaps out the F expression for the correct value, when saving the History object. I got the code from an issue linked in their repo: I think it's not worth spending any time studying.Removing old invoke tasks
I removed the now-obsolete invoke tasks for validating and rebuilding the old Django MPTT trees.
Django admin
I updated the Django admin to work with the new package, and added a filter so that it is easy to just one subtree of folders at a time.
Tour of big stuff
Validation
Django MPTT used to do validation, before letting you save a tree.
Django Tree Queries doesn't so I had to add it to our other validation code.
Because I added it to the serializer, that DOES allow you to make invalid moves using the
save
method. But... that is pretty normal: if you are using baresave
, outside of the context of the serializer, you are probably a dev in the Django shell doing surgery, and we don't really need to protect against that.folder.save()
The Folder's
save
method... was unreadable before this PR, and is worse now.BUT.... my very next job in this project is, to refactor it, to solve concurrency issues that are not solved by this PR. So, I haven't tried to completely change it here.
Instead, I tried to keep the structure as similar as possible to how it was before, making any changes side-by-side in earlier commits. It may be easiest to read looking at https://github.com/harvard-lil/perma/pull/3527/commits/a57fcabfb3634b6f06bafbf55fb52ec7afdd46d7#diff-0950eb17b0561e9ae6b5a6cad00edcff38c9db5d41112e45427d90b73fa7fbf7R1429-R1449, and observing how the code in the
if settings.TREE_PACKAGE == 'mptt':
sections is different than the code in theelif settings.TREE_PACKAGE == 'tree_queries'
.But... could be this is just plain inscrutable to anybody not me.
And if that's the case, just say so, and I'll... make it better π.
TLDR: thanks for even a minute or two of your time.