peopledoc / django-ltree-demo

A demo for storing and querying trees in Django using PostgreSQL
MIT License
94 stars 12 forks source link
approved-public django ghec-mig-migrated hierarchical-data ltree postgresql python trees

How to store trees with Django & PostgreSQL

Rationale

If you ever had the need to store hierarchical data (trees) with Django, you probably had to use a library like django-mptt or django-treebeard. Those libraries work fine at a small scale, but here at PeopleDoc we have encountered a lot of issues when using them at a bigger scale (tables with hundreds of thousands of rows and quite a lot of writings).

It turns out that storing trees in a database has been a solved problem since a long time, at least with PostgreSQL. The ltree extension provides a convenient data structure which is very fast on reads, and with almost no impact on writes. The algorithm used is very close to django-treebeard's materialized paths, but with all the power of PostgreSQL.

The main downside of using ltree is that you have to maintain the materialized path yourself. It doesn't come with any tool to do it automatically. But fortunately, it's actually quite simple to maintain this path using PostgreSQL triggers!

Integration with Django

In demo/categories/ltree.py you will find a very simple Django field for the ltree data type. This field can be used in any Django model, and adds two lookups: descendant and ancestor. Those lookups allow you to query the descendants or the ancestors of any object with a very simple SQL query.

For example, let's say you have the following model:

from django.db import models

from project.ltree import LtreeField

class Category(models.Model):
  parent = models.ForeignKey('self', null=True)
  code = models.CharField(maxlength=32, unique=True)
  path = LtreeField()

The path field represents the path from the root to the node, where each node is represented by its code (it could also be its id, but using the code is more readable when debugging). For example, if you have a genetic category, under a science category, under a top category, its path would be top.science.category.

Thanks to the descendant and ancestor lookups, the get_descendants method in django-mptt can be rewritten as:

def get_descendants(self):
    return Category.objects.filter(path__descendant=self.path)

This would generate a SQL query close to:

SELECT * FROM category WHERE path <@ 'science.biology'

The magic part: PostgreSQL triggers

If you add a ltree field to your model, you will have to keep the field up-to-date when inserting or updating instances. We could do that with Django signals, but it turns out that PostgreSQL is far better for maintaining integrity & writing efficient code.

Every time we insert or update a row, we can reconstruct its path by appending its code to the path of its parent. If the path has changed, we'll also need to update the path of the children, which can be written as a simple UPDATE query.

All that can be done easily with PostgreSQL triggers. You can find an implementation of those triggers in the file demo/categories/sql/triggers.sql.

The demo

In the demo, the following files are the most important:

How to install the demo

Conclusion

With a few lines a declarative, idiomatic Django code and ~50 lines of SQL we have implemented a fast and consistent solution for storing and querying trees.

Sometimes it's good to delegate complicated data manipulation to the database instead of doing everything in Python :) .