jazzband / django-simple-history

Store model history and view/revert changes from admin site.
https://django-simple-history.readthedocs.org
BSD 3-Clause "New" or "Revised" License

More efficient clean_duplicate_history with PostgreSQL Partitioning #1013

Open humphrey opened 1 year ago

humphrey commented 1 year ago

Problem Statement

./manage.py clean_duplicate_history works great for small tables, but since it loads and processes each item individually, it becomes impractical to run once your tables have millions of rows. And that's exactly when you most need to clean up duplicates.

Additionally, for tables with millions of rows, I usually have lots of fields excluded to save database space. This means that I end up with even more duplicate historical rows, since the values that change are not stored in the history table.
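For reference, the kind of setup I mean looks like the sketch below (the model and field names are purely illustrative):

from django.db import models
from simple_history.models import HistoricalRecords

class Item(models.Model):
    name = models.CharField(max_length=100)
    view_count = models.PositiveIntegerField(default=0)
    updated_at = models.DateTimeField(auto_now=True)

    # Excluded fields are never copied into the historical table, so a save
    # that only changes them produces a historical row identical to the
    # previous one, i.e. a duplicate
    history = HistoricalRecords(excluded_fields=['view_count', 'updated_at'])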

Describe the solution you'd like

An option to offload the detection of duplicates to the database engine. With PostgreSQL this can be done using window functions (ROW_NUMBER() with PARTITION BY), and I have written my own management commands with custom SQL to achieve this for my largest table. Having this packaged up nicely within simple-history would bring improvements across the board. But I 100% understand if this level of optimization is beyond the scope of simple-history and perhaps too complex to maintain.

Additional context

Here is an example SQL query that achieves this (I've simplified the query I am currently using in my project).

Obviously this is a complex solution, which would require generating SQL based on the fields of your model. So I'm throwing the idea out there, but I also expect it might be a bit too complicated a solution for this project.

-- Delete every historical row whose tracked values duplicate an earlier
-- row in the same group (row number > 1 within its partition)
delete from myapp_historicalitem as h
where h.history_id in (
    select subq.history_id from (
        select
            history_id,
            ROW_NUMBER() OVER (
                -- group rows that share all tracked values
                PARTITION BY
                    my_field_1,
                    my_field_2,
                    my_field_3,
                    my_field_4,
                    my_field_5,
                    history_change_reason,
                    history_type,
                    history_user_id
                -- keep the earliest row in each group
                ORDER BY
                    id,
                    history_id,
                    history_date
            ) as n
        from myapp_historicalitem as hi
    ) as subq where subq.n > 1
)
humphrey commented 1 year ago

And I assume this could be implemented using a Django Window expression; I just chose to solve it in raw SQL for my use case.
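For reference, a rough ORM equivalent of the SQL above might look like the following. This is an untested sketch: it assumes Django 4.2+ (which allows filtering against window annotations) and the hypothetical Item model and field names from the example query, and it intentionally keeps the same semantics as the SQL.

from django.db.models import F, Window
from django.db.models.functions import RowNumber

from myapp.models import Item  # hypothetical model from the SQL above

history = Item.history.all()
duplicate_ids = (
    history.annotate(
        n=Window(
            expression=RowNumber(),
            # Same grouping as the PARTITION BY in the SQL above
            partition_by=[
                F('my_field_1'),
                F('my_field_2'),
                F('my_field_3'),
                F('my_field_4'),
                F('my_field_5'),
                F('history_change_reason'),
                F('history_type'),
                F('history_user_id'),
            ],
            order_by=[F('id').asc(), F('history_id').asc(), F('history_date').asc()],
        )
    )
    .filter(n__gt=1)  # everything after the first row in each group
    .values_list('history_id', flat=True)
)
Item.history.filter(history_id__in=duplicate_ids).delete()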

VictorColomb commented 2 months ago

I believe your SQL query does not do exactly what it is meant to, @humphrey. You mentioned that this is not the exact query you are using but a simplified one, which could have introduced the limitation that follows. At the very least, this one does not behave the same way as the clean_duplicate_history command.

Take the following example history table (much simplified), with id the primary key of the model and value its one and only attribute. Note: I've omitted the other history columns for the sake of simplicity.

id    history_id    value
1     4             0
1     3             1
1     2             0
1     1             0

You would want to delete only the entry with history_id = 2, but your query will delete the entries with history_id in (2, 4): ROW_NUMBER() groups rows by identical values regardless of whether they are consecutive, so history_id = 4 falls into the same partition as 1 and 2 even though the value changed in between.


As of right now, I have been unable to write a query using PostgreSQL window functions that behaves exactly like the clean_duplicate_history command. If I succeed, I will edit this comment with my findings.


EDIT

I came up with this procedure, which takes advantage of PostgreSQL's lag window function. The underlying DELETE is performed in batches of 500 entries in order to keep memory usage in check.

paste

tim-schilling commented 2 weeks ago

I think for this to be usable it needs to use the Django ORM. Models can have too many customizations for raw SQL to handle.

humphrey commented 2 days ago

I think for this to be usable it needs to use the Django ORM. Models can have too many customizations for raw SQL to handle.

I agree! Below is a quick proof of concept using the Django ORM to address this. The implementation uses window functions to optimize the removal of duplicate histories. I've tested it with a simple model. To give it a try, place the code in [yourapp]/management/commands/clean_duplicate_history_using_windowing.py.

Note: This approach overrides the original handle() function as a quick hack, so it skips much of the original code and will ignore most CLI options. Let me know what you think!

from django.db import models
from django.db.models import F, Window
from django.db.models.functions import Lag

from simple_history import utils
from simple_history.management.commands import clean_duplicate_history

class Command(clean_duplicate_history.Command):

    def _process_model_using_windowing(self, model_cls: type[models.Model], history_cls: type[models.Model]):
        """
        Quick hacky proof-of-concept function.
        This is an example of how the lag() window function might be used to
        optimise removal of duplicate histories.
        """
        history = utils.get_history_manager_for_model(model_cls)

        # Every concrete field on the historical model, except the instance
        # pk (used as the partition key below) and the history bookkeeping
        # columns
        tracked_fields = [
            field.name for field in history_cls._meta.fields
            if field.name not in ('id', 'history_id', 'history_date', 'history_change_reason', 'history_type', 'history_user')
        ]
        # Annotate each row with the previous row's value for every tracked
        # field, within the partition of its own model instance (id)
        lags = {
            f'_lag_{f}': Window(
                expression=Lag(f),
                partition_by=[F('id')],
                # Order chronologically so Lag() compares each entry with
                # the immediately preceding one
                order_by=F('history_date').asc(),
            )
            for f in tracked_fields
        }
        # Keep only rows whose tracked fields all equal the previous row's
        # values, i.e. exact duplicates of their predecessor. Caveat: in SQL,
        # NULL = NULL is not true, so rows where a tracked field is NULL in
        # both entries will not be flagged.
        filters = {
            f: F(f'_lag_{f}')
            for f in tracked_fields
        }

        # Build up a queryset using window functions (filtering against
        # window expressions requires Django 4.2+)
        qs = history.all()
        qs = qs.annotate(**lags)
        qs = qs.filter(**filters)

        # Execute and delete
        duplicate_history_ids = qs.values_list('history_id', flat=True)
        n_deleted, _ = history.filter(history_id__in=duplicate_history_ids).delete()

        self.stdout.write(f'Deleted {n_deleted} duplicate histories')

    def handle(self, *args, **options):
        # Overriding handle() as a quick hack to test the above windowing method
        if model_strings := options.get("models", []) or args:
            for model_pair in self._handle_model_list(*model_strings):
                self._process_model_using_windowing(*model_pair)

Would a solution along these lines be accepted?

For scenarios where window functions are less efficient than the existing method, perhaps the method could be selected using an option such as --windowing?
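Roughly, I imagine it being wired up like this (just a sketch of the proposed flag, reusing the helper names from the proof of concept above):

from simple_history.management.commands import clean_duplicate_history

class Command(clean_duplicate_history.Command):

    def add_arguments(self, parser):
        super().add_arguments(parser)
        parser.add_argument(
            '--windowing',
            action='store_true',
            default=False,
            help='Detect duplicate history entries with database window '
                 'functions instead of loading rows one by one.',
        )

    def handle(self, *args, **options):
        if options['windowing']:
            # Window-function path from the proof of concept above
            for model_pair in self._handle_model_list(*(options.get('models') or [])):
                self._process_model_using_windowing(*model_pair)
        else:
            # Fall back to the existing row-by-row implementation
            super().handle(*args, **options)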

tim-schilling commented 1 day ago

Oh, that looks really nice. I'd be a +1 on using it if we know it's an improvement.

humphrey commented 1 day ago

Oh, that looks really nice. I'd be a +1 on using it if we know it's an improvement.

Thanks!

What would be required to get something like this merged in? I assume the next step would be to run performance tests across a number of different use cases and database engines, to confirm this approach is a worthwhile improvement?

I know from experience that bulk deleting a large number of rows from a very large Postgres table can also run into performance issues, so this wouldn't be a silver bullet. I suspect providing a --batch-size option would address this, as developers could choose a size that works for their use case. A rough sketch of that idea is below.
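For illustration only (the helper name and signature are hypothetical; duplicate_ids would be the history_id list produced by the window query above):

from itertools import islice

def delete_in_batches(history, duplicate_ids, batch_size=500):
    """Delete flagged duplicate history rows in fixed-size batches, so one
    huge DELETE doesn't blow up memory or hold locks for too long."""
    ids = iter(duplicate_ids)
    total_deleted = 0
    while True:
        batch = list(islice(ids, batch_size))
        if not batch:
            break
        deleted, _ = history.filter(history_id__in=batch).delete()
        total_deleted += deleted
    return total_deleted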