jazzband / django-simple-history

Store model history and view/revert changes from admin site.
https://django-simple-history.readthedocs.org
BSD 3-Clause "New" or "Revised" License
2.19k stars 476 forks source link

Add option to avoid duplicates #302

Open kiyoqoko opened 7 years ago

kiyoqoko commented 7 years ago

Perform an additional check when True before saving a historical record. Same as in django-reversion https://django-reversion.readthedocs.io/en/stable/api.html#registration-api. I can submit a pull request.

macro1 commented 7 years ago

This has been requested as a feature several times in the past. I'm sure there is interest in having it. I believe a suggested strategy was to leverage Field Tracker. I'm sure a simple dirty flag would work just as well.

Edit: To be the devil's advocate... If you see duplicate records, technically your application is doing duplicate saves currently. You could also prevent the original duplicate saves, and the duplicate history should go away.

rossmechanic commented 6 years ago

Adding on to what @macro1 said, django-reversion mentions the downside of this:

Checking for duplicate revisions adds significant overhead to the process of creating a revision. Don’t enable it unless you really need it!

Without an efficient solution, I don't see it as something that's the responsibility of this library

nick-traeger commented 5 years ago

I could create a configuration that matches the record with diff_against on the last record before saving. The whole thing then costs a database query and some computing power extra, which would be worth it to me personally. Should I contribute this solution?

rossmechanic commented 5 years ago

Doesn't seem worth it to me with an extra db query. If you could use django-dirtyfields and avoid the db query, I'd be more open to it

Skeen commented 5 years ago

I'd love to see any implementation, one which does not produce extra db queries would be preferable, but any implementation would be a start.

Currently we have issues with duplicate histories from calls to Model.objects.update_or_create.

Could we go with @nick-traeger's idea of a quick and dirty implementation, and then eventually transition to a more efficient solution using django-dirtyfields or similar?

Skeen commented 5 years ago

What is the consensus on this, are we interested in doing a quick-and-dirty and eventually transitioning to a efficient solution, or will the efficient solution be required before a PR will be merged?

This is quite a critical issue for me, and I think I'd work on it, but I wanna know what the criteria is beforehand. I'm currently using a bulk version of the clean_duplicates method, but it takes a while since historical tables grow quite rapidly:

                       relation                       | total_size
------------------------------------------------------+------------
 public.webapp_historicalconstraintvalue              | 119 GB
 public.webapp_constraintvalue                        | 27 GB

Snippet from postgresql showing the extend of historical records. A rough review shows that 80% of the history entries are duplicates.

Edit: Data-size after another month of running:

 table_schema |            table_name            | total_size | data_size | external_size 
--------------+----------------------------------+------------+-----------+---------------
 public       | webapp_historicalconstraintvalue | 134 GB     | 42 GB     | 92 GB
 public       | webapp_constraintvalue           | 30 GB      | 2764 MB   | 27 GB

And after duplicate removal and full vacuum:

 table_schema |            table_name            | total_size | data_size | external_size 
--------------+----------------------------------+------------+-----------+---------------
 public       | webapp_constraintvalue           | 30 GB      | 2764 MB   | 27 GB
 public       | webapp_historicalconstraintvalue | 28 GB      | 10 GB     | 18 GB
JackLeo commented 3 years ago

Why not have it under some form of setting for the people who need write speed vs people who prefer to avoid background jobs cleaning up the database? The argument of an extra query is not worth it is a moot point if developers are given the choice. It just needs to be well documented and side-effects explained.