rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License
101 stars 38 forks source link

Optimization of Fake Update strategy on Postgres #124

Closed mobiware closed 1 year ago

mobiware commented 1 year ago

The subquery used to pick a random row in the Seed Table was slow since it required sorting the Seed Table for every row to update in the Target table.

This PR proposes an optimized method for picking a random row in the Seed Table. The tradeoff is that it is pseudo random, not truly random.

The idea is:

We achieved significant speed ups with this method, for instance a particular table that would take about 17 min to anonymize with the previous query only takes 16s now

rwnx commented 1 year ago

also on this:

The tradeoff is that it is pseudo random, not truly random

I was thinking through the potential issues that this might cause. The main change i can see is that multiple updates on the same table would be the same (because the id is calculated from the md5 of the row).

I think it's worth mentioning for due dilligence - I don't think that's a problem or a breaking change. people running multiple updates to a row is an edge case, most anonymization strategies i've seen contain one entry per table. I certainly wouldn't expect anyone to rely on that behaviour.

mobiware commented 1 year ago

The main change i can see is that multiple updates on the same table would be the same (because the id is calculated from the md5 of the row).

Isn't the MD5 calculated on individual columns ? I.e. if multiple updates are performed on a row but the different updated columns in that row contain different data, wouldn't that result in different fake data ?

rwnx commented 1 year ago

Isn't the MD5 calculated on individual columns ? I.e. if multiple updates are performed on a row but the different updated columns in that row contain different data, wouldn't that result in different fake data ?

Yeah, i guess so! Ha, so this is even less of a thing to worry about.

Looks great 🎉 I'll merge today and then look at releasing over the next week or so.

rwnx commented 1 year ago

Update: These changes are out in the wild in v1.25.0: https://pypi.org/project/pynonymizer/1.25.0/