rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License

Allow uniqueness for any type, not just emails and usernames #86

Closed alex-grover closed 3 years ago

alex-grover commented 3 years ago

Issue

Something I've run into while adding pynonymizer to my company's workflow is the inability to generate table-level unique fake data. It's possible for emails and usernames, but not for anything else (like random strings/pystr, for instance).

Concrete example: we have a phone_numbers table with a column number: unique varchar(10). It's currently not possible to fill this column with fake, unique number strings. If you use pystr, you will sometimes run into unique constraint violations (and if the table has more rows than the seed data set, you're guaranteed to).
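To make the failure mode concrete, here's a stdlib-only illustration (not pynonymizer internals) of drawing fake values for a unique column from a fixed seed pool. Once the table has more rows than the pool has entries, duplicates are guaranteed by the pigeonhole principle, so a UNIQUE constraint must be violated:

```python
import random

# Illustrative only: a pool of fake 10-digit "phone numbers", like a seed
# table, and a target table with more rows than the pool has entries.
pool = [f"{random.randrange(10**9):010d}" for _ in range(500)]  # 500 fakes
rows = [random.choice(pool) for _ in range(1000)]               # 1000 rows

# More rows than distinct pool values -> repeats are unavoidable, and an
# UPDATE writing these into a UNIQUE column would fail.
duplicates = len(rows) - len(set(rows))
assert duplicates > 0
print(f"{duplicates} duplicate values among {len(rows)} rows")
```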

For columns without length constraints you can kind of hack it together by using unique_email or unique_login, but there are a couple of limitations:

Solution I'd like

Ideally, you could pass an argument like unique: true to a column strategy. If that's difficult to implement, expanding unique_email and unique_login to cover more types might be sufficient for most use cases, although it would be much more useful to be able to pass arguments to the fakes and have their uniqueness guaranteed.
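Purely as a sketch of the interface being requested (the unique: key below does not exist in pynonymizer, and the surrounding keys are only loosely modeled on its fake column strategies, so treat all names as illustrative):

```yaml
# Hypothetical strategyfile snippet -- `unique: true` is the *proposed*
# option from this issue, not a real pynonymizer setting.
tables:
  phone_numbers:
    columns:
      number:
        type: fake_update
        fake_type: pystr
        unique: true   # proposed: guarantee per-table uniqueness of generated values
```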

Not sure of any workable alternatives, besides maybe changing your schema to remove the unique constraint or allow nulls, but neither of those seems like an ideal solution to the problem.

rwnx commented 3 years ago

Hi!

In short, at the current time pynonymizer's whole process is about leveraging the database engine to make bulk updates on big datasets much more efficient. The cost of that efficiency is non-uniqueness.

The reason I'm dragging my feet on this one is that it's not going to be a trivial feature to implement, since it will need its own separate update process that works on a row-by-row basis.

The "official exit trapdoor" for this behaviour is to write something as a literal in sql that meets your data format, in the same way that unique_login does. e.g. MD5(FLOOR((NOW() + RAND()) * (RAND() * RAND() / RAND()) + RAND())). Does that workaround work for you?

We can definitely use this thread to brainstorm how that row-by-row uniqueness might be implemented, which I think would be helpful.

alex-grover commented 3 years ago

understood. I can probably make the workaround work, thanks for sharing!

Another potential solution would be to allow setting the size of the seed data pool - if I can make it 10x larger than the number of rows I have to anonymize, then the chances of a collision are lower. Not sure if that exists already
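One caveat worth quantifying here (a back-of-the-envelope calculation, not a claim about pynonymizer's implementation): because of the birthday effect, a seed pool 10x the row count still makes at least one repeat very likely; the pool has to be much larger than the *square* of the row count before collisions become improbable.

```python
import math

def collision_probability(pool_size, rows):
    """Chance that sampling `rows` values uniformly (with replacement)
    from `pool_size` distinct fakes produces at least one repeat."""
    if rows > pool_size:
        return 1.0  # pigeonhole: a repeat is certain
    # exact birthday probability, computed in log space for stability
    log_no_collision = sum(math.log1p(-i / pool_size) for i in range(rows))
    return 1.0 - math.exp(log_no_collision)

rows = 1000
for factor in (1, 10, 1000):
    p = collision_probability(factor * rows, rows)
    print(f"pool = {factor:>4}x rows -> collision chance ~ {p:.3f}")
```

For 1,000 rows, even a 10,000-value pool leaves a repeat near-certain; only at around a million pool entries does the collision chance drop below a coin flip.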

rwnx commented 3 years ago

@ajgrover you can do this already 😇 check out the --seed-rows SEED_ROWS option in the CLI. / seed_rows kwarg from the main fn.

alex-grover commented 3 years ago

Awesome. I think the random-literal approach can solve nearly all cases (it works for my setup, at least), so going to close this for now. Thanks for the quick response!