rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License
101 stars 38 forks source link

Pseudononymize current value with a custom provider #130

Closed dosera closed 11 months ago

dosera commented 11 months ago

I have a table/column that I want to pseudonymize where it is not sufficient to work with the LITERAL type (my database does not support what I want to do). I created a custom provider as follows:

from faker.providers import BaseProvider

class MyProvider(BaseProvider):
    def mypseudo(self):
        return originalValue.replace("a", "b")

which I can use in my strategy file just fine:

providers:
  - providers.myprovider.MyProvider

tables:
  TABLE:
    columns:
      COLUMN:
        type: mypseudo

Describe the solution you'd like I want to have a way to give the current column value to my mypseudo to perform "things" on it.

Describe alternatives you've considered I tried using fake_args as written in the docs - but didn't manage to get the current value "injected" into my method. E.g. something like

...
      COLUMN:
        type: mypseudo
        fake_args:
          current: (SELECT COLUMN)

Did not work.

Am I missing something or is this feature not available yet?

rwnx commented 11 months ago

Hi,

LITERAL is the right fit for you here. It's the only mechanism for doing row-level operations. It's a little counter-intuitive, but the fake_update type and providers in pynonymizer is pre-seeded and that means it doesn't have access to the row data when it replaces the value.

[from README.md]

Pynonymizer's main data replacement mechanism fake_update is a random selection from a small pool of data (--seed-rows controls the available Faker data). This process is chosen for compatibility and speed of operation, but does not guarantee uniqueness

Can you add any info here on what you're trying to do to the column and what database you're using? It's hard to know how likely or unlikely your use-case is and what to recommend.

from your example (although i suspect there might be more to it!) text replacement is possible in all 3 dbs:

dosera commented 11 months ago

Hi,

LITERAL is the right fit for you here. It's the only mechanism for doing row-level operations. It's a little counter-intuitive, but the fake_update type and providers in pynonymizer is pre-seeded and that means it doesn't have access to the row data when it replaces the value.

[from README.md]

Pynonymizer's main data replacement mechanism fake_update is a random selection from a small pool of data (--seed-rows controls the available Faker data). This process is chosen for compatibility and speed of operation, but does not guarantee uniqueness

Hi, ok that's what I was afraid of.

Can you add any info here on what you're trying to do to the column and what database you're using? It's hard to know how likely or unlikely your use-case is and what to recommend.

from your example (although i suspect there might be more to it!) text replacement is possible in all 3 dbs:

Well I am using (unfortunately) mysql 5.7 where there is no regex_replace function (and I don't have the possibility to create any custom functions or similar). My concrete usecase is:

given: key1="10.0|%"; mail="email1@customer.com;email.2@customer.com"; another_property="false";

here I want to replace the content in the "mail ''key''".

Thanks.

rwnx commented 11 months ago

I agree, you definitely can't do a regexp replace on a string in 5.7. (edit: to clarify, I dont know of a nice way to do any kind of pattern match in 5.7. it's a nightmare)

But here's some alternative ideas that might help:

dosera commented 11 months ago

I agree, you definitely can't do a regexp replace on a string in 5.7. (edit: to clarify, I dont know of a nice way to do any kind of pattern match in 5.7. it's a nightmare)

But here's some alternative ideas that might help:

  • If the application supports other key names, Rename the existing key1 with something that wont be used, and concat your own value: ( CONCAT('key1="something"; ', REPLACE(foo, 'key1=', 'unusedkey1=')) ) Example
  • Do you need to preserve all the other values? you could generate the whole value with no other info: e.g. literal with some random text function ( CONCAT('key1="', LEFT(MD5(UUID()), 8), '"; mail="a_predictable_value";') ) Example

Thank your for your help & ideas. Unfortunately both won't work as the values after mail are relevant as well. Furthermore the order is not specified as well as if mail is even part of my string.

rwnx commented 11 months ago

The first example will preserve other keys and leave the rest of the string alone. Edit the sqlfiddle with more test cases and see!