rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License
102 stars 38 forks source link

Generate random string values to not encounter duplicate entry issues #106

Closed armorKing11 closed 2 years ago

armorKing11 commented 2 years ago

Is your feature request related to a problem? Please describe. Currently when you try to anonymize a column say (sys_id) of type VARCHAR and if it has duplicate foreign keys , pynonymizer throws error

Duplicate entry '0efg4t5fs5d-test1' for key 'api_product_id_sys_id'

where for api_product_id = 0yb6h5jfs5d , api_product_id the foreign key and for sys_id = test1, sys_id is the column to be updated

when using a strategy file as below

tables:
  system:
    columns:
      sys_id: ( 'test1' )

How can i specify in the strategy file to generate random string values without using the faker lbrary generators ? The faker library does not guarantee unique values cos when i tried with say word faker generator , it still gave me an error and failed

ERROR 1050 (42S01) at line 1: Table '_pynonymizer_seed_fake_data' already exists

And for a large number of rows ,the faker library generators adds an overhead (since it needs to generate the value and insert it )slowing down the anonymization process . I know that there is a RAND() function , but that generates a value from 0 to 1 . Can that be used to say generate a value range lke

test1
test2
test3

or

sdfsd
evbrr
dfdgd

And why does pynonymizer check the combination of foreign key and the column to be updated instead of the primary key and the column to be updated ?

Describe the solution you'd like

Have either a way to combine a string value and the RAND() function possibly ,like ( 'test' + RAND() ) or a random string generator where you could specify the length of the string like ( RAND_STR(5) ) where 5 is the length of the string

Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.

Additional context Add any other context or screenshots about the feature request here.

rwnx commented 2 years ago

Hi, here are some things i might be able to answer for you:

pynonymizer is only really a fancy harness for running sql statements. that error is coming from the database - pynonymizer does no 'check'

the fake_update strategy does not provide unique values.

There is some more discussion on 'uniqueness' in this issue: https://github.com/rwnx/pynonymizer/issues/86

For uniqueness, you have two options:

  1. use a unique_login column strategy (a random string) https://github.com/rwnx/pynonymizer/blob/master/doc/strategyfiles.md#column-strategy-unique_login

  2. design your own random string using the appropriate literal for your db. e.g. in your example: (select CONCAT('test', RAND()) ) https://github.com/rwnx/pynonymizer/blob/master/doc/strategyfiles.md#column-strategy-literal https://stackoverflow.com/questions/16737910/generating-a-random-unique-8-character-string-using-mysql

armorKing11 commented 2 years ago

Thank you for the options @rwnx . Had 3 questions from the above options:

  1. As per Alex's comment on seed rows , ie, setting the value of seed_rows to a larger value , what is the default value for seed_rows and for say a faker generator like word , would an increase in seed_rows prevent collisions ? https://github.com/rwnx/pynonymizer/issues/86#issuecomment-920967092

  2. When alex grover mentions about the random literal in below comment is it in reference to ( RAND() ) and if so are the random numeric values like say 0.43233 generated by RAND() converted to string before being written to the specified column field https://github.com/rwnx/pynonymizer/issues/86#issuecomment-921076192

  3. When generating a random string using your example (select CONCAT('test', RAND()) ) , how can i control the number of characters of the string ,ie, in the above case when using RAND() , how can i round the number of decimal places ?

rwnx commented 2 years ago
  1. The default is 150 - setting this to a higher value with reduce the likelihood, but will not prevent collisions. the fake_update type was not built for this use-case.
  2. to be clear: there is no "random literal" - there is a literal update strategy. this strategy just replaces the value part of an update statement.

[...] so you must ensure that it is correct and executable by your provider

There is no "checking", or conversion or any other interaction by pynonymizer. the value is pasted in verbatim. If you need a string, you need to make sure a string is being outputted by the statement.

e.g.

tables:
  system:
    columns:
      sys_id: ( 'test1' )
      other_column: ( RAND() )

# this turns into this update sql:
# UPDATE system SET sys_id = ( 'test1' ), SET other_column = ( RAND() )

the values you put here have to be parsable and executable by your database provider - in your case, mysql.

I linked you a stackoverflow post in my last reply that had a lot of discussion on "Generating a random & unique 8 character string using MySQL", but the examples in that thread could easily be customized for your exact needs. This isn't really a pynonymizer thing, but instead a general sql problem.

Sometimes i use http://sqlfiddle.com/ to try out potential values. Hope that helps ✨

armorKing11 commented 2 years ago

Thank you very much for your guidance @rwnx , appreciate it !!

rwnx commented 2 years ago

I'm closing this issue now as it seems that we've resolved the uncertainty here.