shuttle-hq / synth

The Declarative Data Generator
https://www.getsynth.com/
Apache License 2.0

Obfuscation of production data instead of randomly created #170

Open mrsarm opened 3 years ago

mrsarm commented 3 years ago

Required Functionality: Sometimes it is hard to create production-quality data using just the same schema plus random data. Some records follow strict business rules, and random data produced by generator rules is simply not enough.

Moreover, in these cases the only thing you want to anonymize is sensitive data such as usernames, passwords, and personal information.

Proposed Solution: It would be easier to have a tool that uses clever algorithms to define the model, automatically detecting most of these sensitive fields, sets the rules to anonymize or obfuscate them, and lets you edit those definitions or add new ones.

So the basics are the same: you first define a model, most of which is created automatically, and then you run the model to create the fake data, but also using production data as input.
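A rough sketch of that idea in Python, purely for illustration: it is not Synth's API or schema format. The name pattern, the `anon_` prefix, the salt, and all function names are assumptions made up for this example. Fields whose names look sensitive are detected automatically, and only those are replaced, so business fields such as a report type pass through untouched.

```python
import hashlib
import re

# Hypothetical name patterns for sensitive columns; a real tool would let
# users review and edit the detected list.
SENSITIVE_PATTERN = re.compile(
    r"(user_?name|password|email|phone|address)", re.IGNORECASE
)


def detect_sensitive_fields(record):
    """Return the keys of `record` whose names look sensitive."""
    return sorted(k for k in record if SENSITIVE_PATTERN.search(k))


def obfuscate_value(value, salt="demo-salt"):
    """Replace a value with a short, deterministic SHA-256 based token."""
    digest = hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()
    return "anon_" + digest[:12]


def obfuscate_record(record, sensitive_fields, salt="demo-salt"):
    """Copy `record`, obfuscating only the listed fields."""
    return {
        k: obfuscate_value(v, salt) if k in sensitive_fields else v
        for k, v in record.items()
    }


row = {"id": 7, "username": "mrsmith", "password": "hunter2", "report_type": "X"}
fields = detect_sensitive_fields(row)
anonymized = obfuscate_record(row, fields)
```

Because the token is deterministic, the same input always maps to the same output, which keeps joins on obfuscated values consistent across tables. A salted hash of a low-entropy value can still be guessed by brute force, so a real tool would likely need stronger replacement strategies.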

Also, as Synth currently supports, the ability to set the "size" of the output is key: a production DB sometimes contains gigabytes of data, and processing all of it is not possible, so the tool needs to be clever enough to anonymize a subset of it without querying every record in each table.

Use case: For example, when working with event-driven information, a "report" sometimes triggers the creation of many "task" records. Each task is related not just to the report but also to the other tasks: tasks of type "A", "B" and "C" are always created when a report of type "X" is created (and they are very different from the tasks created for reports of type "Y"), and each of these tasks has special fields depending on its type. Randomizing this information in a way that makes sense for the app that uses it is almost impossible: the app will crash because it expects the information to follow not only a given schema but also certain rules that cannot be generated with data definition rules.
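To make the report/task case concrete, here is a minimal Python sketch using the standard `sqlite3` module. The table and column names (`report`, `task`, `owner_email`, `report_id`) and the task type "D" are hypothetical, and nothing here reflects how Synth works. It takes a bounded number of reports and then fetches only the tasks that belong to them, so the "X" report keeps its "A", "B" and "C" tasks without scanning the whole task table.

```python
import sqlite3


def sample_reports_with_tasks(conn, limit):
    """Fetch at most `limit` reports plus every task that belongs to them."""
    reports = conn.execute(
        "SELECT id, type, owner_email FROM report ORDER BY id LIMIT ?",
        (limit,),
    ).fetchall()
    if not reports:
        return [], []
    ids = [r[0] for r in reports]
    # One "?" placeholder per sampled report id keeps the query parameterized.
    placeholders = ",".join("?" for _ in ids)
    tasks = conn.execute(
        "SELECT id, report_id, type FROM task "
        f"WHERE report_id IN ({placeholders}) ORDER BY id",
        ids,
    ).fetchall()
    return reports, tasks


# Tiny in-memory database standing in for production data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE report (id INTEGER PRIMARY KEY, type TEXT, owner_email TEXT);
CREATE TABLE task (id INTEGER PRIMARY KEY, report_id INTEGER, type TEXT);
INSERT INTO report VALUES (1, 'X', 'a@example.com'), (2, 'Y', 'b@example.com');
INSERT INTO task VALUES (1, 1, 'A'), (2, 1, 'B'), (3, 1, 'C'), (4, 2, 'D');
""")
reports, tasks = sample_reports_with_tasks(conn, limit=1)
```

This sketch simply takes the first reports by id rather than a random sample, and it leaves `owner_email` as-is; obfuscating the sensitive columns of the sampled rows would be a separate step.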