Afraid of using production data due to privacy issues? Data Anonymization is a tool that helps you build anonymized production data dumps which you can use for performance testing, security testing, debugging and development.
A Java/Kotlin version of the tool, supporting RDBMS databases, is available with a similarly easy-to-use DSL.
Install the gem using:
$ gem install data-anonymization
Install the required database adapter library for Active Record:
$ gem install sqlite3
Create a Ruby program using the data-anonymization DSL, for example my_dsl.rb:
require 'data-anonymization'
database 'DatabaseName' do
  strategy DataAnon::Strategy::Blacklist # whitelist (default) or blacklist
  # database config as active record connection hash
  source_db :adapter => 'sqlite3', :database => 'sample-data/chinook-empty.sqlite'
  # User -> table name (case sensitive)
  table 'User' do
    # id, DateOfBirth, FirstName, LastName, UserName, Password -> table column names (case sensitive)
    primary_key 'id' # composite key is also supported
    anonymize 'DateOfBirth','FirstName','LastName' # uses default anonymization based on data types
    anonymize('UserName').using FieldStrategy::StringTemplate.new('user#{row_number}')
    anonymize('Password') { |field| "password" }
  end
  ...
end
Run using:
$ ruby my_dsl.rb
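The StringTemplate argument in the example above is written in single quotes, so Ruby leaves `#{row_number}` in the string untouched and the placeholder can be filled in later, once per row. The sketch below illustrates that idea only. `TemplateField` and `render_template` are hypothetical names invented for this example, and the gem may well resolve templates differently.

```ruby
# Stand-in for the field object; only the attributes used here are modelled.
TemplateField = Struct.new(:name, :value, :row_number)

# Hypothetical helper (not the gem's implementation): replaces #{...}
# placeholders that name a field attribute, and leaves anything else as is.
def render_template(template, field)
  template.gsub(/\#\{(\w+)\}/) do |placeholder|
    key = Regexp.last_match(1)
    field.members.include?(key.to_sym) ? field[key].to_s : placeholder
  end
end

template = 'user#{row_number}'  # single quotes: no interpolation happens here
puts template                    # prints user#{row_number}
puts render_template(template, TemplateField.new('UserName', 'alice', 7))  # prints user7
```

Writing the same template in double quotes would make Ruby try to interpolate `row_number` immediately, before any row has been read.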
SQLite database
MongoDB
Postgresql database having composite primary key
Please see the Github 0.5.1 milestone page for more details on changes/fixes in release 0.5.1
Major changes:
Please see the Github 0.5.0 milestone page for more details on changes/fixes in release 0.5.0
Major changes:
Please see the Github 0.3.0 milestone page for more details on changes/fixes in release 0.3.0
MVP done. Fix defects and support queries, suggestions, enhancements logged in Github issues :-)
Please use Github issues to share feedback, feature suggestions and report issues.
Almost every project needs a production data dump in order to run performance tests, rehearse production releases and debug production issues. However, getting production data and using it is often not feasible for multiple reasons, the primary one being privacy concerns for user data. Hence the need for data anonymization. This tool helps you get an anonymized production data dump using either the Blacklist or the Whitelist strategy.
Read more about data anonymization here
This approach essentially leaves all fields unchanged with the exception of those specified by the user, which are scrambled/anonymized (hence the name blacklist).
For Blacklist, create a copy of the production database and choose the fields to be anonymized, e.g. username, password, email, name, geo location, based on user specification. Most fields have different rules, e.g. the password should be set to the same value for all users, and the email needs to be valid.
The problem with this approach is that when new fields are added they will not be anonymized by default. Human error in omitting users' personal data could be damaging.
database 'DatabaseName' do
  strategy DataAnon::Strategy::Blacklist
  source_db :adapter => 'sqlite3', :database => 'sample-data/chinook-empty.sqlite'
  ...
end
This approach, by default, scrambles/anonymizes all fields except a list of fields that are allowed to be copied as is, hence the name whitelist. By default all data needs to be anonymized, so data from the production database is sanitized record by record and inserted as anonymized data into the destination database. The source database needs to be read-only. All fields are anonymized using the default anonymization strategy, which is based on the datatype, unless a special anonymization strategy is specified. For instance, special strategies could be used for emails, passwords, usernames etc. A whitelisted field implies that it's okay to copy the data as is and anonymization isn't required. This way any new field will be anonymized by default, and if we need it as is, we add it to the whitelist explicitly. This prevents human error and protects sensitive information.
database 'DatabaseName' do
  strategy DataAnon::Strategy::Whitelist
  source_db :adapter => 'sqlite3', :database => 'sample-data/chinook.sqlite'
  destination_db :adapter => 'sqlite3', :database => 'sample-data/chinook-empty.sqlite'
  ...
end
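The difference between the two strategies is easiest to see when a column is added after the script was written. The following is a toy sketch on plain hashes, not the gem's API: `blacklist`, `whitelist` and the `ANONYMIZE` lambda are hypothetical stand-ins, and "anonymizing" simply replaces a value with a fixed marker.

```ruby
# Toy stand-in for a real field strategy: every value becomes 'xxx'.
ANONYMIZE = ->(_value) { 'xxx' }

# Blacklist: change only the listed columns, leave everything else alone.
def blacklist(row, columns)
  row.map { |col, val| [col, columns.include?(col) ? ANONYMIZE.call(val) : val] }.to_h
end

# Whitelist: copy only the listed columns as is, anonymize everything else.
def whitelist(row, columns)
  row.map { |col, val| [col, columns.include?(col) ? val : ANONYMIZE.call(val)] }.to_h
end

row = { 'id' => 1, 'FirstName' => 'Ada', 'Email' => 'ada@example.com' }
# Suppose a 'Phone' column is added later and nobody updates the script.
row_with_new_column = row.merge('Phone' => '555-0100')

# Blacklist: 'Phone' is not listed, so the real number passes through.
p blacklist(row_with_new_column, ['FirstName', 'Email'])
# Whitelist: 'Phone' is not listed, so it is anonymized like everything else.
p whitelist(row_with_new_column, ['id'])
```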
Read more about blacklist and whitelist here
schema_search_path: source_db { ... schema_search_path: 'public,my_special_schema' }
We provide a command line tool to generate whitelist scripts for RDBMS and NoSQL databases. The user supplies the connection details for the database, and a script is generated by analyzing the schema. Below are examples of how to use the tool to generate the scripts for RDBMS and NoSQL datastores.
When you install the data-anonymization tool, the datanon command becomes available on the terminal. If you run datanon --help, you should see the following:
Tasks:
datanon generate_mongo_dsl -d, --database=DATABASE -h, --host=HOST # Generates a base anonymization script(whitelist strategy) for a Mongo DB using the database schema
datanon generate_rdbms_dsl -a, --adapter=ADAPTER -d, --database=DATABASE -h, --host=HOST # Generates a base anonymization script(whitelist strategy) for a RDBMS database using the database schema
datanon help [TASK] # Describe available tasks or one specific task
The gem uses the ActiveRecord (AR) abstraction to connect to relational databases. You can generate a whitelist script in seconds for any relational database supported by Active Record. To do so, use the following command:
datanon generate_rdbms_dsl [options]
The options available are :
The adapter, host and database options are mandatory. The others are optional.
A few examples of the command are shown below:
datanon generate_rdbms_dsl -a mysql2 -h db.host.com -p 3306 -d production_db -u root -w password
datanon generate_rdbms_dsl -a postgresql -h 123.456.7.8 -d production_db
The relevant db gems must be installed so that AR has the adapters required to establish the connection to the databases. The script generates a file named rdbms_whitelist_generated.rb in the same location as the project.
As with relational databases, a whitelist script for MongoDB can be generated by analysing the database structure:
datanon generate_mongo_dsl [options]
The options available are :
The host and database options are mandatory. The others are optional.
A few examples of the command are shown below:
datanon generate_mongo_dsl -h db.host.com -d production_db -u root -w password
datanon generate_mongo_dsl -h 123.456.7.8 -d production_db
The mongo gem is required in order to install the mongo db drivers. The script generates a file named mongodb_whitelist_generated.rb in the same location as the project.
Currently the tool can run anonymization in parallel at table level, provided there are no FK constraints on the tables. It uses the Parallel gem by Michael Grosser. By default it starts multiple parallel Ruby processes, each processing tables one by one.
database 'DellStore' do
  strategy DataAnon::Strategy::Whitelist
  execution_strategy DataAnon::Parallel::Table # by default sequential table processing
  ...
end
The object that gets passed along to the field strategies has the following attribute accessors:
name: current field/column name
value: current field/column value
row_number: current row number
ar_record: active record of the current row under processing

Content | Name | Description |
---|---|---|
Text | LoremIpsum | Generates a random Lorem Ipsum string |
Text | RandomString | Generates a random string of equal length |
Text | StringTemplate | Generates a string based on provided template |
Text | SelectFromList | Randomly selects a string from a provided list |
Text | SelectFromFile | Randomly selects a string from a provided file |
Text | FormattedStringNumber | Randomize digits in a string while maintaining the format |
Text | SelectFromDatabase | Selects randomly from the result of a query on a database |
Text | RandomUrl | Anonymizes a URL while maintaining the structure |
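As an illustration of what "randomize digits while maintaining the format" means for FormattedStringNumber, here is a minimal sketch. `randomize_digits` is a hypothetical helper written for this example and is not the gem's implementation.

```ruby
# Hypothetical helper: replaces every digit with a random digit and
# leaves all other characters (spaces, dashes, brackets) untouched.
def randomize_digits(text, rng = Random.new)
  text.gsub(/\d/) { rng.rand(10).to_s }
end

original = '+1 (415) 555-0132'
masked   = randomize_digits(original)

puts original
puts masked  # same length and punctuation, different digits on most runs
```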
Content | Name | Description |
---|---|---|
Number | RandomInteger | Generates a random integer between provided limits (default 0 to 100) |
Number | RandomIntegerDelta | Generates a random integer within -delta and delta of original integer |
Number | RandomFloat | Generates a random float between provided limits (default 0.0 to 100.0) |
Number | RandomFloatDelta | Generates a random float within -delta and delta of original float |
Number | RandomBigDecimalDelta | Similar to previous but creates a big decimal object |
Content | Name | Description |
---|---|---|
Address | RandomAddress | Randomly selects an address from a geojson flat file [Default US address] |
City | RandomCity | Similar to address, picks a random city from a geojson flat file [Default US cities] |
Province | RandomProvince | Similar to address, picks a random province from a geojson flat file [Default US provinces] |
Zip code | RandomZipcode | Similar to address, picks a random zipcode from a geojson flat file [Default US zipcodes] |
Phone number | RandomPhoneNumber | Randomizes a phone number while preserving locale-specific formatting |
Content | Name | Description |
---|---|---|
DateTime | AnonymizeDateTime | Anonymizes each field (except year and seconds) within natural range of the field depending on true/false flag provided |
Time | AnonymizeTime | Same as above, except the returned object is of type 'Time' |
Date | AnonymizeDate | Anonymizes day and month within natural ranges based on true/false flag |
DateTimeDelta | DateTimeDelta | Shifts data randomly within given range. Default shifts date within 10 days + or - and shifts time within 30 minutes. |
TimeDelta | TimeDelta | Same as above, except the returned object is of type 'Time' |
DateDelta | DateDelta | Shifts date randomly within given delta range. Default shifts date within 10 days + or - |
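The delta strategies keep values close to the original instead of replacing them outright. A rough sketch of that idea, using the default ranges from the table above (10 days, 30 minutes), might look like the following. `shift_date` and `shift_time` are hypothetical helpers; the gem's own classes and parameters may differ.

```ruby
require 'date'

# Hypothetical helper: moves a Date by a random number of whole days
# in the range -days..days.
def shift_date(date, days: 10, rng: Random.new)
  date + rng.rand(-days..days)
end

# Hypothetical helper: moves a Time by up to `days` days and up to
# `minutes` minutes, each in either direction.
def shift_time(time, days: 10, minutes: 30, rng: Random.new)
  day_shift    = rng.rand(-days..days)
  minute_shift = rng.rand(-minutes..minutes)
  time + day_shift * 86_400 + minute_shift * 60
end

puts shift_date(Date.new(1990, 6, 15))
puts shift_time(Time.utc(1990, 6, 15, 12, 0, 0))
```

Because the shift is small, the anonymized data keeps roughly the same age distribution and ordering as the source, which is usually what performance tests need.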
Content | Name | Description |
---|---|---|
Email | RandomEmail | Generates email randomly using the given HOSTNAME and TLD. |
Email | GmailTemplate | Generates a valid unique gmail address by taking advantage of the gmail + strategy |
Email | RandomMailinatorEmail | Generates random email using mailinator hostname. |
Content | Name | Description |
---|---|---|
First name | RandomFirstName | Randomly picks up first name from the predefined list in the file. Default file is part of the gem. |
Last name | RandomLastName | Randomly picks up last name from the predefined list in the file. Default file is part of the gem. |
Full Name | RandomFullName | Generates full name using the RandomFirstName and RandomLastName strategies. |
User name | RandomUserName | Generates random user name of same length as original user name. |
The field parameter in the following code is a DataAnon::Core::Field:
class MyFieldStrategy
  # the anonymize method is what is required
  def anonymize(field)
    # write your code here
  end
end
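To make the skeleton above concrete, here is one possible custom strategy. It is only a sketch: `FakeField` is a hypothetical stand-in for DataAnon::Core::Field (modelling the four accessors this README lists) so the example runs without the gem, and `MaskAllButLastFour` and the 'CardNumber' column are invented for illustration.

```ruby
# Stand-in for DataAnon::Core::Field; the real object is supplied by the gem.
FakeField = Struct.new(:name, :value, :row_number, :ar_record)

# Example strategy: keeps the last four characters and masks the rest.
class MaskAllButLastFour
  def anonymize(field)
    text = field.value.to_s
    return text if text.length <= 4

    ('*' * (text.length - 4)) + text[-4, 4]
  end
end

strategy = MaskAllButLastFour.new
puts strategy.anonymize(FakeField.new('CardNumber', '4111111111111111', 1, nil))
# prints ************1111
```

Following the `.using` form shown earlier for StringTemplate, such a class would presumably be attached in a script as `anonymize('CardNumber').using MaskAllButLastFour.new`.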
Write your own anonymous field strategies within the DSL:
table 'User' do
  anonymize('Password') { |field| "password" }
  anonymize('email') do |field|
    "test+#{field.row_number}@gmail.com"
  end
end
DEFAULT_STRATEGIES = {
  :string => FieldStrategy::RandomString.new,
  :fixnum => FieldStrategy::RandomIntegerDelta.new(5),
  :bignum => FieldStrategy::RandomIntegerDelta.new(5000),
  :float => FieldStrategy::RandomFloatDelta.new(5.0),
  :bigdecimal => FieldStrategy::RandomBigDecimalDelta.new(500.0),
  :datetime => FieldStrategy::DateTimeDelta.new,
  :time => FieldStrategy::TimeDelta.new,
  :date => FieldStrategy::DateDelta.new,
  :trueclass => FieldStrategy::RandomBoolean.new,
  :falseclass => FieldStrategy::RandomBoolean.new
}
The default field strategies can be overridden; this can also be used to provide a default strategy for a missing data type.
database 'Chinook' do
  ...
  default_field_strategies :string => FieldStrategy::RandomString.new
  ...
end
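The keys in the default table look like downcased Ruby class names (`:string`, `:float`, `:trueclass`). The sketch below shows how a lookup of that shape, with user overrides layered on top, could work. The lookup rule is an assumption inferred from the key names, not taken from the gem's source, and `BUILT_IN_DEFAULTS`, `strategy_key` and `pick_strategy` are hypothetical names; plain lambdas stand in for FieldStrategy objects.

```ruby
# Placeholder strategies: plain lambdas standing in for FieldStrategy objects.
BUILT_IN_DEFAULTS = {
  string: ->(value) { 'x' * value.length },
  float:  ->(_value) { 0.0 }
}.freeze

# Assumed lookup rule: the key is the value's class name, downcased.
def strategy_key(value)
  value.class.name.downcase.to_sym
end

# User overrides win over built-ins; an unknown type raises an error.
def pick_strategy(value, overrides = {})
  BUILT_IN_DEFAULTS.merge(overrides).fetch(strategy_key(value)) do |key|
    raise ArgumentError, "no default strategy for #{key}"
  end
end

# Fill in a type the built-ins above lack (the key is :integer on Ruby 2.4+).
overrides = { strategy_key(42) => ->(_value) { 0 } }

puts pick_strategy('secret').call('secret')  # prints xxxxxx
puts pick_strategy(42, overrides).call(42)   # prints 0
```

Since Ruby 2.4, `Fixnum` and `Bignum` are unified into `Integer`, so under this assumed rule an integer value would be looked up under `:integer`. If the gem behaves this way, supplying an entry for that key through `default_field_strategies` would be one use of the "missing data type" case mentioned above.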
How do I switch off the progress bar?
# add following line in your ruby file
ENV['show_progress'] = 'false'
Logger
provides debug level messages including database queries of active record.
DataAnon::Utils::Logging.logger.level = Logger::INFO
Skip is used to skip records during anonymization when the condition returns true. These records are ignored: with blacklist they remain as they are in the database, and with whitelist they are not copied to the destination database.
table 'customers' do
  skip { |index, record| record['age'] < 18 }
  primary_key 'cust_id'
  anonymize('email').using FieldStrategy::StringTemplate.new('test+#{row_number}@gmail.com')
  anonymize 'terms_n_condition', 'age'
end
Continue is the exact opposite of Skip: anonymization continues only if the given condition returns true. With blacklist, records matching the condition are anonymized; with whitelist, records matching the condition are anonymized and copied to the new database.
table 'customers' do
  continue { |index, record| record['age'] > 18 }
  primary_key 'cust_id'
  anonymize('email').using FieldStrategy::StringTemplate.new('test+#{row_number}@gmail.com')
  anonymize 'terms_n_condition', 'age'
end
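The semantics of the two conditions can be checked on plain data. The following sketch models skip as "drop records where the block is true" and continue as "keep only records where the block is true", using the same conditions as the two examples above. The record hashes are made up, and the gem's actual record objects may differ.

```ruby
# Toy records; the blocks take (index, record) as in the DSL examples.
records = [
  { 'cust_id' => 1, 'age' => 17 },
  { 'cust_id' => 2, 'age' => 18 },
  { 'cust_id' => 3, 'age' => 42 }
]

skip_if     = ->(_index, record) { record['age'] < 18 }
continue_if = ->(_index, record) { record['age'] > 18 }

# skip: process every record for which the condition is false.
after_skip = records.each_with_index
                    .reject { |record, index| skip_if.call(index, record) }
                    .map(&:first)

# continue: process only records for which the condition is true.
after_continue = records.each_with_index
                        .select { |record, index| continue_if.call(index, record) }
                        .map(&:first)

p after_skip.map { |r| r['cust_id'] }      # prints [2, 3]
p after_continue.map { |r| r['cust_id'] }  # prints [3]
```

Note that the two example conditions are not exact complements: a customer aged exactly 18 is processed by the skip example but not by the continue example.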
git checkout -b my-new-feature
git commit -am 'Add some feature'
git push origin my-new-feature