humanmade / go-anonymize-mysqldump

Allows you to pipe data from mysqldump or an SQL file and anonymize it.
GNU General Public License v3.0
60 stars 25 forks source link
anonymization sql

anonymize-mysqldump

CircleCI

Allows you to pipe data from mysqldump or an SQL file and anonymize it:

mysqldump -u yada -pbadpass -h db | anonymize-mysqldump --config config.json > anonymized.sql
usage: anonymize-mysqldump [-h|--help] -c|--config "<value>"

                           Reads SQL from STDIN and replaces content for
                           anonymity based on the provided config.

Arguments:

  -h  --help    Print help information
  -c  --config  Path to config.json

Installation

You can download the binary for your system from the Releases page. Once downloaded and gunzip'd, move it to a location in your path such as /usr/local/bin and make it executable. For instance, to download the MacOS binary for 64 bit platforms (this is most common):

LATEST="0.3.0"
curl -OL https://github.com/humanmade/go-anonymize-mysqldump/releases/download/$LATEST/go-anonymize-mysqldump_darwin_amd64.gz
gunzip go-anonymize-mysqldump_darwin_amd64.gz
mv go-anonymize-mysqldump_darwin_amd64 /usr/local/bin/anonymize-mysqldump
chmod +x /usr/local/bin/anonymize-mysqldump

Usage

This tool is designed to read a file stream over STDIN and produce an output over STDOUT. A config file is required and can be provided via the -c or --config flag. An example config for anonymizing a WordPress database is provided at config.example.json:

curl -LO https://raw.githubusercontent.com/humanmade/go-anonymize-mysqldump/master/config.example.json

Whenever the tool experiences an error, it will output a log to STDERR. If you wish to not see that output while the command is running, redirect it to some other file (or /dev/null if you don't care):

mysqldump -u yada -pbadpass -h db | anonymize-mysqldump --config config.json 2> path/to/errors.log > anonymized.sql

Caveats

Important things to be aware of!

Config File

An example config for anonymizing a WordPress database is provided at config.example.json.

The config is composed of many objects in the patterns array:

Constraints

Supposing you have a WordPress database and you need to modify certain meta, be it user meta, post meta, or comment meta. You can use constraints to update data only whenever a certain condition is matched. For instance, let's say you have a user meta key last_ip_address. If you wanted to change that value, you can use the following config in the fields array:

{
  "field": "meta_value",
  "position": 4,
  "type": "ipv4",
  "constraints": [
    {
      "field": "meta_key",
      "position": 3,
      "value": "last_ip_address"
    }
  ]
}

Field Types

Each column stores a certain type of data, be it a name, username, email, etc. The type property in the config is used to define the type of data stored, and ultimately the type of random data to be inserted into the field. https://github.com/dmgk/faker is used for generating the fake data. These are the types currently supported:

If you need another type, please feel free to add support and file a PR!

Credit

Many thanks to Automattic/go-search-replace for serving as the starting point for this tool! Also many thanks to xwb1989/sqlparser for the SQL parsing library. I wouldn't have been able to do this without it!