nxs-data-anonymizer is a tool for anonymizing PostgreSQL and MySQL/MariaDB/Percona database dumps.
It is intended for development and project teams that deal with production data alongside test/dev/stage or dynamic namespaces with databases and need to ensure security and prevent data leaks.
Inspect your database structure and set up the nxs-data-anonymizer config in accordance with the sensitive data you need to anonymize.
You can use this tool in any way you want; the three most common ways are described below.
To anonymize your database from the console, go through the following steps.
For example, use the following command if you need to anonymize your PostgreSQL database from production to dev on the fly (the PostgreSQL client must be installed):
export PGPASSWORD=password; pg_dump -U postgres prod | /path/to/nxs-data-anonymizer -t pgsql -c /path/to/nxs-data-anonymizer.conf | psql -U postgres dev
This section describes how to integrate nxs-data-anonymizer into your GitLab CI. You may add the jobs presented below to your .gitlab-ci.yml and adjust them to your needs.
The job described in this section performs the following task:
- when a tag is created on the main branch: dump the production database, anonymize it, and upload the result into an S3 bucket

Job sample:
```yaml
anonymize:
  stage: anonymize
  image: nixyslab/nxs-data-anonymizer:latest
  variables:
    GIT_STRATEGY: none
    PG_HOST: ${PG_HOST_PROD}
    PG_USER: ${PG_USER_PROD}
    PGPASSWORD: ${PG_PASS_PROD}
  before_script:
    - echo "${S3CMD_CFG}" > ~/.s3cmd
    - echo "${NXS_DA_CFG}" > /nxs-data-anonymizer.conf
  script:
    - pg_dump -h ${PG_HOST} -U ${PG_USER} --schema=${PG_SCHEMA} ${PG_DATABASE} | /nxs-data-anonymizer -t pgsql -c /nxs-data-anonymizer.conf | gzip | s3cmd put - s3://bucket/anondump.sql.gz
  only:
    - /^v.*$/
  except:
    - branches
    - merge_requests
```
The job described in this section does the following:
- restores the stage database from the anonymized dump stored in the S3 bucket (run manually on the stage branch)

Job sample:

```yaml
restore-stage:
  stage: restore
  image: nixyslab/nxs-data-anonymizer:latest
  variables:
    GIT_STRATEGY: none
    PG_HOST: ${PG_HOST_STAGE}
    PG_USER: ${PG_USER_STAGE}
    PGPASSWORD: ${PG_PASS_STAGE}
  before_script:
    - echo "${S3CMD_CFG}" > ~/.s3cmd
  script:
    - s3cmd --no-progress --quiet get s3://bucket/anondump.sql.gz - | gunzip | PGOPTIONS="-c search_path=${PG_SCHEMA}" psql -h ${PG_HOST} -U ${PG_USER} ${PG_DATABASE}
  only:
    - stage
  when: manual
```

Note: psql has no `--schema` option, so the schema is set via `PGOPTIONS` here.
This section contains a description of CI/CD variables used in GitLab CI job samples above.
Variable | Description |
---|---|
S3CMD_CFG | S3 storage config |
PG_SCHEMA | PgSQL schema |
PG_DATABASE | PgSQL database name |
Variable | Description |
---|---|
NXS_DA_CFG | nxs-data-anonymizer config |
PG_HOST_PROD | PgSQL host |
PG_USER_PROD | PgSQL user |
PG_PASS_PROD | PgSQL password |
Variable | Description |
---|---|
PG_HOST_STAGE | PgSQL host |
PG_USER_STAGE | PgSQL user |
PG_PASS_STAGE | PgSQL password |
This section describes how to integrate nxs-data-anonymizer into your GitHub Actions. You may add the jobs presented below to your .github/workflows directory and adjust them to your needs.
The job described in this section performs the following task:
- when a v*.* tag is pushed: dump the production database, anonymize it, and upload the result into an S3 bucket

Job sample:

```yaml
on:
  push:
    tags:
      - v*.*
jobs:
  anonymize:
    runs-on: ubuntu-latest
    container:
      image: nixyslab/nxs-data-anonymizer:latest
      env:
        PG_HOST: ${{ secrets.PG_HOST_PROD }}
        PG_USER: ${{ secrets.PG_USER_PROD }}
        PGPASSWORD: ${{ secrets.PG_PASS_PROD }}
        PG_SCHEMA: ${{ secrets.PG_SCHEMA }}
        PG_DATABASE: ${{ secrets.PG_DATABASE }}
    steps:
      - name: Create services configs
        run: |
          echo "${{ secrets.S3CMD_CFG }}" > ~/.s3cmd
          echo "${{ secrets.NXS_DA_CFG }}" > /nxs-data-anonymizer.conf
      - name: Anonymize
        run: |
          pg_dump -h ${PG_HOST} -U ${PG_USER} --schema=${PG_SCHEMA} ${PG_DATABASE} | /nxs-data-anonymizer -t pgsql -c /nxs-data-anonymizer.conf | gzip | s3cmd put - s3://bucket/anondump.sql.gz
```
The job described in this section does the following:
- restores the stage database from the anonymized dump stored in the S3 bucket (triggered manually via workflow_dispatch)

Job sample:

```yaml
on: workflow_dispatch
jobs:
  restore-stage:
    runs-on: ubuntu-latest
    container:
      image: nixyslab/nxs-data-anonymizer:latest
      env:
        PG_HOST: ${{ secrets.PG_HOST_STAGE }}
        PG_USER: ${{ secrets.PG_USER_STAGE }}
        PGPASSWORD: ${{ secrets.PG_PASS_STAGE }}
        PG_SCHEMA: ${{ secrets.PG_SCHEMA }}
        PG_DATABASE: ${{ secrets.PG_DATABASE }}
    steps:
      - name: Create services configs
        run: |
          echo "${{ secrets.S3CMD_CFG }}" > ~/.s3cmd
      - name: Restore
        run: |
          s3cmd --no-progress --quiet get s3://bucket/anondump.sql.gz - | gunzip | PGOPTIONS="-c search_path=${PG_SCHEMA}" psql -h ${PG_HOST} -U ${PG_USER} ${PG_DATABASE}
```

Note: psql has no `--schema` option, so the schema is set via `PGOPTIONS` here.
This section contains a description of secrets used in GitHub Actions job samples above.
Variable | Description |
---|---|
S3CMD_CFG | S3 storage config |
PG_SCHEMA | PgSQL schema |
PG_DATABASE | PgSQL database name |
Variable | Description |
---|---|
NXS_DA_CFG | nxs-data-anonymizer config |
PG_HOST_PROD | PgSQL host |
PG_USER_PROD | PgSQL user |
PG_PASS_PROD | PgSQL password |
Variable | Description |
---|---|
PG_HOST_STAGE | PgSQL host |
PG_USER_STAGE | PgSQL user |
PG_PASS_STAGE | PgSQL password |
Default configuration file path: /nxs-data-anonymizer.conf. The file is in YAML format.
Argument | Short | Required | Takes value | Default value | Description |
---|---|---|---|---|---|
--help | -h | No | No | - | Show program help message |
--version | -v | No | No | - | Show program version |
--conf | -c | No | Yes | /nxs-data-anonymizer.conf | Configuration file path |
--input | -i | No | Yes | - | File to read data from. If not specified, stdin is used |
--log-format | -l | No | Yes | json | Log format. Available values: json or plain |
--output | -o | No | Yes | - | File to write data to. If not specified, stdout is used |
--type | -t | Yes | Yes | - | Database dump file type. Available values: pgsql, mysql |
Option | Type | Required | Default value | Description |
---|---|---|---|---|
logfile | String | No | stderr | Log file path. You may also use stdout and stderr |
loglevel | String | No | info | Log level. Available values: debug, warn, error and info |
progress | Progress | No | - | Anonymization progress logging |
variables | Map of Variables (key: variable name) | No | - | Global variables to be used in the filters. Variables are set at application init and remain unchanged during runtime |
link | Slice of Links | No | - | Rules to link specified columns across the database |
filters | Map of Filters (key: table name) | No | - | Filter sets for the specified tables. Note: for PgSQL you also need to specify a schema (e.g. public.tablename) |
security | Security | No | - | Security enforcement for the anonymizer |
Option | Type | Required | Default value | Description |
---|---|---|---|---|
rhythm | String | No | 0s | How often the read bytes count is written to the log. Progress is logged only when this option is set to a non-zero value. You may use human-readable values (e.g. 30s, 5m, etc.) |
humanize | Bool | No | false | Set to true to log the read bytes count in a human-readable format; with false, the raw bytes count is logged |
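For instance, a minimal progress block (the values here are illustrative) might look like this:

```yaml
progress:
  rhythm: 30s
  humanize: true
```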
Option | Type | Required | Default value | Description |
---|---|---|---|---|
type | String | No | template | Type of the value field: template and command are available |
value | String | Yes | - | The value to be used as the global variable value within the filters. Depending on the type, this is either a Go template or a command. See below for details |
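A sketch of a variables block defining one templated and one command-generated global variable (the variable names and the script path are hypothetical):

```yaml
variables:
  adminPassword:
    type: template
    value: "{{- randAlphaNum 16 -}}"
  apiSecret:
    type: command
    value: /path/to/gen_secret.sh
```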
Link is used to generate identical data, following the specified rules, for different cells across the database.
Each link element has the following properties:
Option | Type | Required | Default value | Description |
---|---|---|---|---|
type | String | No | template | Type of the value field: template and command are available |
value | String | Yes | - | The value used to replace every cell in the specified columns. Depending on the type, this is either a Go template or a command. See below for details |
unique | Bool | No | false | If true, the generated value is checked for uniqueness across all columns specified for the link element |
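A sketch of a link element that keeps a foreign-key relation consistent, assuming hypothetical tables public.users and public.orders (the `with` block, shown in the full example later in this document, lists the linked columns):

```yaml
link:
  - rule:
      value: "{{ randInt 1 100 }}"
      unique: true
    with:
      public.users:
        - id
      public.orders:
        - user_id
```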
Filter description for a specified table.
Option | Type | Required | Default value | Description |
---|---|---|---|---|
columns | Map of Columns (key: column name) | No | - | Filter rules for the specified columns of a table |
Option | Type | Required | Default value | Description |
---|---|---|---|---|
type | String | No | template | Type of the value field: template and command are available |
value | String | Yes | - | The value used to replace every cell in the specified column. Depending on the type, this is either a Go template or a command. See below for details |
unique | Bool | No | false | If true, the generated value is checked for uniqueness across the whole column |
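Putting these options together, a minimal filters block for a hypothetical public.users table (the column names and script path are illustrative) might look like this:

```yaml
filters:
  public.users:
    columns:
      email:
        type: template
        value: "user_{{ .Values.id }}@example.com"
        unique: true
      phone:
        type: command
        value: /path/to/phone_faker.sh
```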
Go template

To anonymize database fields you may use Go templates with the Sprig template library's functions.

Additional filter functions:
- null: sets a field value to NULL
- isNull: compares a field value with NULL
- drop: drops the whole row. If a table has filters for several columns and at least one of them returns the drop value, the whole row is skipped during anonymization

You may also use the following data in templates:
- {{ .TableName }}: the name of the current table
- {{ .CurColumnName }}: the name of the current column
- {{ .Values.COLUMN_NAME }}: the values (before substitution) of the columns in the current row (e.g. {{ .Values.username }})
- {{ .Variables.VARIABLE_NAME }}: the value of the specified global variable (e.g. {{ .Variables.password }})
- {{ .ColumnTypeRaw }}: the raw column data type
- {{ .ColumnTypeGroups }}: the regex capturing groups for the column data type; use range or index to access a specific element, e.g. {{ index .ColumnTypeGroups 0 0 }}. See Types for details

Command
To anonymize database fields you may also use commands (scripts or binaries) with any logic you need. The command concept has the following properties:
- the command's stdout is used as the new value for the anonymized field
- the command's stderr is used as an error text

The following environment variables are available within the command:
- ENVVARTABLE: contains the name of the current table
- ENVVARCURCOLUMN: contains the current column name
- ENVVARCOLUMN_{COLUMN_NAME}: contains the values (before substitution) of all columns in the current row
- ENVVARGLOBAL_{VARIABLE_NAME}: contains the value of the specified global variable
- ENVVARCOLUMNTYPERAW: contains the raw column data type
- ENVVARCOLUMNTYPEGROUP_{GROUP_NUM}_{SUBGROUPNUM}: contains the regex capturing groups for the column data type. See Types for details

Option | Type | Required | Default value | Description |
---|---|---|---|---|
policy | Policy | No | - | Security policy for entities |
exceptions | Exceptions | No | - | Exceptions for entities |
defaults | Defaults | No | - | Default filters for entities |
Option | Type | Required | Default value | Description |
---|---|---|---|---|
tables | String | No | pass | Security policy for tables. If the value skip is used, all tables not described in the config are skipped during anonymization |
columns | String | No | pass | Security policy for columns. If the value randomize is used, all columns not described in the config are randomized (with default rules according to their types) during anonymization |
The values used to masquerade columns according to their types are listed below.
PgSQL:
Type | Value to masquerade |
---|---|
smallint | 0 |
integer | 0 |
bigint | 0 |
smallserial | 0 |
serial | 0 |
bigserial | 0 |
decimal | 0.0 |
numeric | 0.0 |
real | 0.0 |
double | 0.0 |
character | randomized character data |
bpchar | randomized bpchar data |
text | randomized text data |
MySQL:
Type | Value to masquerade |
---|---|
bit | 0 |
bool | 0 |
boolean | 0 |
tinyint | 0 |
smallint | 0 |
mediumint | 0 |
int | 0 |
integer | 0 |
bigint | 0 |
float | 0.0 |
double | 0.0 |
double precision | 0.0 |
decimal | 0.0 |
dec | 0.0 |
char | randomized char (string truncated to COLUMN_SIZE length) |
varchar | randomized varchar (string truncated to COLUMN_SIZE length) |
tinytext | randomized tinytext |
text | randomized text |
mediumtext | randomized mediumtext |
longtext | randomized longtext |
enum | last value from the enum |
set | last value from the set |
date | 2024-01-01 |
datetime | 2024-01-01 00:00:00 |
timestamp | 2024-01-01 00:00:00 |
time | 00:00:00 |
year | 2024 |
json | {"randomized": "json_data"} |
binary | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
varbinary | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
tinyblob | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
blob | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
mediumblob | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
longblob | cmFuZG9taXplZCBiaW5hcnkgZGF0YQo= |
Option | Type | Required | Default value | Description |
---|---|---|---|---|
tables | Slice of strings | No | - | Names of tables without filters that are not skipped during anonymization when security.policy.tables is set to skip |
columns | Slice of strings | No | - | Names of columns (in any table) without filters that are not randomized during anonymization when security.policy.columns is set to randomize |
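For example, to leave a hypothetical public.settings table and any title column untouched by the policies:

```yaml
security:
  policy:
    tables: skip
    columns: randomize
  exceptions:
    tables:
      - public.settings
    columns:
      - title
```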
Option | Type | Required | Default value | Description |
---|---|---|---|---|
columns | Map of Filters | No | - | Default filters for columns (in any table). These filters are applied to columns with these names that have no described filters |
types | Slice of Types | No | - | Custom filters for types (in any table). With these filter rules you may override the default filters for types |
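A sketch of a defaults block that applies one filter to every column named comment, in any table (the column name is illustrative):

```yaml
security:
  defaults:
    columns:
      comment:
        value: "{{- randAlphaNum 20 -}}"
```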
Option | Type | Required | Default value | Description |
---|---|---|---|---|
regex | String | Yes | - | Regular expression matched against the column data type (in the CREATE TABLE section). Capturing groups within the regex are available as additional variable data in the filters (see Columns for details). This makes it possible to create more flexible rules that generate cell values according to data type features |
rule | Columns | Yes | - | The rule applied to columns whose data types match the specified regular expression |
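For example, a types block overriding the default filter for varchar columns might look like this (the regex and the template value are illustrative assumptions, not defaults of the tool):

```yaml
security:
  defaults:
    types:
      - regex: 'varchar\((\d+)\)'
        rule:
          value: "{{- randAlphaNum 8 -}}"
```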
Imagine you have a simple database with two tables, users and posts, in your production PgSQL, like this:

id | username | password | api_key |
---|---|---|---|
1 | admin | ZjCX6wUxtXIMtip | epezyj0cj5rqrdtxklnzxr3f333uibtz6avek7926141t1c918 |
2 | alice | tuhjLkgwwetiwf8 | 2od4vfsx2irj98hgjaoi6n7wjr02dg79cvqnmet4kyuhol877z |
3 | bob | AjRzvRp3DWo6VbA | owp7hob5s3o083d5hmursxgcv9wc4foyl20cbxbrr73egj6jkx |

id | poster_id | title | content |
---|---|---|---|
1 | 1 | example_post_1 | epezyj0cj5rqrdtxklnzxr3f333uibtz6avek7926141t1c918 |
2 | 2 | example_post_2 | 2od4vfsx2irj98hgjaoi6n7wjr02dg79cvqnmet4kyuhol877z |
3 | 3 | example_post_3 | owp7hob5s3o083d5hmursxgcv9wc4foyl20cbxbrr73egj6jkx |
4 | 1 | example_post_4 | epezyj0cj5rqrdtxklnzxr3f333uibtz6avek7926141t1c918 |
5 | 2 | example_post_5 | 2od4vfsx2irj98hgjaoi6n7wjr02dg79cvqnmet4kyuhol877z |
6 | 3 | example_post_6 | owp7hob5s3o083d5hmursxgcv9wc4foyl20cbxbrr73egj6jkx |
You need to get a dump with fake values:
- for the admin user: preset fixed values for the password and API key, to avoid having to change app settings in your dev/test/stage or local environment after downloading the dump
- for other users: usernames in the form user_N (where N is the user ID) and unique random passwords and API keys
- the links between the users and posts tables via the id and poster_id columns must be preserved
- the content column must be filled with a single generated value

In accordance with these conditions, the nxs-data-anonymizer config may look like this:

```yaml
# Global variables
variables:
  adminPassword:
    type: template
    value: "preset_admin_password"
  adminAPIKey:
    value: "preset_admin_api_key"

# Rules of behavior for fields and tables for which filters are not specified
security:
  # Specifies the required actions for tables and columns that are not specified in the configuration
  policy:
    tables: skip
    columns: randomize
  # Excludes policy actions for the specified tables and columns
  exceptions:
    tables:
      - public.posts
    columns:
      - title
  # Overrides the default policy actions for the columns specified in this block.
  # The value is generated once and substituted into all instances of the field.
  defaults:
    columns:
      content:
        value: "{{- randAlphaNum 20 -}}"

# Rules that preserve the mapping of values between tables
link:
  - rule:
      # Value generation rule
      value: "{{ randInt 1 15 }}"
      unique: true
    with:
      # Tables and columns to which the rule is applied
      public.users:
        - id
      public.posts:
        - poster_id

# Replacement rules for fields
filters:
  public.users:
    columns:
      username:
        value: "{{ if eq .Values.username \"admin\" }}{{ .Values.username }}{{ else }}user_{{ .Values.id }}{{ end }}"
      password:
        type: command
        value: /path/to/script.sh
        unique: true
      api_key:
        value: "{{ if eq .Values.username \"admin\" }}{{ .Variables.adminAPIKey }}{{ else }}{{- randAlphaNum 50 | nospace | lower -}}{{ end }}"
        unique: true
```
The /path/to/script.sh script content is the following:

```bash
#!/bin/bash

# Print the preset password if the current user is admin
if [ "$ENVVARCOLUMN_username" == "admin" ]; then
  echo -n "$ENVVARGLOBAL_adminPassword"
  exit 0
fi

# Generate a password for other users
p=$(pwgen -s 5 1 2>&1)
if [ ! $? -eq 0 ]; then
  # On error, print the message to stderr and exit with a non-zero code
  echo -n "$p" >&2
  exit 1
fi

# Print the generated password
echo $p | tr -d '\n'
exit 0
```
Now you may execute the following command to load the anonymized data into your dev DB:
pg_dump ... | ./nxs-data-anonymizer -c filters.conf | psql -h localhost -U user example
As a result:

id | username | password | api_key |
---|---|---|---|
5 | admin | preset_admin_password | preset_admin_api_key |
4 | user_2 | Pp4HY | dhx4mccxyd8ux5uf1khpbqsws8qqeqs4efex1vhfltzhtjcwcu |
7 | user_3 | vu5TW | lgkkq3csskuyew8fr52vfjjenjzudokmiidg3cohl2bertc93x |

id | poster_id | title | content |
---|---|---|---|
1 | 5 | example_post_1 | EDlT6bGXJ2LOS7CE2E4b |
2 | 4 | example_post_2 | EDlT6bGXJ2LOS7CE2E4b |
3 | 7 | example_post_3 | EDlT6bGXJ2LOS7CE2E4b |
4 | 5 | example_post_4 | EDlT6bGXJ2LOS7CE2E4b |
5 | 4 | example_post_5 | EDlT6bGXJ2LOS7CE2E4b |
6 | 7 | example_post_6 | EDlT6bGXJ2LOS7CE2E4b |
It's easy. You can find more examples in doc/examples.
The following features are in the backlog for our development team and will be released soon:
For support and feedback, please contact me:
For news and discussions, subscribe to the channels:
nxs-data-anonymizer is released under the Apache License 2.0.