rosmo / mydumper-anon

Anonymize or modify data on the fly when creating MySQL data dumps using simple YAML configuration file
13 stars 4 forks source link

I can't get --anonymize to work #5

Closed BronsCharlson closed 7 years ago

BronsCharlson commented 8 years ago

Hi,

I'm trying to anonymize data while dumping. But the dump always has the original data without any modifications. Compiling looks good - I can't see any error:

`root@testdb:~/mydumper-anon-master# cmake . -DRUN_CPPCHECK Parse error in command line argument: -DRUN_CPPCHECK Should be: VAR:type=value CMake Error: No cmake script provided. CMake Error: Problem processing arguments. Aborting.

root@testdb:~/mydumper-anon-master# cmake . -DRUN_CPPCHECK=yes -- Using mysql-config: /usr/bin/mysql_config -- Found MySQL: /usr/include/mysql, /usr/lib/x86_64-linux-gnu/libmysqlclient_r.so;/usr/lib/x86_64-linux-gnu/libpthread.so;/usr/lib/x86_64-linux-gnu/libz.so;/usr/lib/x86_64-linux-gnu/libm.so;/usr/lib/x86_64-linux-gnu/libdl.so

CMake Warning at docs/CMakeLists.txt:9 (message): Unable to find Sphinx documentation generator


-- MYSQL_CONFIG = /usr/bin/mysql_config -- CMAKE_INSTALL_PREFIX = /usr/local -- BUILD_DOCS = ON -- WITH_BINLOG = OFF -- RUN_CPPCHECK = yes -- Change a values with: cmake -D=


-- Configuring done -- Generating done -- Build files have been written to: /root/mydumper-anon-master root@testdb:~/mydumper-anon-master# make Scanning dependencies of target mydumper [ 20%] Building C object CMakeFiles/mydumper.dir/mydumper.c.o [ 40%] Building C object CMakeFiles/mydumper.dir/server_detect.c.o [ 60%] Building C object CMakeFiles/mydumper.dir/g_unix_signal.c.o [ 80%] Building C object CMakeFiles/mydumper.dir/anonymizer.c.o Linking C executable mydumper [ 80%] Built target mydumper Scanning dependencies of target myloader [100%] Building C object CMakeFiles/myloader.dir/myloader.c.o Linking C executable myloader [100%] Built target myloader `

My YAML file looks like this:

`anonymizer_settings: randomize:

wordlist_id: file_with_words_one_per_line.txt

 firstname: /root/firstnames.txt
 lastname: /root/lastnames.txt
 fullname: /root/fullnames.txt

database1: user_masterdata: randomize: first_name: type: 'firstname'

I call mydumper like this:

`/root/mydumper-anon-master/mydumper --threads 1 --verbose 3 --database database1 --tables-list user_masterdata --outputdir /mnt/backup/testdb/ --anonymize /root/anonymize.yaml -u user -p password

root@testdb:~/mydumper-anon-master# /root/run_mydumper.sh * Message: Connected to a MySQL server * Message: Started dump at: 2016-07-20 12:14:44

* Message: Thread 1 connected using MySQL connection ID 4389 * Message: Non-InnoDB dump complete, unlocking tables * Message: Thread 1 dumping data for database1.user_masterdata * Message: Thread 1 dumping schema for database1.user_masterdata * Message: Thread 1 shutting down * Message: Finished dump at: 2016-07-20 12:14:47 `

Any idea what I'm missing?

kabakaev commented 8 years ago

Can you see '--anonymize' option in the output of mydumper --help?

BronsCharlson commented 8 years ago

Wow - thx for the fast reply. Yes I can see it:

--skip-tz-utc --use-savepoints Use savepoints to reduce metadata locking issues, needs SUPER privilege --success-on-1146 Not increment error count and Warning instead of Critical in case of table doesn't exist --lock-all-tables Use LOCK TABLE for all, instead of FTWRL -U, --updated-since Use Update_time to dump only tables updated in the last U days --trx-consistency-only Transactional consistency only --anonymize Data anonymization configuration file -h, --host The host to connect to -u, --user Username with privileges to run the dump -p, --password User password -P, --port TCP/IP port to connect to -S, --socket UNIX domain socket file to use for connection --defaults-file read mysql connection options from this file -t, --threads Number of threads to use, default 4 -C, --compress-protocol Use compression on the MySQL connection -V, --version Show the program version and exit -v, --verbose Verbosity of output, 0 = silent, 1 = errors, 2 = warnings, 3 = info, default 2

Do you have a working exapmle of a simple yaml-config file? Just for randomizing names for example?

I played around with the paths and filenames of --anonymize and the wordlists inside the yaml file. If I change paths to non existing files I recieve an error (which is good). So probably I have an error in my yaml configuration. My database is named database1 and I want to randomize first_name from this table:

mysql> desc user_masterdata;
+----------------------------+------------------+------+-----+---------------------+-----------------------------+
| Field                      | Type             | Null | Key | Default             | Extra                       |
+----------------------------+------------------+------+-----+---------------------+-----------------------------+
| id                         | int(10)          | NO   | PRI | NULL                | auto_increment              |
| user_id                    | int(10)          | NO   | UNI | 0                   |                             |
| first_name                 | varchar(100)     | NO   | MUL |                     |                             |
| last_name                  | varchar(100)     | NO   | MUL |                     |                             |
tsaridas commented 8 years ago

hey @BronsCharlson

try the below

databasename:
  mytable2:
    truncate: yes
  myTable:
    edit:
      - when: {column_name: '.+'}
        set: {column_name: ''}

careful of the yaml spacing.

for the file randomization you need to have the file in the same dir a mydumper and separate general config from database config. example

anonymizer_settings:
   randomize:
     # wordlist_id: file_with_words_one_per_line.txt
     firstname: firstnames.txt

databasename:
  mytable2:
    truncate: yes
  myTable:
    edit:
      - when: {column_name: '.+'}
        set: {column_name: ''}
  mytable3:
    randomize:
      column_name:
        type: 'firstname'
BronsCharlson commented 8 years ago

Thank you for your efforts. Still I don't have any success with randomizing (which I need). I stripped down anonymize.yaml to this:

`anonymizer_settings: randomize: firstname: firstnames.txt

db_name: user_masterdata: randomize: first_name: type: 'firstname' `

mydumper, anonymize.yaml and firstnames.txt are in the same directory, mydumper creates a sql-dump - but without any randomization of the values of column _firstname:

`root@testdb:~/mydumper-anon-master# ./mydumper --threads 1 --verbose 3 --database db_name --tables-list user_masterdata --outputdir /mnt/backup/testdb/ --anonymize anonymize.yaml -u root -p password * Message: Connected to a MySQL server * Message: Started dump at: 2016-07-21 11:50:30

* Message: Thread 1 connected using MySQL connection ID 6412 * Message: Non-InnoDB dump complete, unlocking tables * Message: Thread 1 dumping data for db_name.user_masterdata * Message: Thread 1 dumping schema for db_name.user_masterdata * Message: Thread 1 shutting down * Message: Finished dump at: 2016-07-21 11:50:34 `

What about general config from database config. example you mentioned? Do I need an additional config for the database? I couldn't find anything in the documentation.

rosmo commented 8 years ago

@BronsCharlson, can you try type: datetime and see if that works? There might be bug with randomizer lists, it's a feature I haven't tested very well. Did you test if any feature works at all (like truncate)?

BronsCharlson commented 8 years ago

Now I changed anonymize.yaml to:

`anonymizer_settings: randomize: firstname: firstnames.txt

db_name: user_masterdata: truncate: yes `

But the sql-dump is created anyhow.

I also tried type: datetime:

`anonymizer_settings: randomize: firstname: firstnames.txt

db_name: user_masterdata: randomize: first_name: type: 'datetime' ` but this didn't anonymize anything (also using it on a datetime field didn't help)

I'm clueless...

tsaridas commented 8 years ago

try this out

anonymizer_settings:
  randomize:
    firstname: firstnames.txt

db_name:
  user_masterdata:
    randomize:
      first_name:
        type: 'firstname'

Yaml syndax is important

BronsCharlson commented 8 years ago

Ha! You motivated me to try out writing YAML without tabs - but just with single spaces to ident. Now it worked!!! I will try to improve my script and then I will post an update. Just for the record - afaik YAML spaces need to be consistent - but can be tabs or single spaces. Is that correct?

tsaridas commented 8 years ago

Why does YAML forbid tabs?

Tabs have been outlawed since they are treated differently by different editors and tools. And since indentation is so critical to proper interpretation of YAML, this issue is just too tricky to even attempt. Indeed Guido van Rossum of Python has acknowledged that allowing TABs in Python source is a headache for many people and that were he to design Python again, he would forbid them.

http://www.yaml.org/faq.html

BronsCharlson commented 8 years ago

brave new world - thanks a lot!