rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License
102 stars 38 forks source link

"Duplicate entry for key error" during anonymization / How to show executed SQL statement? #115

Closed b2cc closed 2 years ago

b2cc commented 2 years ago

We try to anonymize a table in our database for testing, however we run into unexpected behavior when running pynonymizer.

We run it with a fresh dump from production like this

python3 ./pynonymizer --input dump.sql.gz --strategy strategy.yaml --output anon.sql --db-type mysql --db-host 127.0.0.1 --db-port 3306 --db-name XXX --db-user XXX --db-password XXX

and we get this error:

[CREATE_DB]
[RESTORE_DB]
Restoring: 100%|██████████| 727M/727M [01:57<00:00, 6.18MB/s] 
[ANONYMIZE_DB]
creating seed table with 1 columns
Inserting seed data
Inserting seed data: 100%|██████████| 150/150 [00:00<00:00, 290.88rows/s]
Anonymizing 1 tables
ERROR 1062 (23000) at line 1: Duplicate entry '250c4390-6e2c-4ce7-8ae4-bc7b7e1c77ff-ACTION1' for key 't_table1.t_table1_uq_1'
Error while anonymizing table t_table1
Traceback (most recent call last):
  File "/home/b2c/.local/lib/python3.10/site-packages/pynonymizer/database/mysql/__init__.py", line 165, in anonymize_database
    self.__runner.db_execute(statements)
  File "/home/b2c/.local/lib/python3.10/site-packages/pynonymizer/database/mysql/execution.py", line 132, in db_execute
    self.__mask_subprocess_error(error)
  File "/home/b2c/.local/lib/python3.10/site-packages/pynonymizer/database/mysql/execution.py", line 82, in __mask_subprocess_error
    raise error from None
  File "/home/b2c/.local/lib/python3.10/site-packages/pynonymizer/database/mysql/execution.py", line 125, in db_execute
    subprocess.check_output(
  File "/usr/lib/python3.10/subprocess.py", line 420, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 524, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['mysql', '-h', '127.0.0.1', '-P', '3306', '-u', 'anon', '-p******']' returned non-zero exit status 1.
Anonymizing t_table1: 100%|██████████| 1/1 [00:00<00:00, 267.90it/s]
Traceback (most recent call last):
  File "/home/b2c/.local/bin/pynonymizer", line 8, in <module>
    sys.exit(cli())
  File "/home/b2c/.local/lib/python3.10/site-packages/pynonymizer/cli.py", line 271, in cli
    pynonymize(
  File "/home/b2c/.local/lib/python3.10/site-packages/pynonymizer/pynonymize.py", line 144, in pynonymize
    db_provider.anonymize_database(strategy)
  File "/home/b2c/.local/lib/python3.10/site-packages/pynonymizer/database/mysql/__init__.py", line 178, in anonymize_database
    raise Exception("Error during anonymization")
Exception: Error during anonymization

The line ERROR 1062 (23000) at line 1: Duplicate entry '250c4390-6e2c-4ce7-8ae4-bc7b7e1c77ff-ACTION1' for key 't_table1.t_table1_uq_1' suggests that it is trying to update the field entry_fk with an UUID plus one of the actions defined in the strategy file, but why? What are we doing wrong?

Strategyfile, using the faker word provider faker.providers.lorem.Provider.word

locale: de_DE
tables:
  - table_name: t_table1
    type: update_columns
    columns:
      access_type:
        type: fake_update
        fake_type: word
        fake_args:
          ext_word_list: ['ACTION1', 'ACTION2', 'ACTION3', 'ACTION4']

The table is structured like this:

mysql> DESC t_table1;
+-------------+-------------+------+-----+---------+----------------+
| Field       | Type        | Null | Key | Default | Extra          |
+-------------+-------------+------+-----+---------+----------------+
| id          | bigint      | NO   | PRI | NULL    | auto_increment |
| entry_fk    | char(36)    | NO   | MUL | NULL    |                |
| access_type | varchar(25) | NO   |     | NULL    |                |
| version     | bigint      | NO   |     | NULL    |                |
| created_at  | datetime(3) | NO   |     | NULL    |                |
+-------------+-------------+------+-----+---------+----------------+

A usual entry in the table looks like this:

+----+--------------------------------------+--------------+---------+-------------------------+
| id | entry_fk                           | access_type  | version | created_at              |
+----+--------------------------------------+--------------+---------+-------------------------+
|  1 | 6bcb1d81-18b4-11ed-a540-b42e99e033d5 | ACTION1 |       0 | 2000-01-01 00:00:00.000 |
+----+--------------------------------------+--------------+---------+-------------------------+

Additionally we wanted to ask if there is a way to display the actual SQL query that was executed in case of such errors? We tried --verbose, but that didn't do it.

Thank you in advance for your support, b2cc

EDIT:

rwnx commented 2 years ago

Thanks for the info - I think i can see the problem: from the error, it seems like you have a unique key called t_table1_uq_1, which i would guess is made up of (entry_fk,access_type).

this would mean you had a duplicate entry, i.e. an existing row with the same entry_fk and the same access type together.

imagine you have this data:
+----+--------------------------------------+--------------+---------+-------------------------+
| id | entry_fk                             | access_type  | version | created_at              |
+----+--------------------------------------+--------------+---------+-------------------------+
|  1 | 6bcb1d81-18b4-11ed-a540-b42e99e033d5 | ACTION1      |       0 | 2000-01-01 00:00:00.000 |
|  2 | 6bcb1d81-18b4-11ed-a540-b42e99e033d5 | ACTION2      |       0 | 2000-01-01 00:00:00.000 |
+----+--------------------------------------+--------------+---------+-------------------------+

and id=2 is updated to access_type = 'ACTION1'

Additionally we wanted to ask if there is a way to display the actual SQL query that was executed in case of such errors? We tried --verbose, but that didn't do it.

This isn't possible at the moment, but it would be a really good feature request.

b2cc commented 2 years ago

Thanks for the quick reply - you're right, I didn't notice this key at first, but in the SHOW CREATE TABLE it's visible

UNIQUE KEY `t_table1_uq_1` (`entry_fk`,`access_type`),

Guess I need to work around that :)

The other feature would be nice to have, but this issue can be closed I guess. Thanks again and all the best, b2cc