rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License

Not everything is dumped #138

Closed Olivier-Vromans closed 4 months ago

Olivier-Vromans commented 5 months ago

Describe the bug
I have a large WordPress database dump (about 183.7 MB) with wpecs tables. When I try to anonymize it, everything goes fine until the dump step. The dump stops at 69% and writes only 2736 lines, ending with a line that says the dump is complete, whereas the original file has 2048095 lines.

To Reproduce
Steps to reproduce the behavior:
pynonymizer -i ../database.sql -s wordpress_template.yml -o database_anon.sql

Logs from the command:
pynonymizer -i ../database.sql -s wordpress_template.yml -o database_anon.sql --verbose

loading strategyfile wordpress_template.yml...
Looking for locale nl_NL in provider faker.providers.address.
Provider faker.providers.address has been localized to nl_NL.
Looking for locale nl_NL in provider faker.providers.automotive.
Provider faker.providers.automotive has been localized to nl_NL.
Looking for locale nl_NL in provider faker.providers.bank.
Provider faker.providers.bank has been localized to nl_NL.
Looking for locale nl_NL in provider faker.providers.barcode.
Specified locale nl_NL is not available for provider faker.providers.barcode. Locale reset to en_US for this provider.
Looking for locale nl_NL in provider faker.providers.color.
Specified locale nl_NL is not available for provider faker.providers.color. Locale reset to en_US for this provider.
Looking for locale nl_NL in provider faker.providers.company.
Provider faker.providers.company has been localized to nl_NL.
Looking for locale nl_NL in provider faker.providers.credit_card.
Specified locale nl_NL is not available for provider faker.providers.credit_card. Locale reset to en_US for this provider.
Looking for locale nl_NL in provider faker.providers.currency.
Provider faker.providers.currency has been localized to nl_NL.
Looking for locale nl_NL in provider faker.providers.date_time.
Provider faker.providers.date_time has been localized to nl_NL.
Provider faker.providers.emoji does not feature localization. Specified locale nl_NL is not utilized for this provider.
Provider faker.providers.file does not feature localization. Specified locale nl_NL is not utilized for this provider.
Looking for locale nl_NL in provider faker.providers.geo.
Specified locale nl_NL is not available for provider faker.providers.geo. Locale reset to en_US for this provider.
Looking for locale nl_NL in provider faker.providers.internet.
Specified locale nl_NL is not available for provider faker.providers.internet. Locale reset to en_US for this provider.
Provider faker.providers.isbn does not feature localization. Specified locale nl_NL is not utilized for this provider.
Looking for locale nl_NL in provider faker.providers.job.
Specified locale nl_NL is not available for provider faker.providers.job. Locale reset to en_US for this provider.
Looking for locale nl_NL in provider faker.providers.lorem.
Provider faker.providers.lorem has been localized to nl_NL.
Looking for locale nl_NL in provider faker.providers.misc.
Specified locale nl_NL is not available for provider faker.providers.misc. Locale reset to en_US for this provider.
Looking for locale nl_NL in provider faker.providers.passport.
Specified locale nl_NL is not available for provider faker.providers.passport. Locale reset to en_US for this provider.
Looking for locale nl_NL in provider faker.providers.person.
Provider faker.providers.person has been localized to nl_NL.
Looking for locale nl_NL in provider faker.providers.phone_number.
Provider faker.providers.phone_number has been localized to nl_NL.
Provider faker.providers.profile does not feature localization. Specified locale nl_NL is not utilized for this provider.
Provider faker.providers.python does not feature localization. Specified locale nl_NL is not utilized for this provider.
Provider faker.providers.sbn does not feature localization. Specified locale nl_NL is not utilized for this provider.
Looking for locale nl_NL in provider faker.providers.ssn.
Provider faker.providers.ssn has been localized to nl_NL.
Provider faker.providers.user_agent does not feature localization. Specified locale nl_NL is not utilized for this provider.
Database: (None:None)mysql@None name: wordpress_template_9bdbab544fc84d5a87feb8c1d7c34042
[CREATE_DB]
[RESTORE_DB]
Restoring: 100% | 184M/184M [00:26<00:00, 6.84MB/s]
[ANONYMIZE_DB]
creating seed table with 9 columns
Inserting seed data
Inserting seed data: 0% | 0/150 [00:00<?, ?rows/s]
Inserting seed row 0
Inserting seed row 1
Inserting seed row 2
Inserting seed data: 2% | 3/150 [00:00<00:06, 22.90rows/s]
(seed rows 3 to 146 omitted; progress continued at roughly 22 to 24 rows/s)
Inserting seed row 147
Inserting seed row 148
Inserting seed row 149
Inserting seed data: 100% | 150/150 [00:06<00:00, 23.79rows/s]
Anonymizing 1 tables
Anonymizing wpecs_postmeta: 100% | 1/1 [00:11<00:00, 11.95s/it]
dropping seed table
Waiting for trailing operations to complete...
[DUMP_DB]
Dumping: 69% | 172M/248M [00:02<00:01, 66.5MB/s]
[DROP_DB]
Process complete!

database.anon.sql

Expected behavior
I expect the dump file to match the original file, with changes only in the tables I listed in my yml file.


rwnx commented 5 months ago

I'm not sure what's happening. Perhaps there's something wrong with mysqldump?

You could try stopping at ANONYMIZE_DB (--stop-at-step) and dumping it yourself to see what happens?
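For anyone trying the manual dump suggested here, a minimal Python sketch of that step follows. It is not part of pynonymizer. It assumes mysqldump is on the PATH and that credentials come from an option file such as ~/.my.cnf; the function names, the database name, and the output path are placeholders chosen for illustration.

```python
import subprocess


def build_dump_command(db_name, output_path):
    """Build a mysqldump argv list that writes db_name to output_path."""
    # --result-file makes mysqldump write the file itself,
    # so no shell redirection is needed.
    return ["mysqldump", f"--result-file={output_path}", db_name]


def run_manual_dump(db_name, output_path):
    """Run mysqldump; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(build_dump_command(db_name, output_path), check=True)
```

Calling run_manual_dump with the anonymized database's name and a target file produces a dump that can be compared against the one pynonymizer wrote.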

Olivier-Vromans commented 5 months ago

When I don't skip the dump step, the dump stops at around 70% and the process moves on to the next step without completing it. When I skip the dump step and dump the database manually, it works just fine.

rwnx commented 5 months ago

I'm afraid it's going to be difficult to identify the problem without a clearer idea of what it's not dumping, i.e. what information is missing from the output?
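One way to answer that question is to compare the tables created in each dump rather than the line counts. Line counts can differ a lot between tools, since mysqldump's default extended inserts pack many rows into a single INSERT line. The sketch below is only an illustration: the function names are made up here, and it assumes both files are plain-text SQL dumps with one CREATE TABLE statement per line.

```python
import re

# Matches the table name in statements such as: CREATE TABLE `wp_posts` (
CREATE_TABLE = re.compile(
    r"^\s*CREATE TABLE (?:IF NOT EXISTS )?`?([^`\s(]+)`?", re.IGNORECASE
)


def tables_in_dump(path):
    """Return the set of table names created in a SQL dump file."""
    tables = set()
    with open(path, encoding="utf-8", errors="replace") as handle:
        for line in handle:  # stream line by line; dumps can be large
            match = CREATE_TABLE.match(line)
            if match:
                tables.add(match.group(1))
    return tables


def missing_tables(original_path, anonymized_path):
    """Return sorted table names in the original dump but not in the anonymized one."""
    return sorted(tables_in_dump(original_path) - tables_in_dump(anonymized_path))
```

An empty list from missing_tables("../database.sql", "database_anon.sql") would suggest every table made it into the output, even if the line counts differ.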

I can say that the 70% part is misleading - that's just an estimate and with certain dbs it can be wrong.
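The arithmetic behind the figure in the log can be sketched as follows. The thread does not say how the 248M total is estimated, so this only shows what a bar displays when the real output turns out smaller than the estimate; the function name is hypothetical and the numbers come from the "172M/248M" line in the log.

```python
def progress_percent(bytes_written, estimated_total):
    """Percentage shown by a progress bar whose total is only an estimate."""
    return round(bytes_written / estimated_total * 100)


# The dump finished after writing 172M, but the bar's total was estimated at 248M,
# so the last frame shows 69% even though nothing is missing.
shown = progress_percent(172, 248)
```

Under that assumption, a completed dump and a bar stuck at 69% are not in conflict: the bar would only reach 100% if the estimate matched the real size.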

Olivier-Vromans commented 4 months ago

After some more testing, I found out the problem is with TablePlus: when I use the mysql CLI, the import works fine. TablePlus freezes when importing a large database and only completes 5 tables. So the misleading 70% and the TablePlus freeze made me think at first that the dump was not working.