rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License

pynonymizer doesn't seem to trap mysqldump failing? #169

Open simonblake-mp opened 1 week ago

simonblake-mp commented 1 week ago

Describe the bug
Heya! This is an issue I happened across while debugging the ticket I opened a few weeks back about AWS Aurora MySQL databases sometimes running out of RAM and rebooting while pynonymizer is running.

Recently I've seen a peculiar issue where dumps created by pynonymizer are incomplete: tables that exist in the source schema are missing from the anonymized dump. In the absence of anything obviously failing, I instrumented my pynonymizer runs with verbose logging and added some logic to complain if an anonymized dump was significantly smaller than the same dump from the previous run.

When an incomplete dump is created, the anonymize step appears to complete correctly (ANSI codes stripped, and table names redacted):

Truncating w:  91%|█████████ | 52/57 [00:03<00:00, 15.29it/s]      SET FOREIGN_KEY_CHECKS=0; TRUNCATE TABLE `w`; SET FOREIGN_KEY_CHECKS=1;
Truncating zi:  93%|█████████▎| 53/57 [00:03<00:00, 16.10it/s]SET FOREIGN_KEY_CHECKS=0; TRUNCATE TABLE `zi`; SET FOREIGN_KEY_CHECKS=1;
Truncating zo:  95%|█████████▍| 54/57 [00:03<00:00, 16.10it/s]  SET FOREIGN_KEY_CHECKS=0; TRUNCATE TABLE `zo`; SET FOREIGN_KEY_CHECKS=1;
Anonymizing y:  96%|█████████▋| 55/57 [00:03<00:00, 16.10it/s]UPDATE `y` SET `sandbox_mode` = 1;
Anonymizing y: 100%|██████████| 57/57 [00:03<00:00, 17.47it/s]

Dropping a bunch of after-script output for brevity, the log ends with this:

dropping seed table
DROP TABLE IF EXISTS `_pynonymizer_seed_fake_data`;
Waiting for trailing operations to complete...
[DUMP_DB]
SELECT data_bytes FROM (SELECT SUM(data_length) AS data_bytes FROM information_schema.tables WHERE table_schema = 'a') AS data;
Dumping:  31%|███▏      | 11.4M/36.4M [00:05<00:13, 1.89MB/s]mysqldump: Couldn't execute 'show table status like 'x'': Lost connection to MySQL server during query (2013)
Dumping:  31%|███▏      | 11.4M/36.4M [00:06<00:13, 1.87MB/s]
Skipped [DROP_DB]: (Skipping ([DROP_DB]))
Process complete!

Looks like the Aurora instance rebooted while mysqldump was running, so the dump is incomplete (which is consistent with the actual dump file, which stops at table "w"; tables with names starting with x, y, z are all missing). Pynonymizer doesn't appear to notice this happening: it exits with a return code of 0.

The obvious workaround (which I've done) is to skip dumping in the pynonymizer run and instead call mysqldump after pynonymizer has finished, in order to trap mysqldump failures. While that is fine for me, it isn't really a good general solution for everybody else.
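For illustration, a minimal Python sketch of that workaround; the command lines, database name, and output path are placeholders, not pynonymizer's actual CLI:

import os
import subprocess
import sys

# 1) Anonymize only: skip pynonymizer's own dump step via its env vars.
#    check=True raises if pynonymizer itself exits non-zero.
env = dict(os.environ, PYNONYMIZER_SKIP_STEPS="DUMP_DB")
subprocess.run(["pynonymizer"], env=env, check=True)  # plus your usual connection/strategy args

# 2) Dump ourselves, so a mid-dump failure is actually trapped.
with open("anonymized.sql", "wb") as out:
    result = subprocess.run(["mysqldump", "example_db"], stdout=out)
if result.returncode != 0:
    sys.exit(result.returncode)  # propagate mysqldump's failure to the caller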

To Reproduce
I haven't tried this, but I imagine that if one had a decently large schema that took a while to dump, one could just restart the DB, or firewall it off mid-dump: anything to break mysqldump mid-process?

Expected behavior
Pynonymizer would trap the mysqldump failure, log an error, and exit with a non-zero return code.

Additional context
I have about 60 different schemas in the same Aurora instance, and I run multiple pynonymizer instances concurrently against different schemas, with just the ANONYMIZE and DUMP stages, i.e.

export PYNONYMIZER_START_AT='ANONYMIZE_DB'
export PYNONYMIZER_SKIP_STEPS='DROP_DB'

So the fact that an Aurora instance restart happened during a mysqldump doesn't necessarily mean that the mysqldump caused the instance to run out of memory and restart; it could be an anonymization process running against some other schema. It's presumably the aggregate load of all the anonymizations and dumps happening at the same time that causes the DB to tank 🤷

rwnx commented 1 week ago

https://github.com/rwnx/pynonymizer/blob/main/pynonymizer/database/mysql/execution.py#L61C27-L61C32

I suspect it's the use of Popen here that means we don't get automatic raising on a bad exit code. This should be reasonably simple to fix (if the returncode is non-zero, raise an error).
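A minimal sketch of that fix, assuming the dump is still driven through Popen so stdout can be streamed for progress (the command line is a placeholder):

import subprocess

proc = subprocess.Popen(["mysqldump", "example_db"], stdout=subprocess.PIPE)
# ... stream proc.stdout to the dump file / progress bar as today ...
proc.wait()
if proc.returncode != 0:
    # Popen never raises on a bad exit status by itself, so without this
    # check a failed mysqldump sails straight through to "Process complete!"
    raise subprocess.CalledProcessError(proc.returncode, proc.args)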

According to this https://stackoverflow.com/questions/5631624/how-to-get-exit-code-when-using-python-subprocess-communicate-method we could do this with subprocess.call, or switch the whole thing to subprocess.run, which might be more worthwhile in the long run.
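A sketch of the subprocess.run variant, again with placeholder arguments: check=True makes a non-zero exit status raise CalledProcessError automatically, which the dump step could catch to log an error and exit non-zero.

import subprocess
import sys

try:
    with open("anonymized.sql", "wb") as out:
        subprocess.run(["mysqldump", "example_db"], stdout=out, check=True)
except subprocess.CalledProcessError as e:
    print(f"mysqldump failed with exit code {e.returncode}", file=sys.stderr)
    sys.exit(1)

The trade-off is that subprocess.run waits for the process to complete, so the streamed progress reporting would need to be handled differently.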

Postgres will also likely have the same issue.

I think your use case is a little strange, and I don't normally expect the DB to crash. Let's schedule this for the next minor version.

simonblake-mp commented 1 week ago

I 100% agree that my use case is strange, and I'd personally far prefer my DB didn't crash sporadically! That said, there are many more prosaic reasons a long-running process like mysqldump could fail mid-flight (e.g. a network or firewall failure), so it is reasonable to expect other users may be affected.

In addition to a non-zero return code, mysqldump also writes a more informative message to stderr; it may be worthwhile trapping and surfacing that as well?
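A sketch of what that could look like, assuming stderr is captured alongside the exit code (placeholder command line; not pynonymizer's actual code):

import subprocess

proc = subprocess.Popen(
    ["mysqldump", "example_db"],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
)
out, err = proc.communicate()  # real code would stream stdout for progress
if proc.returncode != 0:
    # Surface mysqldump's own diagnostic, e.g. "Lost connection to
    # MySQL server during query (2013)", instead of failing silently.
    raise RuntimeError(
        f"mysqldump exited with {proc.returncode}: "
        f"{err.decode(errors='replace').strip()}"
    )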