rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License

Mechanisms to reduce load on server? #163

Closed simonblake-mp closed 3 months ago

simonblake-mp commented 3 months ago

Is your feature request related to a problem? Please describe.

Using v1 over the last year or so, I've occasionally had a pynonymizer run against an AWS Aurora MySQL instance cause the instance to run out of memory and spontaneously reboot. That isn't ideal, particularly as it often left a schema in a half-processed state that pynonymizer would subsequently refuse to touch (`Table _pynonymizer_seed_fake_data already exists`). The most straightforward solution has been to increase the instance size to give it more memory.
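For anyone hitting the same half-processed state: the leftover seed table can be dropped by hand before re-running. A minimal sketch of the cleanup (the helper name and connection flags are hypothetical; only the table name `_pynonymizer_seed_fake_data` comes from the error above):

```python
import subprocess

SEED_TABLE = "_pynonymizer_seed_fake_data"  # table named in the error message

def drop_seed_table_cmd(host: str, user: str, schema: str) -> list:
    """Build a mysql-client command that removes the leftover seed table.

    Hypothetical helper: adjust credentials/flags for your environment.
    """
    sql = "DROP TABLE IF EXISTS `{}`.`{}`;".format(schema, SEED_TABLE)
    return ["mysql", "-h", host, "-u", user, "-e", sql]

# Example (not executed here):
# subprocess.run(drop_seed_table_cmd("db-host", "admin", "mydb"), check=True)
```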

Moving to v2 has caused an immediate increase in OOM occurrences, even with `workers=1` - so I'm having to bump up db instance sizes again.

Describe the solution you'd like

Ideally maybe some knob we could twist to reduce database load (in return for having a longer runtime, obviously) - maybe the option to insert a sleep between queries?

Describe alternatives you've considered

Additional context

Rowan, thank you for resolving the flurry of tickets I opened recently - I can confirm that environment variables named in the V1 style work fine for me in v2.2.1 🎉

rwnx commented 3 months ago

Could you add any info on the error messages you get? E.g. how do you know it's out of memory?

This would really help for devising a solution.

I'm not opposed, I just want to understand more before we commit to a solution.

Maybe there's some unexpected behaviour with the thread pool - it was supposed to be roughly the same as 1.x!

simonblake-mp commented 3 months ago

I don't get any error messages from pynonymizer, other than the standard mysql `ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query` error indicating the server is no longer reachable - which you'd kind of expect: from the PoV of the mysql client, the server simply stops responding, and after it comes back up from the reboot the TCP session will be reset.

That the instance ran out of memory is what is reported by the Aurora log on the AWS side - I don't have the actual error to hand, but from memory it was something reasonably descriptive - "I've fallen down due to running out of memory, now I'm rebooting".

It may just be that my databases have grown, and the increase in errors I'm attributing to v2 is really just schema growth - but it's been a few months since I last had to rightsize a v1 target, so I was a bit suspicious of the issue cropping up immediately after going to v2.

rwnx commented 3 months ago

If you could test with 1.25 on your current db, that would be really helpful!

simonblake-mp commented 3 months ago

I think we should probably close this one - I've done a bunch of testing with both v1 and v2 against Aurora MySQL instances in AWS, and about all I can say with confidence is "Aurora MySQL does occasionally OOM-reboot when running pynonymizer against it". However, I don't have a case I can reliably reproduce - the same combination of db instance size, worker cpu/ram, and pynonymizer versions/arguments can work fine a bunch of times in a row, then OOM randomly, then work fine again on subsequent runs.

My particular use case is a db instance with about 60 schemas on it, where I run multiple (usually 4) pynonymizer instances in parallel against different schemas. Even though the process works through the schema list alphabetically each time, the concurrency means the point loads on the db vary from run to run. However, looking back through the logs for my anonymisation process, there's evidence that pynonymizer v1 used to sporadically cause OOMs even when running without any parallelisation - so while the concurrency may elevate the risk of an OOM, it's not the root cause.
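The parallel-per-schema pattern described above can be kept to a bounded pool so the point load is capped. A sketch under stated assumptions (the worker body is a stand-in for invoking pynonymizer against one schema, e.g. via `subprocess`; `max_workers=4` mirrors the "usually 4 instances" setup):

```python
from concurrent.futures import ThreadPoolExecutor

def anonymize_schema(schema: str) -> str:
    # Stand-in for running one pynonymizer process against `schema`.
    # In a real setup this would shell out to the pynonymizer CLI.
    return schema

def run_all(schemas, max_workers=4):
    """Process schemas alphabetically, at most `max_workers` at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(anonymize_schema, sorted(schemas)))
```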

So instead, I ended up armouring my process to cope with the DB instance going AWOL for a few minutes - my process now
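"Armouring" a process against the instance going AWOL could look like a simple retry-with-backoff around the run. A sketch, with hypothetical names (the exception type and timings would depend on your driver and environment):

```python
import time

def run_with_retries(run_once, attempts=3, backoff_seconds=120):
    """Call `run_once`, retrying after connection loss (e.g. an OOM reboot)."""
    for attempt in range(1, attempts + 1):
        try:
            return run_once()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of retries, surface the failure
            time.sleep(backoff_seconds)  # wait for the instance to come back
```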

I'm not sure if it is sensible to restart a pynonymizer run that got interrupted - is that a reasonable thing to do?

rwnx commented 3 months ago

It depends on what you're anonymizing and how. E.g. if you're using a where clause, or a literal where the data depends on something that was there before, it might not work (because the data might have been partially scrubbed!). If you're just replacing data, it should be pretty safe!
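The distinction can be shown in miniature (toy functions, not pynonymizer's actual update strategies): a literal replacement converges after one pass, so restarting is safe, while a replacement derived from the existing value gives a different answer when re-applied to partially scrubbed rows:

```python
rows = ["alice@real.com", "bob@real.com"]  # toy table of email values

def scrub_literal(rows):
    """Literal replacement: idempotent, safe to restart."""
    return ["redacted@example.com" for _ in rows]

def scrub_dependent(rows):
    """Replacement derived from the current value: NOT idempotent."""
    return ["user_{}@masked.invalid".format(len(r)) for r in rows]

once = scrub_literal(rows)
assert scrub_literal(once) == once          # second pass changes nothing

dep_once = scrub_dependent(rows)
assert scrub_dependent(dep_once) != dep_once  # second pass re-scrambles
```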