rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License

parallelise the anonymisation step #131

Closed r2omvavra closed 4 months ago

r2omvavra commented 11 months ago

Is your feature request related to a problem? Please describe.

We're currently running only the ANONYMIZE_DB step on a rather large database. It takes multiple days, mainly because of two tables. The idea was to split the strategy file into three parts (a general one and one for each of the two tables) and run three processes. We think that should work without race conditions, except for the _pynonymizer_seed_fake_data_ table, which must only exist once.
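The split described above can be sketched in a few lines of Python. This is a minimal illustration, not part of pynonymizer: it assumes the strategy has already been loaded into a dict with a top-level `tables` mapping, and `split_strategy` plus the table names `users`, `orders` and `audit_log` are hypothetical.

```python
from copy import deepcopy


def split_strategy(strategy, heavy_tables):
    """Return [general, one part per heavy table], each a standalone strategy dict.

    Hypothetical helper: assumes `strategy` is a dict with a top-level
    "tables" mapping of table name -> table strategy.
    """
    tables = strategy.get("tables", {})
    missing = [name for name in heavy_tables if name not in tables]
    if missing:
        raise KeyError(f"tables not in strategy: {missing}")

    def make_part(selected):
        # Copy every top-level key except "tables" into each part,
        # then attach only the selected tables.
        part = {key: deepcopy(value) for key, value in strategy.items() if key != "tables"}
        part["tables"] = {name: deepcopy(tables[name]) for name in selected}
        return part

    general = [name for name in tables if name not in heavy_tables]
    return [make_part(general)] + [make_part([name]) for name in heavy_tables]


# Example strategy with two large tables and one small one (illustrative names).
strategy = {
    "tables": {
        "users": {"columns": {"email": "email"}},
        "orders": {"columns": {"address": "address"}},
        "audit_log": "truncate",
    }
}
parts = split_strategy(strategy, ["users", "orders"])
```

Each part could then be written to its own strategy file and handed to a separate process. The sketch does nothing about the seed table concern raised above, which would still need to be handled so the table is created only once.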

Describe the solution you'd like

The ideal solution would of course be to introduce multithreading to the ANONYMIZE_DB step. This could be a valuable feature that speeds up the process for quite a few use cases.
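As a rough illustration of the multithreading idea, the per-table work could be fanned out over a thread pool. This is only a sketch under stated assumptions, not how pynonymizer implements anything: `anonymize_tables` and `fake_anonymize` are hypothetical names, and the stand-in function replaces what would really be a database update for one table.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def anonymize_tables(table_names, anonymize_table, workers=3):
    """Run anonymize_table(name) for each table on a pool of worker threads.

    Hypothetical helper: `anonymize_table` is any callable that anonymizes
    one table and returns a result for it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Map each future back to its table name so results can be labelled.
        futures = {pool.submit(anonymize_table, name): name for name in table_names}
        for future in as_completed(futures):
            # result() re-raises any exception raised inside the worker.
            results[futures[future]] = future.result()
    return results


def fake_anonymize(name):
    # Stand-in for running the update statements for a single table.
    return f"{name}: done"


results = anonymize_tables(["users", "orders", "audit_log"], fake_anonymize, workers=3)
```

In a real implementation, any shared setup such as the seed table would presumably have to finish before the pool starts, and each worker would likely need its own database connection.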

Describe alternatives you've considered

For our more specific use case, an acceptable solution would be either to

Thank you for your efforts and for the great tool you provide!

rwnx commented 11 months ago

Hi @r2omvavra!

I wanted to come back to you to validate this. You're totally on to something and this would be an amazing feature to speed up the operation of pynonymizer.

At the moment I don't have a lot of time to build new features, but I want to keep this issue open for future consideration and prioritization. If you or anyone you know can help with the development effort of this feature, PRs are always welcome and would be considered with respect and appreciation!

I don't have a timeline for this right now, but if anything changes, I'll let you know in this issue.

r2omvavra commented 10 months ago

I gave it a go over the weekend, but being new to this code as well as to Python makes it a little challenging at the moment. So don't expect a PR anytime soon :sweat_smile:

rwnx commented 4 months ago

Hi @r2omvavra, just to let you know what's happening with this issue: I have an implementation in mind and I've added it to the v2.1.0 milestone, which is due at the end of next week.

It's going to be a CLI flag for now, for backwards compatibility, but in future it might become the default.

rwnx commented 4 months ago

This is now in main with the --workers flag. It will be released before the end of next week. I'm closing this issue; keep an eye out for the 2.1.0 release.