rwnx / pynonymizer

A universal tool for translating sensitive production database dumps into anonymized copies.
https://pypi.org/project/pynonymizer/
MIT License

Support for Third Party Faker Providers #72

Closed by theprestigedog 3 years ago

theprestigedog commented 3 years ago

Is your feature request related to a problem? Please describe.
I've noticed the documentation mentions that: "You can specify a fake update with any default Faker provider." Is it possible to add a way to utilise third-party/community Faker providers?

Describe the solution you'd like
Support for third-party/community Faker providers.

Describe alternatives you've considered
There doesn't seem to be an alternative currently.

Additional context
N/A

rwnx commented 3 years ago

To expand on this a little, I'd like to understand the problem being solved here. pynonymizer uses Faker to generate the seed data for anonymization, but I'd say the exact way it behaves (and its subsequent binding to Faker) was more of an implementation detail.

To be clear though, I support the case for adding more custom-provided generators. Adding support for 3rd party Faker providers raises implementation questions like:

Faker's CLI docs seem to indicate the following usage: https://github.com/joke2k/faker/#command-line-usage

-i {my.custom_provider other.custom_provider} list of additional custom providers to use. Note that is the import path of the package containing your Provider class, not the custom Provider class itself.

Now I'm not against that specifically, but I'd be happier knowing that we were adding a feature that enables extensibility in general and doesn't need to track Faker specifically. I'm curious to hear your thoughts here!

theprestigedog commented 3 years ago

Thanks for the response Jerome.

The problem we're trying to solve is that we have a large database with some non-traditional data located inside a single column. The format of the data looks something like this:

Firstname Lastname <Firstname.Lastname@email.com>

From what I've seen from the pynonymizer package, there doesn't seem to be a way to filter this column using the strategy.yml to provide randomised rows based on this format. I could well be wrong here though.

If that's the case, it seems the easiest way for me to achieve this might be to create a specialised Faker provider that I could hopefully point pynonymizer toward and be able to utilise it via a specific fake_type in the strategy.yml.

Happy for you to tell me there's a much easier way here though!

rwnx commented 3 years ago

There was some relevant discussion in #62 about combined fields from faker fields and data consistency. It doesn't look like it'll fit with this use case but it might help indicate where we're at.

As far as dirty workarounds go, you could reference data in the seed table in a literal, i.e. by concatenating stuff together. But it's gross and definitely brittle. 😅

I agree that the ability to use custom generators might be the answer here. It'll also bake in a code interface to custom data formats, and I think that can only be a good thing. I'll take a look at this as a feature and update here.

rwnx commented 3 years ago

I've added this feature in #75, which should release in 1.21.0.

rwnx commented 3 years ago

OK, so this is out with v1.21.0! You can check out the docs to see what the expected usage is, but take a look at the MySQL integration test as well, since I've based it on the use case here:

a custom provider: https://github.com/jerometwell/pynonymizer/blob/master/tests_integration/mysql/custom_provider.py
referencing that provider in the strategyfile: https://github.com/jerometwell/pynonymizer/blob/master/tests_integration/mysql/sakila.yml

Could you review and close the issue if it's resolved? Otherwise we can continue the discussion here 😇

theprestigedog commented 3 years ago

Brilliant, cheers Jerome. This sorts our use case out perfectly.

Cracking Python package you've put together here.