rheinwerk-verlag / pganonymize

A commandline tool for anonymizing PostgreSQL databases
http://pganonymize.readthedocs.io/

How to use faker’s localized providers? #47

Closed vincent-hatakeyama closed 1 year ago

vincent-hatakeyama commented 1 year ago

Hi,

Is it possible to use faker’s localized providers?

I need to use this one for example: https://faker.readthedocs.io/en/master/locales/fr_FR.html#faker.providers.ssn.fr_FR.Provider

Regards

hkage commented 1 year ago

Hi,

With the current version of pganonymize, the Faker library is always initialized with the default locale en_US.

To be able to use localized providers, the locale should be added as an optional argument within the YAML schema definition or as an additional property of the FakeProvider. This is currently not supported, but it is a great idea to get access to the localized providers, as we would possibly also use localized data like VAT IDs or states.

I will look into that. Thank you for reporting / requesting this feature.

Regards, Henning

hkage commented 1 year ago

I suppose the main difficulty for the implementation is performance: if we pass the locale at a table's field level within the YAML schema and instantiate the Faker class for each table record (instead of once, module-wide), this would result in poor execution time, e.g.:

>>> import timeit

>>> timeit.timeit('faker.first_name()', setup="import faker; faker = faker.Faker()", number=1000)
0.3215181827545166

>>> timeit.timeit('faker.Faker().first_name()', setup="import faker", number=1000)
14.740003108978271

So I guess the only way to avoid initialization at the record level is to provide something like global provider options within the YAML schema. These would be passed to a single, reusable Faker instance that is used for all records, like this:

tables:
 - address:
    fields:
     - first_name:
        provider:
          name: fake.first_name
     - last_name:
        provider:
          name: fake.last_name
     - vat_id:
        provider:
          name: fake.ssn

options:
  faker:
    locales:
      - de_DE
      - fr_FR
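
A minimal sketch of how such global options could feed one reusable generator. This is not pganonymize's actual implementation: the helper names `locales_from_schema` and `get_faker` are hypothetical, the schema is assumed to be already parsed into a dict, and `StubFaker` is a stand-in so the sketch runs without the Faker library. The real `faker.Faker` also accepts a list of locales and could be passed as `factory`.

```python
from functools import lru_cache


class StubFaker:
    """Hypothetical stand-in for faker.Faker; counts how often it is built."""
    instances = 0

    def __init__(self, locale=None):
        StubFaker.instances += 1
        self.locales = list(locale or ["en_US"])


def locales_from_schema(schema):
    """Read options.faker.locales from a parsed schema dict (assumed layout)."""
    locales = schema.get("options", {}).get("faker", {}).get("locales")
    # Fall back to the default locale mentioned in the thread.
    return tuple(locales) if locales else ("en_US",)


@lru_cache(maxsize=None)
def get_faker(locales, factory=StubFaker):
    """Build one generator per distinct locale tuple and reuse it afterwards."""
    return factory(list(locales))


# Parsed form of the "options" section from the schema above.
schema = {"options": {"faker": {"locales": ["de_DE", "fr_FR"]}}}
locales = locales_from_schema(schema)

# Simulate anonymizing many records: every lookup returns the cached instance.
generators = [get_faker(locales) for _ in range(1000)]
```

The locales are converted to a tuple because `lru_cache` needs hashable arguments. Caching by locale tuple keeps the expensive construction out of the per-record loop, which is the cost the timings above point at.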

Faker's multiple-locale mode could also be used to provide more than one locale, but this would also mean that common generator methods like first_name or last_name will return random names from any of the given locales (according to the locale order).
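
A short sketch of how that multiple-locale mode can still be pinned to one locale when needed, assuming the Faker library is installed. The helper names `build_multi_locale_faker` and `ssn_for_locale` are hypothetical and not part of pganonymize. Passing a list of locales to `Faker` and subscripting the instance by locale are both part of Faker's documented API.

```python
def build_multi_locale_faker(locales):
    """Return one Faker instance that covers all given locales."""
    # Imported lazily so that defining this helper does not require Faker.
    from faker import Faker
    return Faker(list(locales))


def ssn_for_locale(fake, locale):
    """Call ssn() on the generator of one specific locale."""
    # Subscripting a multi-locale Faker selects that locale's generator.
    return fake[locale].ssn()
```

With `fake = build_multi_locale_faker(["de_DE", "fr_FR"])`, a call to `ssn_for_locale(fake, "fr_FR")` should always use the French provider, whereas a plain `fake.ssn()` on the same instance may use either locale.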

hkage commented 1 year ago

The localization feature will be part of the upcoming release 0.10.0 - thanks to @BuddhaOhneHals for the contribution.