sdv-dev / RDT

A library of Reversible Data Transforms

AnonymizedFaker fails when using custom Faker provider #792

Closed rpc5102 closed 3 months ago

rpc5102 commented 5 months ago

Environment details

If you are already running RDT, please indicate the following details about the environment in which you are running it:

Problem description

Passing a custom Faker provider to the AnonymizedFaker transformer raises:

TransformerProcessingError: The 'my_providers.dummy' module does not contain a function named 'dummy'.
Refer to the Faker docs to find the correct function: https://faker.readthedocs.io/en/master/providers.html

What I already tried

I've created a dummy Faker provider using the example here: https://github.com/sdv-dev/SDV/issues/308#issuecomment-773290983

I have also tried swapping transformers as in: https://github.com/sdv-dev/SDV/issues/1372

Placing this dummy provider directly in the Faker source folder faker/faker/providers/dummy works perfectly.

Sample code

import pandas as pd

from faker import Faker
from faker.config import PROVIDERS
from my_providers.dummy import Provider

fake = Faker()

fake.add_provider(Provider)
PROVIDERS.append("my_providers.dummy")

fake.get_providers()
[<my_providers.dummy.Provider at 0x12c216010>,
<faker.providers.DynamicProvider at 0x12c216190>,
<faker.providers.user_agent.Provider at 0x10f50e810>,
<faker.providers.ssn.en_US.Provider at 0x10b1ee650>,
<faker.providers.sbn.Provider at 0x104083ad0>,
<faker.providers.python.Provider at 0x12c1ba050>,
<faker.providers.profile.Provider at 0x10365d590>,
<faker.providers.phone_number.en_US.Provider at 0x109ae4310>,
<faker.providers.person.en_US.Provider at 0x11c65e390>,
<faker.providers.passport.en_US.Provider at 0x109ae4d50>,
<faker.providers.misc.en_US.Provider at 0x12c19bfd0>,
<faker.providers.lorem.en_US.Provider at 0x10f481710>,
<faker.providers.job.en_US.Provider at 0x10f481990>,
<faker.providers.isbn.Provider at 0x12c19a0d0>,
<faker.providers.internet.en_US.Provider at 0x1099e1190>,
<faker.providers.geo.en_US.Provider at 0x1046a7e50>,
<faker.providers.file.Provider at 0x1046ad610>,
<faker.providers.emoji.Provider at 0x12c161350>,
<faker.providers.dummy_m.Provider at 0x10aaefc10>,
<faker.providers.date_time.en_US.Provider at 0x12c1607d0>,
<faker.providers.currency.en_US.Provider at 0x12c160810>,
<faker.providers.credit_card.en_US.Provider at 0x12c160e90>,
<faker.providers.company.en_US.Provider at 0x12c183d10>,
<faker.providers.color.en_US.Provider at 0x12c1608d0>,
<faker.providers.barcode.en_US.Provider at 0x12c161250>,
<faker.providers.bank.en_GB.Provider at 0x12c161650>,
<faker.providers.automotive.en_US.Provider at 0x12c161690>,
<faker.providers.address.en_US.Provider at 0x104667990>]
fake.dummy()

'bar'

from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Build a one-column DataFrame of five fake words
df = pd.DataFrame({"words": [fake.word() for _ in range(5)]})

# get metadata from df
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)
metadata.update_column(column_name="words", sdtype="text")
{
    "METADATA_SPEC_VERSION": "SINGLE_TABLE_V1",
    "columns": {
        "words": {
            "sdtype": "text"
        }
    }
}
synthesizer = GaussianCopulaSynthesizer(metadata)

from rdt.transformers.pii import AnonymizedFaker

synthesizer.auto_assign_transformers(df)

synthesizer.update_transformers(
    column_name_to_transformer={
        "words": AnonymizedFaker(
            provider_name="my_providers.dummy", function_name="dummy"
        )
    }
)

AttributeError: module 'faker.providers' has no attribute 'my_providers'

What works

Attaching my custom provider package to the faker.providers namespace as an attribute fixes the problem. The issue seems to stem from the check_provider_function check added in this commit: https://github.com/sdv-dev/RDT/commit/5e577fb39a328c70e3fc5fe7960e0d3511a20ab4#diff-c21909dc41931197bebb5afac4f76cd4c014fd9063d3d205ced9c5b2f4612ca6R55

from operator import attrgetter

import faker.providers
import my_providers.dummy

faker.providers.my_providers = my_providers
attrgetter("my_providers")(faker.providers)

synthesizer.get_transformers()

{'words': AnonymizedFaker(provider_name='my_providers.dummy', function_name='dummy')}
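
To see why attaching the package helps, here is a minimal sketch that runs without Faker or RDT installed. It assumes the check resolves provider_name by attribute lookup on the faker.providers package and then looks for the function on that module's Provider class. The names providers_ns, my_providers_ns, DummyProvider, PersonProvider and provider_function_exists are hypothetical stand-ins for illustration, not RDT's actual code.

```python
from operator import attrgetter
from types import SimpleNamespace


class DummyProvider:
    """Stand-in for the custom provider; mirrors the `dummy` method from the issue."""

    def dummy(self):
        return "bar"


class PersonProvider:
    """Stand-in for a provider that ships inside the namespace."""

    def name(self):
        return "Jane Doe"


# Assumption: plays the role of the `faker.providers` package.
providers_ns = SimpleNamespace(person=SimpleNamespace(Provider=PersonProvider))

# Assumption: plays the role of the external `my_providers` package.
my_providers_ns = SimpleNamespace(dummy=SimpleNamespace(Provider=DummyProvider))


def provider_function_exists(namespace, provider_name, function_name):
    """Return True if namespace.<provider_name>.Provider has <function_name>."""
    try:
        # attrgetter follows dotted names, e.g. "my_providers.dummy"
        module = attrgetter(provider_name)(namespace)
        getattr(module.Provider, function_name)
    except AttributeError:
        return False
    return True


builtin_ok = provider_function_exists(providers_ns, "person", "name")
before = provider_function_exists(providers_ns, "my_providers.dummy", "dummy")

# The workaround: expose the external package as an attribute of the namespace.
providers_ns.my_providers = my_providers_ns
after = provider_function_exists(providers_ns, "my_providers.dummy", "dummy")

print(builtin_ok, before, after)  # True False True
```

In this sketch the lookup never imports anything; it only walks attributes, so a provider that Faker itself can use is still invisible to the check until it is reachable from the namespace being searched.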

Am I doing something silly?

srinify commented 3 months ago

Hi there @rpc5102 👋 At the moment, we actually don't officially support the use of custom functions in AnonymizedFaker. Sorry for the confusion here!

I've opened a feature request issue here because this would be a great addition. If you want to head over there and comment with your specific use case or data type, that would be awesome. Over time, other users can share their use cases there as well.

I'll close this one out so we can focus the discussion over there. Thanks!
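
As a starting point for that discussion, here is one possible shape for a more permissive lookup. This is only a sketch under stated assumptions, not RDT's implementation or a planned design: the helper has_provider_function and its search_prefixes parameter are hypothetical, and the my_providers.dummy module is built in memory so the example runs without any files on disk and without Faker installed.

```python
import importlib
import sys
import types


class Provider:
    """Stand-in for the custom provider from the issue."""

    def dummy(self):
        return "bar"


# Register a throwaway "my_providers.dummy" module in memory for the demo.
pkg = types.ModuleType("my_providers")
mod = types.ModuleType("my_providers.dummy")
mod.Provider = Provider
pkg.dummy = mod
sys.modules["my_providers"] = pkg
sys.modules["my_providers.dummy"] = mod


def has_provider_function(provider_name, function_name,
                          search_prefixes=("faker.providers.", "")):
    """Hypothetical check: try the built-in prefix first, then a plain import."""
    for prefix in search_prefixes:
        try:
            module = importlib.import_module(prefix + provider_name)
        except ImportError:
            # Not found under this prefix; try the next one.
            continue
        provider = getattr(module, "Provider", None)
        if provider is not None and hasattr(provider, function_name):
            return True
    return False


found = has_provider_function("my_providers.dummy", "dummy")
wrong_function = has_provider_function("my_providers.dummy", "missing")
wrong_module = has_provider_function("unknown_pkg_for_demo.mod", "dummy")

print(found, wrong_function, wrong_module)  # True False False
```

The design choice here is to import by module path rather than walk attributes of a single package, so any importable provider module would pass the check without patching faker.providers.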