pvcy / presidio

MIT License

Push down empty string and null filtering from privacy-api #22

Closed by tconbeer 1 year ago

tconbeer commented 1 year ago

From this comment by @willsthompson.

In the Privacy API's PII classification, we first modify the dataframe's Column object to replace empty strings with null values, and then drop all null values before passing it to Presidio. We should push this transformation down into Presidio so that it happens before sampling, by replacing this:

https://github.com/pvcy/presidio/blob/4ef67722c1f6011b6b7c70f802533e9d819f3368/presidio-analyzer/presidio_analyzer/entity_source.py#L50

with this:

col = (
    series.replace('', np.nan)  # treat empty strings as nulls
        .dropna()  # drop nulls before sampling
        .sample(sample_size, random_state=randomizer_seed)
)
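
For reference, here is a minimal runnable sketch of the same transformation. The helper name `clean_and_sample` and the sample data are hypothetical and not part of `entity_source.py`. The calls are chained without `inplace=True`, since in-place pandas methods have historically returned `None`, which would break the chain. Clamping the sample size is an extra assumption on my part: `sample` raises if asked for more rows than remain after filtering.

```python
import numpy as np
import pandas as pd


def clean_and_sample(series, sample_size, randomizer_seed=None):
    """Hypothetical helper: drop empty strings and nulls, then sample."""
    # replace() and dropna() each return a new Series, so they can be chained.
    cleaned = series.replace("", np.nan).dropna()
    # Assumption: clamp n so sample() does not raise when few rows survive.
    n = min(sample_size, len(cleaned))
    return cleaned.sample(n, random_state=randomizer_seed)


# Made-up column with empty strings and a null mixed in.
series = pd.Series(["alice@example.com", "", None, "555-0100", ""])
col = clean_and_sample(series, sample_size=10, randomizer_seed=42)
```

Whether Presidio should clamp the sample size or raise on an undersized column is a separate design question; the clamp is only there to keep the sketch from erroring.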