microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

How to Add Custom Functions for Anonymization in Presidio Structured #1353

Closed ardhendu21 closed 8 months ago

ardhendu21 commented 8 months ago

Hello,

I'm currently working with the Presidio structured package to anonymize personal information in pandas DataFrames.

I'd like to extend this by adding custom anonymization for entities that are not predefined. I'd also like to know how to remove entities that are already defined.

Here is a snippet of my code, which I adapted from the examples on GitHub.

import pandas as pd
from presidio_structured import StructuredEngine, PandasAnalysisBuilder
from presidio_anonymizer.entities import OperatorConfig
from faker import Faker

pandas_engine = StructuredEngine()

sample_df = pd.DataFrame({'name': ['John Doe', 'Jane Smith'], 'email': ['john.doe@example.com', 'jane.smith@example.com']})

fake = Faker()
operators = {
    "PERSON": OperatorConfig("replace", {"new_value": "REDACTED"}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.safe_email()})
}

try:
    tabular_analysis = PandasAnalysisBuilder().generate_analysis(sample_df)
    anonymized_df = pandas_engine.anonymize(sample_df, tabular_analysis, operators=operators)
    print(anonymized_df)
except Exception as e:
    print(f"Error during anonymization: {e}")

Can someone help me with this?

omri374 commented 8 months ago

Hi, you can pass an AnalyzerEngine instance to your PandasAnalysisBuilder, and use the standard Presidio configuration capabilities to create recognizers and such.

For example:

import pandas as pd
from presidio_structured import StructuredEngine, PandasAnalysisBuilder
from presidio_anonymizer.entities import OperatorConfig
from faker import Faker

from presidio_analyzer import AnalyzerEngine, PatternRecognizer

# Faker instance used by the custom EMAIL_ADDRESS operator below
fake = Faker()

operators = {
    "DEFAULT": OperatorConfig("replace", {"new_value": "<ANONYMIZED>"}),
    "PERSON": OperatorConfig("replace", {"new_value": "REDACTED"}),
    "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": lambda x: fake.safe_email()})
}

# input data
sample_df = pd.DataFrame({
    "title": ["Mr.", "Ms.", "Mrs."],
    "name": ["Arthur", "David", "William"],
    "sign": ["Plus", "Minus", "Minus"],
})

# define custom PII detection (in this case with a deny-list)
titles_list = [
    "Sir",
    "Ma'am",
    "Madam",
    "Mr.",
    "Mrs.",
    "Ms.",
    "Miss",
    "Dr.",
    "Professor",
]

titles_recognizer = PatternRecognizer(supported_entity="TITLE", deny_list=titles_list)
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(titles_recognizer)

# Presidio structured
pandas_engine = StructuredEngine()
analysis_builder = PandasAnalysisBuilder(analyzer=analyzer)
tabular_analysis = analysis_builder.generate_analysis(sample_df)
anonymized_df = pandas_engine.anonymize(sample_df, tabular_analysis, operators=operators)
print(anonymized_df)

To create new custom recognizers or remove existing ones, see the tutorial.
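As for the custom anonymization function itself: the "custom" operator in the example above is given a callable under the "lambda" key, and as far as I know that callable should accept the detected text as a string and return a string. A named function can be used instead of an inline lambda. Below is a minimal sketch of one, using only the standard library. The function name pseudonymize_email and the salt value are hypothetical, made up for illustration, and hashing like this is pseudonymization rather than strong anonymization.

```python
import hashlib


def pseudonymize_email(value: str, salt: str = "demo-salt") -> str:
    """Map an input string to a stable, fake-looking e-mail address.

    Hypothetical helper, not part of Presidio. The same input (with the
    same salt) always yields the same output, so repeated values in a
    column stay consistent after replacement.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return f"user_{digest[:10]}@example.com"


# Assumed wiring, mirroring the operators dict above (not executed here):
# operators = {
#     "EMAIL_ADDRESS": OperatorConfig("custom", {"lambda": pseudonymize_email}),
# }

if __name__ == "__main__":
    first = pseudonymize_email("john.doe@example.com")
    second = pseudonymize_email("john.doe@example.com")
    other = pseudonymize_email("jane.smith@example.com")
    print(first == second)  # same input gives the same replacement
    print(first != other)   # different inputs give different replacements
```

Unlike fake.safe_email(), which returns a new random address on every call, this keeps the mapping stable, which can help if you need to join or group on the anonymized column afterwards.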

omri374 commented 8 months ago

Closing the issue. Feel free to reopen it if you have any additional questions.