Broadcast custom `Operator` in Spark session

Ilia-Kosenkov commented 1 year ago

Hello, I am investigating the application of presidio to some of our internal data processing. I am interested in using it in Spark, and while the basic example works more or less reliably, I ran into an issue when applying a custom anonymizing Operator while running anonymization in Spark.

Here is a sample operator I have created

from presidio_anonymizer.operators import Operator, OperatorType


class ReverseAnonymizer(Operator):
    def operate(self, text: str, params: dict) -> str:
        return text[::-1]

    def validate(self, params: dict) -> None:
        pass

    def operator_name(self) -> str:
        return "reverse"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize

And this is how I use it

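from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

# Local (non-Spark) engines; in Spark these are the broadcast objects.
analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()


def anonymize_text(text: str) -> str: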
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={"DEFAULT": OperatorConfig("reverse")},
    )
    return anonymized_results.text

And a real example

input_str = "My name is John Smith and my number is 123456789."
anon_str = anonymize_text(input_str)

print(f"'{input_str}' -> '{anon_str}'")
# 'My name is John Smith and my number is 123456789.' -> 'My name is htimS nhoJ and my number is 987654321.'

In Spark, however, I get this:

presidio_anonymizer.entities.invalid_exception.InvalidParamException: Invalid operator class 'reverse'

I suspect this is related to how OperatorsFactory works: it indexes all of the types derived from Operator, but only when it is called, so when I broadcast the analyzer and the anonymizer, this information is not captured. Is there a way I can embed the operators' metadata into the anonymizer prior to broadcasting? Or maybe I can explicitly broadcast an instance of the operator and then use it in the .anonymize() call?
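To make the suspicion above concrete, here is a minimal toy model of a lazily built subclass index. It is only a sketch: ToyOperator, Redact, Reverse and ToyFactory are hypothetical stand-ins written for illustration, not presidio classes, and the real OperatorsFactory may behave differently. It shows how a lookup can fail with the same kind of error when the index was built in a context where the custom class did not exist yet (compare a worker process that never defined it).

```python
class ToyOperator:
    """Toy base class; a stand-in, not presidio's Operator."""

    name = ""

    def operate(self, text: str) -> str:
        raise NotImplementedError


class Redact(ToyOperator):
    """Plays the role of a built-in operator."""

    name = "redact"

    def operate(self, text: str) -> str:
        return "*" * len(text)


class ToyFactory:
    """Hypothetical factory that indexes ToyOperator subclasses on first use."""

    _index = None  # built lazily, then cached on the class

    @classmethod
    def index(cls) -> dict:
        if cls._index is None:
            cls._index = {sub.name: sub for sub in ToyOperator.__subclasses__()}
        return cls._index

    @classmethod
    def create(cls, name: str) -> ToyOperator:
        try:
            return cls.index()[name]()
        except KeyError:
            raise ValueError(f"Invalid operator class '{name}'") from None


# The index is built here, before Reverse exists.
ToyFactory.index()


class Reverse(ToyOperator):
    name = "reverse"

    def operate(self, text: str) -> str:
        return text[::-1]


# The cached index does not know about the class defined afterwards.
try:
    ToyFactory.create("reverse")
    lookup_failed = False
except ValueError:
    lookup_failed = True

# Explicit registration makes the lookup succeed.
ToyFactory.index()["reverse"] = Reverse
reversed_text = ToyFactory.create("reverse").operate("John Smith")
```

In this toy, the fix is an explicit registration step that runs in the same process as the lookup; whether the same holds for presidio on Spark workers is an assumption that would need testing.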

omri374 commented 1 year ago

Hi @Ilia-Kosenkov,

Adding a new operator to presidio-anonymizer currently requires re-packaging the code (see https://microsoft.github.io/presidio/anonymizer/adding_operators/).

What I suggest instead is to use the "custom" operator. This code hasn't been tested on PySpark but should work:

def anonymize_text(text: str) -> str:
    def operate(text: str) -> str:
        return text[::-1]

    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "EMAIL_ADDRESS": OperatorConfig(
                "custom", {"lambda": operate}
            )
        },
    )

    return anonymized_results.text

This might be a cleaner solution, but I'm not 100% sure it works on PySpark with pandas_udf:

def operate(text: str) -> str:
    return text[::-1]

def anonymize_text(text: str) -> str:
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "EMAIL_ADDRESS": OperatorConfig(
                "custom", {"lambda": operate}
            )
        },
    )

    return anonymized_results.text

EDIT:

The following hack adds the operator to the default list. It touches a private attribute, so it should be tested before relying on it:

from presidio_anonymizer.entities import OperatorConfig
from presidio_anonymizer.operators import Operator, OperatorType


class ReverseAnonymizer(Operator):
    def operate(self, text: str, params: dict) -> str:
        return text[::-1]

    def validate(self, params: dict) -> None:
        pass

    def operator_name(self) -> str:
        return "reverse"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize

anonymizer.operators_factory._anonymizers["reverse"] = ReverseAnonymizer

def anonymize_text(text: str) -> str:
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "DEFAULT": OperatorConfig("reverse")
        },
    )

    return anonymized_results.text
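On Spark, a registration like the one above would presumably need to run inside each worker process rather than only on the driver. A minimal sketch of that pattern follows, assuming a per-process lazy singleton; ToyEngine, get_engine and anonymize_partition are hypothetical names for illustration, ToyEngine is not presidio's AnonymizerEngine, and the toy reverses whole rows instead of detected entities.

```python
from functools import lru_cache
from typing import Callable, Dict, Iterable, Iterator


class ToyEngine:
    """Hypothetical stand-in for an anonymizer engine; not presidio's API."""

    def __init__(self) -> None:
        self.operators: Dict[str, Callable[[str], str]] = {}

    def anonymize(self, text: str, operator: str) -> str:
        return self.operators[operator](text)


@lru_cache(maxsize=None)
def get_engine() -> ToyEngine:
    """Build the engine once per Python process and register the operator there."""
    engine = ToyEngine()
    engine.operators["reverse"] = lambda text: text[::-1]
    return engine


def anonymize_partition(rows: Iterable[str]) -> Iterator[str]:
    """Has the shape of a function one could pass to rdd.mapPartitions."""
    engine = get_engine()  # resolved where this function runs, not on the driver
    for row in rows:
        yield engine.anonymize(row, "reverse")


# Local check with a plain list standing in for one partition.
result = list(anonymize_partition(["John Smith", "123456789"]))
```

The design choice here is that nothing engine-related is captured from the driver: the worker builds and configures its own engine on first use and reuses it for later partitions in the same process.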

cc @shiranr in case she has anything to add.

Ilia-Kosenkov commented 1 year ago

Hey @omri374, thanks for these examples. Indeed, the lambda path is the easiest one, but I was curious how it should be done properly (I assumed I was missing something or had misread the docs). The source of confusion is that AnalyzerEngine has a .registry.add_recognizer() method, which allows supplying custom recognizers, but the equivalent functionality is missing in AnonymizerEngine. If there is a feature request somewhere for this functionality, I'd gladly +1 it :)

I have yet to try out the 'hack' on Spark, but locally I encountered an issue -- in my case, the ._anonymizers dictionary was not initialized at the moment of assignment, so I had to slightly modify the example:

anonymizer.operators_factory._anonymizers = { 'reverse' : ReverseAnonymizer() }

omri374 commented 1 year ago

Adding custom operators into the operators factory is still an open issue. Any community contributions are welcome.

omri374 commented 5 months ago

Should be solved in #1284

microsoft/presidio #1052