Open Ilia-Kosenkov opened 1 year ago
Hi @Ilia-Kosenkov,
Adding a new operator to presidio-anonymizer currently requires re-packaging the code (see https://microsoft.github.io/presidio/anonymizer/adding_operators/).
What I suggest instead is to use the custom operator. This code hasn't been tested on PySpark but should work:
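from presidio_anonymizer.entities import OperatorConfig

# `analyzer` and `anonymizer` are assumed to be existing AnalyzerEngine
# and AnonymizerEngine instances.
def anonymize_text(text: str) -> str:
    # Custom operator: reverse each detected value.
    def operate(text: str) -> str:
        return text[::-1]

    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "EMAIL_ADDRESS": OperatorConfig(
                "custom", {"lambda": operate}
            )
        },
    )
    return anonymized_results.text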
This might be a cleaner solution, but I'm not 100% sure it works on PySpark with pandas_udf:
def operate(text: str) -> str:
    return text[::-1]

def anonymize_text(text: str) -> str:
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "EMAIL_ADDRESS": OperatorConfig(
                "custom", {"lambda": operate}
            )
        },
    )
    return anonymized_results.text
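For readers unfamiliar with the custom operator, here is a minimal pure-Python sketch of the idea: a user-supplied function is applied to each detected span and the result is spliced back into the text. This is a simplified stand-in, not presidio's implementation. The names `apply_custom_operator`, `Span`, and `reverse` are hypothetical, the span is hard-coded rather than produced by an analyzer, and spans are assumed not to overlap.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for analyzer output: (start, end) character offsets.
Span = Tuple[int, int]


def apply_custom_operator(
    text: str, spans: List[Span], fn: Callable[[str], str]
) -> str:
    """Replace every span of `text` with fn(<span text>)."""
    # Work from right to left so earlier offsets stay valid even when
    # fn returns a string of a different length.
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + fn(text[start:end]) + text[end:]
    return text


def reverse(value: str) -> str:
    """Same transformation as `operate` in the examples above."""
    return value[::-1]


sample = "Contact me at bob@example.com today"
email = "bob@example.com"
email_start = sample.index(email)
email_span = (email_start, email_start + len(email))

result = apply_custom_operator(sample, [email_span], reverse)
print(result)  # Contact me at moc.elpmaxe@bob today
```

Because `reverse` is an ordinary function, it can live at module level (as in the second example above) or be nested (as in the first); which of the two serializes more reliably to Spark workers is exactly the part that is untested here.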
EDIT:
A hack to add the operator to the default list is the following. Note that this is a hack and should be tested:
from presidio_anonymizer.operators import Operator, OperatorType

class ReverseAnonymizer(Operator):
    def operate(self, text: str, params: dict) -> str:
        return text[::-1]

    def validate(self, params: dict) -> None:
        pass

    def operator_name(self) -> str:
        return "reverse"

    def operator_type(self) -> OperatorType:
        return OperatorType.Anonymize

anonymizer.operators_factory._anonymizers["reverse"] = ReverseAnonymizer

def anonymize_text(text: str) -> str:
    analyzer_results = analyzer.analyze(text=text, language="en")
    anonymized_results = anonymizer.anonymize(
        text=text,
        analyzer_results=analyzer_results,
        operators={
            "DEFAULT": OperatorConfig("reverse")
        },
    )
    return anonymized_results.text
cc @shiranr in case she has anything to add.
Hey @omri374, thanks for these examples. Indeed, the lambda path is the easiest one, but I was curious how it should be done properly (I assumed I was missing something or misread the docs).
The source of confusion is that an AnalyzerEngine has a .registry.add_recognizer() method, which allows supplying custom recognizers, but equivalent functionality in AnonymizerEngine is missing. If there is a feature request somewhere for this functionality, I'd gladly +1 it :)
I have yet to try out the 'hack' on Spark, but locally I encountered an issue: in my case, the ._anonymizers dictionary was not initialized at the moment of assignment, so I had to slightly modify the example:
anonymizer.operators_factory._anonymizers = {'reverse': ReverseAnonymizer()}
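The behaviour described above (item assignment failing because the dictionary does not exist yet) can be reproduced with a toy model of a lazily initialized registry. This is only a sketch under assumptions: `ToyOperatorsFactory`, `ToyReplace`, and `ToyReverse` are hypothetical names, and the class does not claim to mirror presidio's actual OperatorsFactory.

```python
from typing import Dict, Optional


class ToyReplace:
    """Hypothetical built-in operator."""

    def operate(self, text: str) -> str:
        return "<REDACTED>"


class ToyReverse:
    """Hypothetical custom operator."""

    def operate(self, text: str) -> str:
        return text[::-1]


class ToyOperatorsFactory:
    """Toy model of a factory whose registry is created on first use."""

    # The registry stays None until someone asks for it.
    _anonymizers: Optional[Dict[str, type]] = None

    @classmethod
    def get_anonymizers(cls) -> Dict[str, type]:
        if cls._anonymizers is None:
            cls._anonymizers = {"replace": ToyReplace}
        return cls._anonymizers


# Patching before the first lookup fails: None does not support
# item assignment, so a TypeError is raised.
try:
    ToyOperatorsFactory._anonymizers["reverse"] = ToyReverse
    patched_early = True
except TypeError:
    patched_early = False

# Triggering the lazy initialization first keeps the built-in entries
# and adds the custom one next to them.
ToyOperatorsFactory.get_anonymizers()["reverse"] = ToyReverse

registered = sorted(ToyOperatorsFactory.get_anonymizers())
reversed_text = ToyOperatorsFactory.get_anonymizers()["reverse"]().operate("abc")
print(patched_early, registered, reversed_text)
# False ['replace', 'reverse'] cba
```

In this toy model, assigning a brand-new dictionary instead would also avoid the error, but it would replace the built-in entries rather than extend them. Whether the same holds for presidio itself should be checked against its source.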
Adding custom operators into the operators factory is still an open issue. Any community contributions are welcome.
Should be solved in #1284
Hello, I am investigating the application of presidio to some of our internal data processing. I am interested in using it in Spark, and while the basic example works more or less reliably, I encountered an issue while applying a custom anonymizing Operator when executing anonymization in Spark.
Here is a sample operator I have created
And this is how I use it
And a real example
In Spark, however, I get this:
I suspect this is related to how OperatorsFactory works, which basically indexes all of the types derived from Operator, but only when it is called, so when I broadcast analyzer and anonymizer, this information is not captured. Is there a way I can embed operators' metadata into anonymizer prior to broadcasting? Or maybe I can explicitly broadcast an instance of the operator and then use it in the .anonymize() call?
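The suspicion above can be illustrated with a toy model. An index built from a base class's subclasses only contains classes that have been defined in the interpreter where the index is built, so a class that exists on the driver but was never defined on a worker would not be found there. The sketch below is an analogy under assumptions, not presidio's code: `ToyOperator`, `ToyRedact`, `ToyReverse`, and `discover` are hypothetical names, and "before the class is defined" stands in for "a worker process that never saw the class".

```python
from typing import Dict, Type


class ToyOperator:
    """Hypothetical base class standing in for an operator base type."""

    name = "base"

    def operate(self, text: str) -> str:
        raise NotImplementedError


class ToyRedact(ToyOperator):
    """Hypothetical built-in operator, always defined."""

    name = "redact"

    def operate(self, text: str) -> str:
        return ""


def discover() -> Dict[str, Type[ToyOperator]]:
    """Index every direct subclass known to this interpreter right now."""
    return {sub.name: sub for sub in ToyOperator.__subclasses__()}


# Situation 1: the custom class has not been defined in this interpreter
# (analogous to a worker that never imported it).
visible_before = sorted(discover())


class ToyReverse(ToyOperator):
    """Hypothetical custom operator, defined later."""

    name = "reverse"

    def operate(self, text: str) -> str:
        return text[::-1]


# Situation 2: once the class exists locally, discovery picks it up.
visible_after = sorted(discover())

print(visible_before)  # ['redact']
print(visible_after)   # ['redact', 'reverse']
```

If this is indeed the mechanism, broadcasting an engine instance would not help by itself, because the subclass index is a property of each interpreter rather than of the broadcast object. Passing the operator's function through the custom lambda path sidesteps the lookup entirely.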