Hi @AkshayDube11, we'll soon post a sample on how Presidio can be applied on structured data, such as Pandas DataFrames.
Hi @AkshayDube11, while still in PR, you can already review the sample on #744
Closing for now, please re-open if you have any questions.
Dear @omri374 & Team,
I found the batch anonymization approach very useful for anonymizing DataFrames. I have a few observations, in the form of questions/doubts, listed below:
Error at CodeLine: analyzer_results = batch_analyzer.analyze_dict(df_dict, language="xx")
The output file has a different shape as compared to the original file:
CodeLine: scrubbed_df.to_excel('Anonymized_BatchProcessed.xlsx', index=False, header=True)
Looking forward to the solutions.
Thanks @AkshayDube11, we'll look into this.
Hi @omri374, could you help me with solutions for the issues I mentioned above, or suggest any alternatives to work around them?
Hey @AkshayDube11, for the issue with en_core_web_sm, it's probably because Presidio couldn't find the model. Have you installed it in the environment where Presidio runs? python -m spacy download en_core_web_sm
If that's not the case, could you please add more details (error message, etc.)?
Regarding (2), I'm not sure what the reason for this is. Could you please share some more information?
The issue I mentioned in point #1: I downloaded the multilingual spaCy model ("python -m spacy download xx_ent_wiki_sm") to anonymize my multi-language comments, using the configuration code below. Although the configuration completes properly, the BatchAnalyzer then throws an error saying that the English spaCy model ("python -m spacy download en_core_web_sm") must be downloaded before the BatchAnalyzer engine can proceed.
Note: I chose the multi-lingual spacy NLP engine from - https://spacy.io/usage/models#download
Multilingual NLP engine configuration code:
from presidio_analyzer import AnalyzerEngine, RecognizerRegistry, PatternRecognizer
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities.engine import OperatorConfig

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "xx", "model_name": "xx_ent_wiki_sm"}]
}

provider = NlpEngineProvider(nlp_configuration=configuration)
multi_nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=multi_nlp_engine,
    supported_languages=["xx"]
)
Error at CodeLine: analyzer_results = batch_analyzer.analyze_dict(df_dict, language="xx")
The issue with point #2: suppose my DataFrame has a shape of 1 column and 1000 rows, with free text/sentences as row entries. After anonymization, the shape of the DataFrame changes as soon as I try to export it to the desired format (Excel, CSV, etc.). I am trying to find a solution for it and will post back as soon as I get the desired result.
Also, I need your suggestions to deal with point #3.
For point 1, Presidio requires an explicit language to be defined for each recognizer (or nlp model). The xx model cannot be used as is, since Presidio is expecting each model to define which languages it can support.
It is possible to have the same model loaded multiple times, but the language should be defined per value, so if you have a data frame with multiple rows, each containing text in a different language, this could be a little tricky.
Here's a short experiment showing how you could call the BatchAnalyzerEngine with multiple languages, but as I mentioned you'd have to set the language for each cell individually which might be a limiting factor.
Set up sample data:
import pandas as pd

columns = ["name phrase", "phone number phrase", "integer", "boolean"]
sample_data = [
('Morris likes this','Please call 212-555-1234 after 2pm', 1, True),
('You should talk to Mike','his number is 978-428-7111', 2, False),
('Mary had a little startup','Phone number: 202-342-1234', 3, False),
('Italo Ferreira de Brasil hace historia como el primer medallista de oro olímpico de surf', 'Phone number: 202-342-1234', 4, False)
]
df = pd.DataFrame(sample_data, columns=columns)
df_dict = df.to_dict(orient="list")  # dict form expected by analyze_dict below
Run Presidio in batch mode using two languages. Note that the xx_ent_wiki_sm is going to be loaded twice in this case.
#!python -m spacy download xx_ent_wiki_sm
configuration = {
"nlp_engine_name": "spacy",
"models": [{"lang_code": "en", "model_name": "xx_ent_wiki_sm"},
{"lang_code": "es", "model_name": "xx_ent_wiki_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
nlp_engine.nlp  # inspect the loaded spaCy pipelines (one model per language code)
batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en", "es"])
batch_anonymizer = BatchAnonymizerEngine()
analyzer_results = batch_analyzer.analyze_dict(df_dict, language="es")
analyzer_results = list(analyzer_results)
analyzer_results
One naive option is to split the data frame into per-language subsets, and then merge it again after de-identification.
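If a per-row language label is available, a minimal sketch of that split-and-merge approach could look like the one below. This is an illustration only, not part of the sample: the lang column and the anonymize_by_language helper are hypothetical, and it reuses the batch_analyzer / batch_anonymizer objects configured above (each language value must be one of the analyzer's supported_languages).
# Sketch only: split by a hypothetical "lang" column, de-identify each subset
# with the matching language, then merge back in the original row order.
import pandas as pd

def anonymize_by_language(df, text_columns, lang_column="lang"):
    anonymized_parts = []
    for lang, part in df.groupby(lang_column):
        part_dict = part[text_columns].to_dict(orient="list")
        results = batch_analyzer.analyze_dict(part_dict, language=lang)
        scrubbed = batch_anonymizer.anonymize_dict(results)
        scrubbed_part = part.copy()
        for col in text_columns:
            scrubbed_part[col] = scrubbed[col]
        anonymized_parts.append(scrubbed_part)
    return pd.concat(anonymized_parts).sort_index()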
Another naive option is to define the xx_ent_wiki_sm model as serving English, and then all the input would go through it:
#!python -m spacy download xx_ent_wiki_sm
configuration = {
"nlp_engine_name": "spacy",
"models": [{"lang_code": "en", "model_name": "xx_ent_wiki_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine)
batch_anonymizer = BatchAnonymizerEngine()
analyzer_results = batch_analyzer.analyze_dict(df_dict, language="en",return_decision_process=True)
analyzer_results = list(analyzer_results)
analyzer_results
In reality all input would go through it and entities in other languages could be detected as well.
For point (3), for now I would suggest the same approach you've taken, of running Presidio on a subset of the columns and then concatenating it to the other non-sensitive columns.
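As an illustration of that column-subset approach (column names taken from the sample data above; a sketch, not a definitive recipe):
# Sketch: de-identify only the free-text columns, keep the rest untouched,
# then concatenate back so the overall shape is preserved.
import pandas as pd

sensitive_cols = ["name phrase", "phone number phrase"]          # assumed to hold PII
other_cols = [c for c in df.columns if c not in sensitive_cols]  # "integer", "boolean"

results = batch_analyzer.analyze_dict(df[sensitive_cols].to_dict(orient="list"), language="en")
scrubbed = batch_anonymizer.anonymize_dict(results)

scrubbed_df = pd.concat([pd.DataFrame(scrubbed), df[other_cols]], axis=1)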
Dear @omri374,
Please share your view on the observations I made after getting batch anonymization working on my DataFrames, which have comments/sentences in multiple languages as row entries. I followed the approach you suggested above: defining my model to serve the English language and passing the other multilingual sentences in my DataFrame through it.
Observations:
Many words surrounding the actual PII were identified and anonymized by the BatchAnalyzer and BatchAnonymizer engines respectively, making the anonymization accuracy very low. For example, words like "Maybe", "I think", etc. from the sentences were identified and anonymized.
I changed the **kwargs (arguments) passed to the BatchAnalyzer and BatchAnonymizer classes to customize the identification to my needs. For example, I only needed names, emails, and locations to be identified, so I changed the code as shown below. Please let me know if I should change my existing code: it ran without errors, but the overall accuracy of identifying and anonymizing PII in my multilingual DataFrames is very low and incorrect, as mentioned in the first observation.
Codelines for BatchAnalyzer:
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Union
import collections.abc

from presidio_analyzer import AnalyzerEngine, RecognizerResult

def connect(**kwargs):
    print(kwargs)

# entities and score threshold passed to the analyzer as **kwargs
config = {'entities': ["EMAIL_ADDRESS", "PERSON", "LOCATION"],
          'score_threshold': 0.80}
connect(**config)
@dataclass
class DictAnalyzerResult:
    """Hold the analyzer results per value or list of values."""
    key: str
    value: Union[str, List[str]]
    recognizer_results: Union[List[RecognizerResult], List[List[RecognizerResult]]]


class BatchAnalyzerEngine(AnalyzerEngine):

    def analyze_list(self, list_of_texts: Iterable[str], **kwargs) -> List[List[RecognizerResult]]:
        """Analyze an iterable of texts; non-string items get an empty result."""
        list_results = []
        for text in list_of_texts:
            results = self.analyze(text=text, **kwargs) if isinstance(text, str) else []
            list_results.append(results)
        return list_results

    def analyze_dict(
        self, input_dict: Dict[str, Union[object, Iterable[object]]], **kwargs
    ) -> Iterator[DictAnalyzerResult]:
        """Analyze each value (a string or an iterable of strings) in the dict."""
        for key, value in input_dict.items():
            if not value:
                results = []
            else:
                if isinstance(value, str):
                    results: List[RecognizerResult] = self.analyze(text=value, **kwargs)
                elif isinstance(value, collections.abc.Iterable):
                    results: List[List[RecognizerResult]] = self.analyze_list(
                        list_of_texts=value, **kwargs)
                else:
                    results = []
            yield DictAnalyzerResult(key=key, value=value, recognizer_results=results)
Codelines for BatchAnonymizer:
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities.engine import OperatorConfig

def connect(**kwargs):
    print(kwargs)

# operator configuration passed to the anonymizer as **kwargs
configAnon = {'operators': {"PERSON": OperatorConfig("replace", {"new_value": "<PER>"}),
                            "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
                            "LOCATION": OperatorConfig("replace", {"new_value": "<LOCATION>"})}}
connect(**configAnon)
class BatchAnonymizerEngine(AnonymizerEngine):

    def anonymize_list(
        self,
        texts: List[str],
        recognizer_results_list: List[List[RecognizerResult]],
        **kwargs
    ) -> List[str]:  # returns the anonymized texts
        """Anonymize each text in the list; non-string items are returned unchanged."""
        return_list = []
        for text, recognizer_results in zip(texts, recognizer_results_list):
            if isinstance(text, str):
                res = self.anonymize(text=text, analyzer_results=recognizer_results, **kwargs)
                return_list.append(res.text)
            else:
                return_list.append(text)
        return return_list

    def anonymize_dict(self, analyzer_results: Iterator[DictAnalyzerResult], **kwargs) -> Dict[str, str]:
        """Anonymize the value (or list of values) held in each DictAnalyzerResult."""
        return_dict = {}
        for result in analyzer_results:
            if isinstance(result.value, str):
                resp = self.anonymize(text=result.value, analyzer_results=result.recognizer_results, **kwargs)
                return_dict[result.key] = resp.text
            elif isinstance(result.value, collections.abc.Iterable):
                anonymize_responses = self.anonymize_list(
                    texts=result.value,
                    recognizer_results_list=result.recognizer_results,
                    **kwargs)
                return_dict[result.key] = anonymize_responses
            else:
                return_dict[result.key] = result.value
        return return_dict
Codelines for Model Configuration and Analyzer and Anonymizer Engine:
configuration = {
"nlp_engine_name": "spacy",
"models": [{"lang_code": "en", "model_name": "xx_ent_wiki_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine)
batch_anonymizer = BatchAnonymizerEngine()
analyzer_results = batch_analyzer.analyze_dict(df_dict, language="en",return_decision_process=True, **config)
analyzer_results = list(analyzer_results)
analyzer_results
anonymizer_results = batch_anonymizer.anonymize_dict(analyzer_results, **configAnon)
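As a side note, here is a minimal sketch of getting the anonymized output back into a DataFrame with the original shape (assuming df_dict was built with df.to_dict(orient="list"), so every anonymized value is a list of the original length):
# anonymize_dict returns {column_name: list_of_anonymized_strings},
# so the frame below should have one row per original row.
import pandas as pd

scrubbed_df = pd.DataFrame(anonymizer_results)
scrubbed_df.to_excel('Anonymized_BatchProcessed.xlsx', index=False, header=True)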
Hi @omri374 & Team,
Please share your thoughts and suggestions on the details above regarding the low accuracy and incorrect detection of PII entities by the multilingual NLP model used with the BatchAnalyzer engine. Looking forward to it.
Hi @AkshayDube11, is the code you provided in your issue for BatchAnalyzerEngine and BatchAnonymizerEngine identical to the one in our sample?
Could you give some examples of mistakes the model makes? It could be that the model itself isn't good enough, or that additional PII recognizers are needed to support it. The BatchAnalyzerEngine loads all the default PII recognizers. Maybe you don't need all of them?
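One possible direction (sketched here only as an illustration, not taken from the sample) is to register just the relevant recognizers instead of loading all the predefined ones:
# Sketch: a registry holding only the recognizers of interest.
from presidio_analyzer import RecognizerRegistry
from presidio_analyzer.predefined_recognizers import EmailRecognizer, SpacyRecognizer

registry = RecognizerRegistry()
registry.add_recognizer(EmailRecognizer(supported_language="en"))
registry.add_recognizer(SpacyRecognizer(supported_language="en"))  # PERSON, LOCATION, etc.

batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine,
                                     registry=registry,
                                     supported_languages=["en"])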
Dear @omri374 ,
Yes, the code for the BatchAnalyzer and BatchAnonymizer classes is the same as in the sample, with the only addition being the **kwargs I prefer to pass to each of them.
I won't be able to share the data where the BatchAnalyzerEngine wrongly identified PII words. For reference, in multilingual sentences it picks up words like "vielleicht" (maybe) and "Ich glaube" (I think) to be anonymized. This misidentification happens with many such non-PII words, resulting in low anonymization accuracy. Looking forward to hearing your thoughts on where I am going wrong in setting up a multilingual anonymization model for DataFrames.
Is it possible to know the language of each record? If yes, you could set up Presidio to use a different language model for each language. If there isn't a way, maybe a language identification model could help? My intuition is that PII detection might not be very accurate with multilingual models, but it's hard to say without proper analysis.
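Sketching the language-identification idea (the langdetect package is used here purely as an example; any language-ID model would do, and the column name follows the earlier sample data):
# Sketch: detect a language per row, then group by it and analyze each group
# with a matching model, as in the split-and-merge sketch earlier in the thread.
from langdetect import detect

df["lang"] = df["name phrase"].apply(detect)   # e.g. "en", "es", ...
print(df[["name phrase", "lang"]])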
Dear @omri374,
As of now, my analysis deals with text in the following languages: Czech, Slovak, Hungarian, German, English, and Swedish. What are your thoughts on using "xx_sent_ud_sm", which is more accurate than the "xx_ent_wiki_sm" multilingual spaCy model, as mentioned on https://spacy.io/usage/models#download?
If it is more accurate, please share a test case showing how to implement Presidio with this spaCy model, if possible.
Hi @AkshayDube11, the implementation of a new spaCy model is similar to what you already have in the code:
configuration = {
"nlp_engine_name": "spacy",
"models": [{"lang_code": "en", "model_name": "xx_sent_ud_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine)
batch_anonymizer = BatchAnonymizerEngine()
Hi @omri374, thank you for the sample code. I will implement it and report back with the findings.
Also, please share your thoughts on the accuracy concern, i.e., the model identifying the wrong PII entities in the languages I mentioned earlier in the thread.
At a very high level, the process for improving the PII detection rate with Presidio is the following:
Hope this helps!
Closing for now, feel free to re-open if you have any more questions or issues.
Hi Everyone,
Note: I am writing on this thread as my queries are in line with it.
I have two queries in continuation of the discussion above:
For better reference to the questions above, here is my code snippet:
Step 1: I converted my Excel-format DataFrame column to a string representation of a list. Code: df1 = str(df.to_dict(orient="list"))
Step 2: I tried using a multilingual model for my use case and got it working, but I need a custom score_threshold setting, as some person names are being missed or the wrong names are being anonymized by the Presidio anonymizer.
Analyzer code:
analyzer_results = analyzer.analyze(text=df1, language='xx', entities=["EMAIL_ADDRESS", "PERSON", "LOCATION"], score_threshold=0.90)
print(analyzer_results)
Anonymizer code:
anonymizer = AnonymizerEngine()
anonymized_results = anonymizer.anonymize(
    text=df1,
    analyzer_results=analyzer_results,
    operators={"PERSON": OperatorConfig("replace", {"new_value": "<PER>"}),
               "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
               "LOCATION": OperatorConfig("replace", {"new_value": "<LOCATION>"})})
Originally posted by @AkshayDube11 in https://github.com/microsoft/presidio/issues/600#issuecomment-876590760