microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License
3.63k stars · 554 forks

Need Guidance! How to make Presidio work with DataFrames where multiple columns contain free text (with PII). #742

Closed AkshayDube11 closed 3 years ago

AkshayDube11 commented 3 years ago

Hi Everyone,

Note: I am writing on this thread as my queries are in line with it.

I have two queries in continuation with the above-mentioned discussion:

  1. How can I use Presidio with DataFrame columns that contain free text, and with a DataFrame as a whole? This is my primary objective.
  2. Can we set score_threshold values for different recognizers? For example, how can I have separate score_threshold values for entities such as PERSON, LOCATION, etc. while pushing the free text through the analyzer step? (A filtering sketch follows my code below.)

For better reference to the questions above, my code snippet is below:

Step 1: I converted my Excel-format DataFrame columns to lists and cast the result to a single string:

df1 = str(df.to_dict(orient="list"))

Step 2: I tried a multilingual model for my use case and got it working, but I need a custom score_threshold setting because some person names are being missed and some wrong names are being anonymized by the Presidio anonymizer.

Analyzer code:

analyzer_results = analyzer.analyze(text=df1, language='xx',
                                    entities=["EMAIL_ADDRESS", "PERSON", "LOCATION"],
                                    score_threshold=0.90)
print(analyzer_results)

Anonymizer code:

anonymizer = AnonymizerEngine()

anonymized_results = anonymizer.anonymize(
    text=df1,
    analyzer_results=analyzer_results,
    operators={"PERSON": OperatorConfig("replace", {"new_value": ""}),
               "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": ""}),
               "LOCATION": OperatorConfig("replace", {"new_value": ""})})

Originally posted by @AkshayDube11 in https://github.com/microsoft/presidio/issues/600#issuecomment-876590760

omri374 commented 3 years ago

Hi @AkshayDube11, we'll soon post a sample showing how Presidio can be applied to structured data, such as Pandas DataFrames.

omri374 commented 3 years ago

Hi @AkshayDube11, while it's still in a PR, you can already review the sample in #744

omri374 commented 3 years ago

Link to sample: https://github.com/microsoft/presidio/blob/main/docs/samples/python/batch_processing.ipynb

omri374 commented 3 years ago

Closing for now, please re-open if you have any questions.

AkshayDube11 commented 3 years ago

Dear @omri374 & Team,

I found the batch anonymization approach very useful for anonymizing DataFrames, with some observations in the form of queries/doubts, as mentioned below:

  1. I tried to configure the multilingual NLP engine "xx_ent_wiki_sm". It got configured, but the BatchAnalyzer engine throws an error asking me to install "en_core_web_sm" in order to anonymize my DataFrame.

Error at CodeLine: analyzer_results = batch_analyzer.analyze_dict(df_dict, language="xx")

  2. When I try to write the anonymized DataFrame to the desired file format, the shape of the output (the number of rows in the respective anonymized columns) changes if a particular row contains a long sentence.

The output file has a different shape as compared to the original file:

CodeLine: scrubbed_df.to_excel('Anonymized_BatchProcessed.xlsx', index=False, header=True)

  3. I am finding it difficult to anonymize only a few columns of the entire DataFrame. What I did was take a subset of the original DataFrame, anonymize the columns I wanted within that subset, and then update the original DataFrame with the anonymized columns of the subset. Please suggest a more optimal way to accomplish this, if there is one.

Looking forward to the solutions.

omri374 commented 3 years ago

Thanks @AkshayDube11, we'll look into this.

AkshayDube11 commented 3 years ago

> Thanks @AkshayDube11, we'll look into this.

Hi @omri374, could you help me with solutions for the respective issues I mentioned above, or with any alternatives to work around them?

omri374 commented 3 years ago

Hey @AkshayDube11, for the issue with en_core_web_sm, it's probably because spaCy couldn't find the model. Have you installed it in the environment where Presidio runs? (python -m spacy download en_core_web_sm) If that's not the case, could you please add more details (error message, etc.)?

Regarding (2), I'm not sure what the reason for this is. Could you please share some more information?

AkshayDube11 commented 3 years ago

> Hey @AkshayDube11, for the issue with en_core_web_sm, it's probably because spaCy couldn't find the model. Have you installed it in the environment where Presidio runs? (python -m spacy download en_core_web_sm) If that's not the case, could you please add more details (error message, etc.)?
>
> Regarding (2), I'm not sure what the reason for this is. Could you please share some more information?

The issue I mentioned in point #1: I downloaded the multilingual spaCy model with "python -m spacy download xx_ent_wiki_sm" to anonymize my multi-language comments, and configured it using the code below. Although the configuration succeeds, the BatchAnalyzer throws an error saying that the English spaCy model ("python -m spacy download en_core_web_sm") must be downloaded before the BatchAnalyzer engine can proceed.

Note: I chose the multilingual spaCy model from https://spacy.io/usage/models#download

Multilingual NLP engine configuration code:

from presidio_analyzer import AnalyzerEngine, RecognizerRegistry, PatternRecognizer
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities.engine import OperatorConfig

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "xx", "model_name": "xx_ent_wiki_sm"}]
}

provider = NlpEngineProvider(nlp_configuration=configuration)
multi_nlp_engine = provider.create_engine()

analyzer = AnalyzerEngine(
    nlp_engine=multi_nlp_engine,
    supported_languages=["xx"]
)

Error at CodeLine: analyzer_results = batch_analyzer.analyze_dict(df_dict, language="xx")

The issue with point #2: suppose my DataFrame has a shape of 1 column and 1000 rows, with free text/sentences as row entries. After anonymization, the shape of the DataFrame changes as soon as I try to export it in the desired format (Excel, CSV, etc.). I am trying to find a solution for this and will post back as soon as I get the desired result.

Also, I need your suggestions to deal with point #3.

omri374 commented 3 years ago

For point 1, Presidio requires an explicit language to be defined for each recognizer (or NLP model). The xx model cannot be used as is, since Presidio expects each model to declare which languages it supports.

It is possible to have the same model loaded multiple times, but the language has to be defined per value, so if you have a DataFrame with multiple rows, each containing text in a different language, this can be a little tricky.

Here's a short experiment showing how you could call the BatchAnalyzerEngine with multiple languages, but as I mentioned, you'd have to set the language for each cell individually, which might be a limiting factor.

Set up sample data:

import pandas as pd

columns = ["name phrase", "phone number phrase", "integer", "boolean"]
sample_data = [
        ('Morris likes this','Please call 212-555-1234 after 2pm', 1, True),
        ('You should talk to Mike','his number is 978-428-7111', 2, False),
        ('Mary had a little startup','Phone number: 202-342-1234', 3, False),
        ('Italo Ferreira de Brasil hace historia como el primer medallista de oro olímpico de surf', 'Phone number: 202-342-1234', 4, False)
]
df  = pd.DataFrame(sample_data,columns=columns)

Run Presidio in batch mode using two languages. Note that the xx_ent_wiki_sm is going to be loaded twice in this case.

#!python -m spacy download xx_ent_wiki_sm

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "xx_ent_wiki_sm"},
              {"lang_code": "es", "model_name": "xx_ent_wiki_sm"}],
}

nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
nlp_engine.nlp

batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en", "es"])
batch_anonymizer = BatchAnonymizerEngine()

# df_dict: the sample DataFrame converted to a dict of lists (same conversion as in the batch processing sample)
df_dict = df.to_dict(orient="list")

analyzer_results = batch_analyzer.analyze_dict(df_dict, language="es")
analyzer_results = list(analyzer_results)
analyzer_results

One naive option is to split the DataFrame into per-language subsets of rows, and then merge them again after de-identification. Another naive option is to define the xx_ent_wiki_sm model as serving English, so that all the input goes through it:

#!python -m spacy download xx_ent_wiki_sm

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "xx_ent_wiki_sm"}],
}

nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine)
batch_anonymizer = BatchAnonymizerEngine()

analyzer_results = batch_analyzer.analyze_dict(df_dict, language="en",return_decision_process=True)
analyzer_results = list(analyzer_results)
analyzer_results

In practice, all input would go through that single model, and entities in other languages could be detected as well.

omri374 commented 3 years ago

For point (3), for now I would suggest the same approach you've taken, of running Presidio on a subset of the columns and then concatenating it to the other non-sensitive columns.
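
A minimal sketch of that column-subset approach, reusing the batch_analyzer and batch_anonymizer objects from above; the DataFrame df and the column names "comments" and "email" are placeholder assumptions.

import pandas as pd

# Free-text columns to de-identify; all other columns are left untouched.
text_columns = ["comments", "email"]

# Analyze and anonymize only the selected columns.
subset_dict = df[text_columns].to_dict(orient="list")
analyzer_results = batch_analyzer.analyze_dict(subset_dict, language="en")
anonymized_dict = batch_anonymizer.anonymize_dict(analyzer_results)

# Write the anonymized values back into a copy of the original DataFrame.
scrubbed_df = df.copy()
for col in text_columns:
    scrubbed_df[col] = anonymized_dict[col]

scrubbed_df.to_excel("Anonymized_BatchProcessed.xlsx", index=False, header=True)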

AkshayDube11 commented 3 years ago

Dear @omri374,

Please share your view on the observations I made after getting batch anonymization working on my DataFrames, which have comments/sentences in multiple languages as row entries. I followed the approach you suggested above: defining my model to serve the English language and passing the other multilingual sentences in my DataFrame through it.

Observations:

  1. I found that many words surrounding the actual PII words are identified and anonymized by the BatchAnalyzer and BatchAnonymizer engines respectively, making the model yield very low anonymization accuracy. For example, words like "Maybe", "I think", etc. from the sentences were identified and anonymized.

  2. I made changes to the **kwargs (arguments) passed to the BatchAnalyzer and BatchAnonymizer classes to customize identification to my needs. For example, I only needed names, emails, and locations to be identified, so I changed the code as shown below. Do let me know if I should change my existing code; it ran without errors, but the overall accuracy of the model in identifying and anonymizing PII from my multilingual DataFrames is very low and incorrect, as mentioned in the first observation.

Codelines for BatchAnalyzer:

# Imports required by the definitions below.
import collections.abc
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Union

from presidio_analyzer import AnalyzerEngine, RecognizerResult

def connect(**kwargs):
    print(kwargs)

config = {'entities': ["EMAIL_ADDRESS", "PERSON", "LOCATION"],
          'score_threshold': 0.80}

connect(**config)

@dataclass
class DictAnalyzerResult:

    key: str
    value: Union[str, List[str]]
    recognizer_results: Union[List[RecognizerResult], List[List[RecognizerResult]]]

class BatchAnalyzerEngine(AnalyzerEngine):

    def analyze_list(self, list_of_texts: Iterable[str], **kwargs) -> List[List[RecognizerResult]]:

        list_results = []
        for text in list_of_texts:
            results = self.analyze(text=text, **kwargs) if isinstance(text, str) else []
            list_results.append(results)
        return list_results

    def analyze_dict(
     self, input_dict: Dict[str, Union[object, Iterable[object]]], **kwargs) -> Iterator[DictAnalyzerResult]:

        for key, value in input_dict.items():
            if not value:
                results = []
            else:
                if isinstance(value, str):
                    results: List[RecognizerResult] = self.analyze(text=value, **kwargs)
                elif isinstance(value, collections.abc.Iterable):
                    results: List[List[RecognizerResult]] = self.analyze_list(
                                list_of_texts=value, 
                                **kwargs)
                else:
                    results = []
            yield DictAnalyzerResult(key=key, value=value, recognizer_results=results)

Codelines for BatchAnonymizer:

from presidio_anonymizer.entities.engine import OperatorConfig
def connect(**kwargs):
    print(kwargs)

configAnon = {'operators': {"PERSON": OperatorConfig("replace", {"new_value": "<PER>"}),
               "EMAIL_ADDRESS": OperatorConfig("replace", {"new_value": "<EMAIL>"}),
               "LOCATION": OperatorConfig("replace", {"new_value": "<LOCATION>"})}}

connect(**configAnon)

Presidio Anonymizer

# Imports required below; the EngineResult import path may differ across presidio-anonymizer versions.
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import EngineResult

class BatchAnonymizerEngine(AnonymizerEngine):

    def anonymize_list(
        self, 
        texts:List[str], 
        recognizer_results_list: List[List[RecognizerResult]], 
        **kwargs
    ) -> List[EngineResult]:

        return_list = []
        for text, recognizer_results in zip(texts, recognizer_results_list):
            if isinstance(text,str):
                res = self.anonymize(text=text,analyzer_results=recognizer_results,**kwargs)
                return_list.append(res.text)
            else:
                return_list.append(text)

        return return_list

    def anonymize_dict(self, analyzer_results: Iterator[DictAnalyzerResult],**kwargs) -> Dict[str, str]:

        return_dict = {}
        for result in analyzer_results:
            if isinstance(result.value, str):
                resp = self.anonymize(text=result.value, analyzer_results=result.recognizer_results, **kwargs)
                return_dict[result.key] = resp.text
            elif isinstance(result.value, collections.abc.Iterable):
                anonymize_responses = self.anonymize_list(texts=result.value,
                                                          recognizer_results_list=result.recognizer_results,
                                                          **kwargs)
                return_dict[result.key] = anonymize_responses
            else:
                return_dict[result.key] = result.value

        return return_dict

Codelines for Model Configuration and Analyzer and Anonymizer Engine:

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "xx_ent_wiki_sm"}],
}

nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine)
batch_anonymizer = BatchAnonymizerEngine()

analyzer_results = batch_analyzer.analyze_dict(df_dict, language="en",return_decision_process=True, **config)
analyzer_results = list(analyzer_results)
analyzer_results

anonymizer_results = batch_anonymizer.anonymize_dict(analyzer_results, **configAnon)

AkshayDube11 commented 3 years ago

Hi @omri374 & Team,

Please share your thoughts and suggestions on the details mentioned above, related to the low accuracy and incorrect detection of PII entities by the multilingual NLP model used with the BatchAnalyzer engine. Looking forward to it.

omri374 commented 3 years ago

Hi @AkshayDube11, is the code you provided in your issue for BatchAnalyzerEngine and BatchAnonymizerEngine identical to the one in our sample?

Could you give some examples of mistakes the model makes? It could be that the model itself isn't good enough, or that additional PII recognizers are needed to support it. The BatchAnalyzerEngine loads all the default PII recognizers; maybe you don't need all of them?
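
If not all of the default recognizers are needed, here's a minimal sketch of trimming the registry before building the analyzer. The recognizer name below is only an example (check registry.get_recognizers() for the names available in your presidio-analyzer version), and nlp_engine, df_dict, and BatchAnalyzerEngine are the objects defined earlier in this thread.

from presidio_analyzer import RecognizerRegistry

# Load the predefined recognizers and drop the ones that are not needed.
registry = RecognizerRegistry()
registry.load_predefined_recognizers(nlp_engine=nlp_engine, languages=["en"])
registry.remove_recognizer("CreditCardRecognizer")  # example of an unneeded recognizer

batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine, registry=registry)

# Alternatively, keep the full registry but only return selected entity types:
analyzer_results = batch_analyzer.analyze_dict(
    df_dict, language="en", entities=["PERSON", "EMAIL_ADDRESS", "LOCATION"]
)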

AkshayDube11 commented 3 years ago

Dear @omri374 ,

Yes, the code for the BatchAnalyzer and BatchAnonymizer classes is the same as in the sample, with just the addition of defining what I prefer to pass as **kwargs to each of them.

I won't be able to share the data in which the BatchAnalyzerEngine has wrongly identified PII words. For reference, in multilingual sentences it picks up words like "vielleicht" ("Maybe") and "Ich glaube" ("I think") to be anonymized. This wrong identification happens with many such non-PII words, resulting in lower anonymization accuracy. Looking forward to hearing your thoughts on where I am going wrong in setting up a multilingual anonymization model for DataFrames.

omri374 commented 3 years ago

Is it possible to know the language of each record? If yes, you could set up Presidio to use a different language model for each language. If there isn't a way, maybe a language identification model could help. My intuition is that PII detection might not be very accurate with multilingual models, but it's hard to say without proper analysis.
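
A minimal sketch of that per-record idea, assuming the langdetect package for language identification and a placeholder "comments" column; the language-to-model mapping is an example, not a Presidio default.

from langdetect import detect  # pip install langdetect; any language-ID model would do

from presidio_analyzer import AnalyzerEngine
from presidio_analyzer.nlp_engine import NlpEngineProvider
from presidio_anonymizer import AnonymizerEngine

# One spaCy model per expected language (the models listed here are examples).
configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"},
               {"lang_code": "de", "model_name": "de_core_news_sm"}],
}
nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
analyzer = AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en", "de"])
anonymizer = AnonymizerEngine()

def anonymize_row(text: str) -> str:
    lang = detect(text)               # e.g. "en", "de"
    if lang not in ("en", "de"):      # fall back to English for other languages
        lang = "en"
    results = analyzer.analyze(text=text, language=lang)
    return anonymizer.anonymize(text=text, analyzer_results=results).text

df["comments"] = df["comments"].apply(anonymize_row)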

AkshayDube11 commented 3 years ago

Dear @omri374,

As of now, my analysis deals with text in the following languages: Czech, Slovak, Hungarian, German, English, and Swedish. What are your thoughts on using "xx_sent_ud_sm", which is listed as more accurate than the "xx_ent_wiki_sm" multilingual spaCy model on https://spacy.io/usage/models#download?

If it is more accurate, please share a test case, if possible, for implementing Presidio with this spaCy model.

omri374 commented 3 years ago

Hi @AkshayDube11, the implementation of a new spaCy model is similar to what you already have in the code:

configuration = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "xx_sent_ud_sm"}],
}

nlp_engine = NlpEngineProvider(nlp_configuration=configuration).create_engine()
batch_analyzer = BatchAnalyzerEngine(nlp_engine=nlp_engine)
batch_anonymizer = BatchAnonymizerEngine()

AkshayDube11 commented 3 years ago

Hi @omri374, thank you for the sample code. I will implement it and report back with my findings.

Also, please share your thoughts on the concern about the model's accuracy in identifying incorrect PII entities in the languages I mentioned in my earlier message in this thread.

omri374 commented 3 years ago

At a very high level, the process for improving the PII detection rate with Presidio is the following:

  1. Come up with a dataset with labels for PII entities to be used for evaluation.
  2. Decide on an evaluation metric. In presidio-research we use F2.5 or F2 (F-beta scores with beta = 2.5 or 2), which give more weight to recall than to precision.
  3. Run and evaluate the vanilla Presidio (with the default spaCy model or with another spaCy model if more languages are required).
  4. Calculate metrics and analyze false positive or false negative examples. See this notebook for example.
  5. Decide whether results could be improved using rule-based approaches. Some examples:
    • Remove existing recognizers.
    • Modify existing recognizers: their logic, patterns, confidence scores, and which entities are detected by which recognizer.
    • Add new logic (regex, deny-list, or other rule-based logic) that specifically addresses the entities at hand. See this doc to get started and this one for best practices.
    • Experiment with different configurations of these recognizers and how they should play together.
    • Replace the underlying spaCy/stanza model with another one. See more information here.
    • Connect to a 3rd party service like Text Analytics PII.
    • Add a new recognizer leveraging an ML model from a framework other than spaCy or Stanza (like flair).
  6. Decide if model fine-tuning/training is necessary. If yes, either train/fine-tune a spaCy/Stanza model, or use another framework such as transformers, flair, or CRF.
  7. Additional options for improvement:
    • Collect and label more data.
    • Create more synthetic data (e.g. the presidio-research data generator).
    • Train a model using weak supervision.
    • Separate entity detection into different models, each detecting one entity (see this paper for inspiration).
    • Evaluate different pre-trained and transfer learning techniques.
    • In structured/semi-structured data settings, the actual entity value could appear elsewhere in the data. In that case, ad-hoc recognizers could be used to automatically create deny-lists of entity values. For example, if a row contains a first name, a last name, and free text, one could create an ad-hoc recognizer with the first and last names as a deny-list, and these would automatically be identified as PII if they appear in the free text (see the sketch after this list).
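
A minimal sketch of that last idea, assuming placeholder "first_name", "last_name", and "free_text" columns; the ad_hoc_recognizers argument is available in recent presidio-analyzer versions.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer

analyzer = AnalyzerEngine()

for _, row in df.iterrows():
    # Build a per-row deny-list from the structured name columns.
    names_recognizer = PatternRecognizer(
        supported_entity="PERSON",
        deny_list=[row["first_name"], row["last_name"]],
    )
    results = analyzer.analyze(
        text=row["free_text"],
        language="en",
        ad_hoc_recognizers=[names_recognizer],
    )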

Hope this helps!

omri374 commented 3 years ago

Closing for now, feel free to re-open if you have any more questions or issues.