microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Reversed PII order #1276

Closed AAnirudh07 closed 10 months ago

AAnirudh07 commented 10 months ago

Hello,

I've noticed that Presidio processes identified PII in reverse order. I have a use case where I need to label each piece of PII with a numbered identifier such as <NAME_1>. However, with the setup below, the PII that appears first in the text gets the largest ID. Here's a simple example:

from presidio_analyzer import AnalyzerEngine, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

text_to_anonymize = "Jill, James, and Jack."
names_recognizer = PatternRecognizer(supported_entity="NAME",
                                     deny_list=["Jill", "James", "Jack"])
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(names_recognizer)
analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=["NAME"], language="en")

# Create a custom operator for the NAME entity
id = 0
def anonymize(text):
    if text == "PII":  # Presidio passes the string 'PII' to check if the function returns a string
        return "PII"
    global id
    id += 1
    return f"<NAME_{id}>"
operators = {"NAME": OperatorConfig("custom", {"lambda": anonymize})}

anonymizer = AnonymizerEngine()
anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize, analyzer_results=analyzer_results, operators=operators
)
print(anonymized_results.text)

This outputs:

<NAME_3>, <NAME_2>, and <NAME_1>.

Is there a way to get PII in the same order as they appear in the text? Thank you!
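
A likely reason for the reversed order is that replacing spans from the end of the text toward the start keeps the character offsets of the earlier spans valid while the string changes length. The sketch below illustrates this in plain Python, without Presidio. The replace_spans helper and the way the offsets are computed are hypothetical and only meant to reproduce the numbering seen above, not to mirror Presidio's internals.

```python
def replace_spans(text, spans, make_label):
    """Replace (start, end) spans, working from the last span to the first
    so that earlier offsets stay valid as the text changes length."""
    for start, end in sorted(spans, reverse=True):
        text = text[:start] + make_label(text[start:end]) + text[end:]
    return text


def make_counter_label():
    """Return a function that hands out <NAME_1>, <NAME_2>, ... in call order."""
    count = 0

    def next_label(_original):
        nonlocal count
        count += 1
        return f"<NAME_{count}>"

    return next_label


text = "Jill, James, and Jack."
# Hypothetical spans: locate each name and record its (start, end) offsets
spans = [(text.index(name), text.index(name) + len(name))
         for name in ["Jill", "James", "Jack"]]

result = replace_spans(text, spans, make_counter_label())
print(result)
```

Because the last span is visited first, a counter that counts up gives the smallest ID to the last name in the text.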

VMD7 commented 10 months ago

Hi @AAnirudh07, you can do this with the modified logic below, and customize it further as needed.

from presidio_analyzer import AnalyzerEngine, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig

text_to_anonymize = "Jill, James, and Jack they lives in Pune and Mumbai"
names_recognizer = PatternRecognizer(supported_entity="NAME",
                                     deny_list=["Jill", "James", "Jack"])
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(names_recognizer)
analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=["NAME", "LOCATION"], language="en")

# Count how many times each entity type was detected
frequency_dict = {}
for result in analyzer_results:
    entity_type = result.entity_type
    frequency_dict[entity_type] = frequency_dict.get(entity_type, 0) + 1
print("Frequency of each entity type: ", frequency_dict)

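# Start each counter at count + 1: anonymize() decrements before formatting,
# and the entities are visited from the end of the text backwards,
# so the last entity gets the highest ID and the first entity gets 1
starting_ids = {entity_type: count + 1 for entity_type, count in frequency_dict.items()}
print("Modified frequency of each entity type", starting_ids)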

# Custom anonymization function: counts down per entity type
def anonymize(text, et):
    if text == "PII":  # Presidio passes the string 'PII' to check if the function returns a string
        return "PII"
    # No 'global' needed: the dict is mutated in place, not rebound
    starting_ids[et] -= 1
    return f"<{et}_{starting_ids[et]}>"

# Define a custom operator for each detected entity type.
# The default argument et=entity_type binds the current type to each lambda.
operators = {}
for entity_type in frequency_dict:
    operators[entity_type] = OperatorConfig("custom", {"lambda": lambda text, et=entity_type: anonymize(text, et)})

anonymizer = AnonymizerEngine()
anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize, analyzer_results=analyzer_results, operators=operators
)
print(anonymized_results.text)

This produces the following output, which matches your requirement:

Frequency of each entity type:  {'NAME': 3, 'LOCATION': 2}
Modified frequency of each entity type {'NAME': 4, 'LOCATION': 3}
<NAME_1>, <NAME_2>, and <NAME_3> they lives in <LOCATION_1> and <LOCATION_2>
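
The countdown above relies on the entities being visited from last to first. A variant that does not depend on visiting order is to number the spans up front, sorted by start offset, and then look each label up by position. The sketch below shows the idea in plain Python, without Presidio. The number_in_text_order and apply_labels helpers and the (entity_type, start, end) tuples are hypothetical; with Presidio, the same three values would presumably come from the entity_type, start, and end attributes of each analyzer result.

```python
from collections import defaultdict


def number_in_text_order(spans):
    """Map each (start, end) span to a label numbered per entity type,
    in the order the spans appear in the text."""
    counters = defaultdict(int)
    labels = {}
    for entity_type, start, end in sorted(spans, key=lambda span: span[1]):
        counters[entity_type] += 1
        labels[(start, end)] = f"<{entity_type}_{counters[entity_type]}>"
    return labels


def apply_labels(text, labels):
    """Substitute labels from the end of the text backwards to keep offsets valid."""
    for (start, end), label in sorted(labels.items(), reverse=True):
        text = text[:start] + label + text[end:]
    return text


text = "Jill, James, and Jack they lives in Pune and Mumbai"
# Hypothetical detections, deliberately listed out of text order
detected = [("LOCATION", "Mumbai"), ("NAME", "Jack"), ("NAME", "Jill"),
            ("LOCATION", "Pune"), ("NAME", "James")]
spans = [(etype, text.index(word), text.index(word) + len(word))
         for etype, word in detected]

labels = number_in_text_order(spans)
result = apply_labels(text, labels)
print(result)
```

Since the labels are fixed before any replacement happens, the numbering stays in text order regardless of the order in which the spans are detected or replaced.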

AAnirudh07 commented 10 months ago

this is great, ty!