Closed AAnirudh07 closed 10 months ago
Hi @AAnirudh07, you can do this with the modified logic below, and customize it further as needed.
from presidio_analyzer import AnalyzerEngine, PatternRecognizer
from presidio_anonymizer import AnonymizerEngine
from presidio_anonymizer.entities import OperatorConfig
text_to_anonymize = "Jill, James, and Jack they lives in Pune and Mumbai"
names_recognizer = PatternRecognizer(supported_entity="NAME",
                                     deny_list=["Jill", "James", "Jack"])
analyzer = AnalyzerEngine()
analyzer.registry.add_recognizer(names_recognizer)
analyzer_results = analyzer.analyze(text=text_to_anonymize, entities=["NAME", "LOCATION"], language="en")
# Finding frequency of each entity type
df = [r.to_dict() for r in analyzer_results]
frequency_dict = {}
for item in df:
    entity_type = item['entity_type']
    if entity_type in frequency_dict:
        frequency_dict[entity_type] += 1
    else:
        frequency_dict[entity_type] = 1
print("Frequency of each entity type: ", frequency_dict)
# Increasing each frequency by 1
for key in frequency_dict:
    frequency_dict[key] += 1
starting_ids = frequency_dict
print("Modified frequency of each entity type", starting_ids)
# Custom anonymization function
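def anonymize(text, et):
    if text == "PII":  # Presidio passes the string 'PII' to check if the function returns a string
        return "PII"
    # Presidio invokes this operator from the last entity to the first,
    # so counting down from (count + 1) yields IDs in text order.
    global starting_ids
    starting_ids[et] -= 1
    return f"<{et}_{starting_ids[et]}>"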
# Define operators for anonymization for each entity type
operators = {}
for entity_type, count in frequency_dict.items():
    operators[entity_type] = OperatorConfig("custom", {"lambda": lambda text, et=entity_type: anonymize(text, et)})
anonymizer = AnonymizerEngine()
anonymized_results = anonymizer.anonymize(
    text=text_to_anonymize, analyzer_results=analyzer_results, operators=operators
)
print(anonymized_results.text)
You will get output like the following:
Frequency of each entity type: {'NAME': 3, 'LOCATION': 2}
Modified frequency of each entity type {'NAME': 4, 'LOCATION': 3}
<NAME_1>, <NAME_2>, and <NAME_3> they lives in <LOCATION_1> and <LOCATION_2>
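Note that the countdown trick above only works while the custom operator is called from the last entity to the first. If you would rather not depend on that, one option is to number the spans yourself after sorting them by start offset. The following is a minimal sketch in plain Python, not Presidio API: `number_entities` is a hypothetical helper, and the `(entity_type, start, end)` tuples stand in for the `entity_type`, `start`, and `end` fields of the analyzer results. It assumes the spans do not overlap.

```python
from collections import defaultdict


def number_entities(text, spans):
    """Replace each (entity_type, start, end) span with <TYPE_n>.

    IDs start at 1 per entity type and follow the order in which the
    spans appear in the text. Spans are assumed not to overlap.
    """
    counters = defaultdict(int)
    pieces = []
    cursor = 0
    # Sort by start offset so numbering follows reading order.
    for entity_type, start, end in sorted(spans, key=lambda s: s[1]):
        counters[entity_type] += 1
        pieces.append(text[cursor:start])  # untouched text before the span
        pieces.append(f"<{entity_type}_{counters[entity_type]}>")
        cursor = end
    pieces.append(text[cursor:])  # untouched text after the last span
    return "".join(pieces)


text = "Jill, James, and Jack they lives in Pune and Mumbai"
# Hypothetical detections, deliberately listed out of order.
detections = [("LOCATION", "Mumbai"), ("NAME", "Jack"), ("NAME", "Jill"),
              ("LOCATION", "Pune"), ("NAME", "James")]
spans = [(t, text.index(w), text.index(w) + len(w)) for t, w in detections]

result = number_entities(text, spans)
print(result)
```

With real analyzer output you would build `spans` from each result's `entity_type`, `start`, and `end` instead of the hard-coded list.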
this is great, ty!
Hello,
I've noticed that Presidio processes identified PII in reverse. I have a use-case where I need to label PII with an identifier of the form. However, with this setup, the PII that appears first gets the largest ID. Here's a simple example:
This outputs:
Is there a way to get PII in the same order as they appear in the text? Thank you!
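To make the behaviour concrete without depending on Presidio, here is a toy sketch of what I mean. It is not Presidio's code: `replace_right_to_left` is a hypothetical function that replaces `(start, end)` spans from the end of the text backwards (which keeps the earlier offsets valid) and hands out IDs in the order it processes them, so the first span in the text ends up with the largest ID.

```python
def replace_right_to_left(text, spans):
    """Replace (start, end) spans starting from the end of the text.

    Working backwards means earlier offsets stay valid after each
    replacement, but a counter incremented in processing order gives
    the first span in the text the largest ID.
    """
    counter = 0
    for start, end in sorted(spans, reverse=True):
        counter += 1
        text = text[:start] + f"<NAME_{counter}>" + text[end:]
    return text


text = "Jill and Jack"
spans = [(0, 4), (9, 13)]  # offsets of "Jill" and "Jack"
result = replace_right_to_left(text, spans)
print(result)  # <NAME_2> and <NAME_1>
```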