microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Incorporate Text Anonymization Benchmark (TAB) #823

Open lalitpagaria opened 2 years ago

lalitpagaria commented 2 years ago

A recently released paper introduces a new benchmark:

Text Anonymization Benchmark (TAB), a brand-new dataset containing 1268 court cases from the European Court of Human Rights (ECHR) enriched with detailed annotations regarding the personal information mentioned in each document, including semantic categories, identifier types, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected.

It would be nice to add support for it, since it also covers co-reference relations. I'm not sure where this fits, but I'm creating an issue so we can discuss it.

Code: https://github.com/NorskRegnesentral/text-anonymisation-benchmark
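
To make the discussion a bit more concrete, here is a rough sketch of how spans marked "to be masked" could be compared against a detector's output. Everything in it is an assumption for illustration: the `Span` type, the `masking_recall` helper and the sample offsets are hypothetical, and it reflects neither TAB's actual file format nor Presidio's API.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Span:
    """Hypothetical character span; not TAB's real annotation schema."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset


def is_covered(gold: Span, predicted: Sequence[Span]) -> bool:
    """Return True if every character of `gold` lies inside the union of `predicted`."""
    position = gold.start
    for span in sorted(predicted, key=lambda s: s.start):
        if span.start > position:
            break  # gap: the character at `position` is left unmasked
        position = max(position, span.end)
        if position >= gold.end:
            return True
    return position >= gold.end


def masking_recall(gold: Sequence[Span], predicted: Sequence[Span]) -> float:
    """Fraction of gold 'must mask' spans that are fully covered by predictions."""
    if not gold:
        return 1.0
    covered = sum(is_covered(g, predicted) for g in gold)
    return covered / len(gold)


# Made-up example text and offsets, purely for illustration.
text = "Mr. John Smith was born in Oslo."
gold_spans = [Span(4, 14), Span(27, 31)]   # "John Smith", "Oslo"
predicted_spans = [Span(4, 14)]            # detector found only the name

print(masking_recall(gold_spans, predicted_spans))
```

A span-level recall like this only counts a gold span as protected when it is masked in full. A benchmark integration would probably also need to take the co-reference relations into account, which this sketch ignores.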

omri374 commented 2 years ago

Thanks! This is really interesting. We will look into integrating it, perhaps in the Presidio research repo.