Incorporate the Text Anonymization Benchmark (TAB)

A recently published paper introduces a new benchmark:

Text Anonymization Benchmark (TAB), a brand-new dataset containing 1268 court cases from the European Court of Human Rights (ECHR) enriched with detailed annotations regarding the personal information mentioned in each document, including semantic categories, identifier types, confidential attributes, and co-reference relations. Compared to previous work, the TAB corpus is designed to go beyond traditional de-identification (which is limited to the detection of predefined semantic categories), and explicitly marks which text spans ought to be masked in order to conceal the identity of the person to be protected.
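To make the difference from category-based de-identification concrete, here is a minimal sketch of masking driven by identifier type rather than by semantic category. The record layout, the field names (`entity_id`, `start`, `end`, `identifier_type`) and the label values (`DIRECT`, `QUASI`, `NO_MASK`) are assumptions for illustration and have not been checked against the released TAB files.

```python
from typing import Dict, List

TEXT = "Mr John Smith, a lawyer, was born in 1970 in Oslo."


def mention(phrase: str, entity_id: str, identifier_type: str) -> Dict:
    """Build one hypothetical annotation, locating the phrase in TEXT."""
    start = TEXT.index(phrase)
    return {
        "entity_id": entity_id,
        "start": start,
        "end": start + len(phrase),
        # Assumed label set: DIRECT, QUASI, NO_MASK.
        "identifier_type": identifier_type,
    }


MENTIONS: List[Dict] = [
    mention("John Smith", "e1", "DIRECT"),
    mention("lawyer", "e2", "NO_MASK"),
    mention("1970", "e3", "QUASI"),
    mention("Oslo", "e4", "QUASI"),
]


def mask(text: str, mentions: List[Dict], placeholder: str = "***") -> str:
    """Replace every span marked as a direct or quasi identifier."""
    to_mask = [m for m in mentions if m["identifier_type"] in ("DIRECT", "QUASI")]
    # Work from the end of the text so earlier offsets stay valid.
    for m in sorted(to_mask, key=lambda m: m["start"], reverse=True):
        text = text[: m["start"]] + placeholder + text[m["end"] :]
    return text


MASKED = mask(TEXT, MENTIONS)
print(MASKED)  # Mr ***, a lawyer, was born in *** in ***.
```

Here "lawyer" is detected as an entity but left in place because it is labelled as not needing masking, which is the kind of decision a purely category-based detector cannot express.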

It would be nice to add support for it, since it also covers co-reference relations. I'm not sure where this fits, but I'm creating an issue so we can discuss it.
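One place it might fit is evaluation. As a rough sketch of why the co-reference links matter, the function below counts an entity as protected only if every one of its mentions is covered by a predicted span. The gold record layout (`entity_id`, `start`, `end`) is the same assumed layout as above, predicted spans are plain `(start, end)` pairs from whatever detector is under test, and the metric is a simplified illustration, not necessarily the one defined in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Span = Tuple[int, int]


def is_covered(mention: Span, predicted: Iterable[Span]) -> bool:
    """True if a single predicted span fully contains the gold mention."""
    start, end = mention
    return any(p_start <= start and end <= p_end for p_start, p_end in predicted)


def entity_recall(gold: List[Dict], predicted: List[Span]) -> float:
    """Fraction of gold entities whose mentions are ALL covered."""
    by_entity = defaultdict(list)
    for m in gold:
        by_entity[m["entity_id"]].append((m["start"], m["end"]))
    if not by_entity:
        return 1.0  # nothing to protect
    protected = sum(
        all(is_covered(span, predicted) for span in spans)
        for spans in by_entity.values()
    )
    return protected / len(by_entity)


# Hypothetical gold annotations: entity e1 is mentioned twice.
GOLD = [
    {"entity_id": "e1", "start": 3, "end": 13},
    {"entity_id": "e1", "start": 30, "end": 35},
    {"entity_id": "e2", "start": 50, "end": 54},
]

PARTIAL = entity_recall(GOLD, [(3, 13), (50, 54)])        # misses one e1 mention
FULL = entity_recall(GOLD, [(3, 13), (30, 35), (50, 54)])
print(PARTIAL, FULL)  # 0.5 1.0
```

A mention-level score would give the first prediction 2 out of 3, while the entity-level view treats e1 as still exposed because one of its mentions leaks.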

Code: https://github.com/NorskRegnesentral/text-anonymisation-benchmark

microsoft / presidio

Incorporate the Text Anonymization Benchmark (TAB) #823