word boundaries - Githubissues

Hi @Cherchercher. Overlaps are expected, since we run multiple recognizers independently and more than one of them can flag the same text.

Some things to consider in this case:

- Use a score threshold to ignore identified PII with a low score.
- Come up with logic to handle these overlaps. For example, this code (in our research repo) handles overlaps in a naive way: https://github.com/microsoft/presidio-research/blob/65f4239cd41360b362252fbd0557231a23e52fc3/presidio_evaluator/span_to_tag.py#L60 It looks at each token's prediction score to come up with a non-overlapping representation. The unit tests may give a better sense of how this works: https://github.com/microsoft/presidio-research/blob/65f4239cd41360b362252fbd0557231a23e52fc3/tests/test_span_to_tag.py#L160
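To make the two suggestions concrete, here is a minimal sketch of what such application-side logic could look like. It is not the linked `span_to_tag` code, which works token by token. This version is a simpler span-level greedy pass: keep the highest-scoring span and drop any lower-scoring span that overlaps one already kept. The `Span` class, `filter_by_score`, and `resolve_overlaps` are hypothetical names made up for illustration; they are not part of Presidio, and you would map your analyzer results onto whatever structure you use.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for one recognizer result (not a Presidio class)."""
    entity_type: str
    start: int   # inclusive character offset
    end: int     # exclusive character offset
    score: float


def filter_by_score(spans: List[Span], threshold: float) -> List[Span]:
    """Drop every span whose score is below the threshold."""
    return [s for s in spans if s.score >= threshold]


def resolve_overlaps(spans: List[Span]) -> List[Span]:
    """Greedy pass: visit spans from highest to lowest score and keep a span
    only if it does not overlap any span that was already kept."""
    kept: List[Span] = []
    # Ties on score are broken by preferring the longer span, then the earlier one.
    ordered = sorted(spans, key=lambda s: (-s.score, -(s.end - s.start), s.start))
    for cand in ordered:
        if all(cand.end <= k.start or cand.start >= k.end for k in kept):
            kept.append(cand)
    return sorted(kept, key=lambda s: s.start)


# Made-up results for illustration only.
results = [
    Span("PERSON", 0, 10, 0.85),
    Span("LOCATION", 6, 10, 0.40),       # overlaps PERSON with a lower score
    Span("PHONE_NUMBER", 20, 32, 0.75),
    Span("US_SSN", 20, 31, 0.05),        # below the threshold
]
clean = resolve_overlaps(filter_by_score(results, threshold=0.3))
for span in clean:
    print(span)
```

Whether the highest score should always win, or whether a longer span or a specific entity type should take priority, depends on the application, so treat the tie-breaking rule here as one possible choice.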

The demo (and Presidio itself) do not handle overlaps, because such logic should be owned by the application rather than the engine. This keeps the engine as generic as possible.

microsoft / presidio

word boundaries #254