microsoft / presidio

Context aware, pluggable and customizable data protection and de-identification SDK for text and images
https://microsoft.github.io/presidio
MIT License

Detecting :: as IPv6 Address #1395

Open troy256 opened 5 months ago

troy256 commented 5 months ago

Describe the bug: We are indirectly using this library for PII detection on text coming from a GenAI-based coding assistant. However, it flags every instance of "::" as PII because IPv6 addresses can contain this sequence. The same string is used routinely in Perl, C++, and PHP. For example:

use strict;
use warnings;
use LWP::UserAgent;  # <-- detected as PII

Expected behavior: Ideally, the IPv6 detection would be smart enough to tell programming-language syntax apart from an actual IPv6 address.

Additional context: Very similar to issue #907.

SharonHart commented 5 months ago

"::" is a valid IPv6 address. The solution might be to split the regex into two patterns and lower the score for a bare "::". Anyone up for fixing it?
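
To make the "split the regex and lower the score" idea concrete, here is a minimal stdlib-only sketch. It does not use Presidio's actual recognizer or regex: the two patterns, the names `FULL_IPV6` and `BARE_DOUBLE_COLON`, and the scores 0.6 and 0.1 are all assumptions chosen for illustration. It also ignores compressed addresses such as 2001:db8::1, which a real fix would have to handle.

```python
import re

# Hypothetical, simplified patterns; Presidio's real IPv6 regex is more thorough.
# Full form only: eight colon-separated groups of 1-4 hex digits.
FULL_IPV6 = re.compile(
    r"(?<![0-9A-Fa-f:])(?:[0-9A-Fa-f]{1,4}:){7}[0-9A-Fa-f]{1,4}(?![0-9A-Fa-f:])"
)
# A "::" that is not part of a longer run of colons.
BARE_DOUBLE_COLON = re.compile(r"(?<!:)::(?!:)")

# Illustrative scores (assumed values, not Presidio's).
PATTERNS = [
    ("ipv6_full", FULL_IPV6, 0.6),
    ("ipv6_bare_double_colon", BARE_DOUBLE_COLON, 0.1),
]


def find_ipv6_candidates(text):
    """Return (pattern_name, matched_text, score) for every pattern match."""
    hits = []
    for name, pattern, score in PATTERNS:
        for match in pattern.finditer(text):
            hits.append((name, match.group(), score))
    return hits


def above_threshold(hits, threshold=0.5):
    """Keep only hits whose score reaches the caller's threshold."""
    return [hit for hit in hits if hit[2] >= threshold]
```

With this split, the "::" in `use LWP::UserAgent;` is still found, but with a score low enough that a caller filtering on a threshold would drop it, while a fully written-out address keeps the higher score.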

troy256 commented 5 months ago

Even though :: is a valid IPv6 address, it's not personally identifiable and is effectively anonymous. So maybe skip over it?
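
One way to "skip over it" could be a post-filter that discards the unspecified address after detection. The sketch below uses Python's standard `ipaddress` module; the helper name `is_identifying_ipv6` is hypothetical, and this is not how Presidio validates matches today.

```python
import ipaddress


def is_identifying_ipv6(candidate):
    """Return True only for strings that parse as an IPv6 address other than '::'."""
    try:
        address = ipaddress.IPv6Address(candidate)
    except ValueError:
        # Not a parseable IPv6 address at all (for example, Perl or C++ syntax).
        return False
    # "::" is the unspecified address and points at no particular host.
    return not address.is_unspecified
```

A detector could call this on each matched span and keep only the spans for which it returns True.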

omri374 commented 5 months ago

A simple workaround would be to add :: as an allow_list term.

troy256 commented 5 months ago

Can that be done with configuration or is that a code change?

omri374 commented 5 months ago

Configuration: https://microsoft.github.io/presidio/tutorial/13_allow_list/
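
For readers who want to see what an allow list does before opening the tutorial, here is a stdlib-only illustration of the filtering step. In Presidio itself, the linked tutorial passes the terms through the `allow_list` argument of `AnalyzerEngine.analyze`; the `Detection` class and `apply_allow_list` function below are hypothetical stand-ins, and the sketch assumes exact matching on the detected span.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    """Hypothetical stand-in for a detection result: entity type plus span offsets."""
    entity_type: str
    start: int
    end: int


def apply_allow_list(text, detections, allow_list):
    """Drop detections whose matched text is exactly one of the allowed terms."""
    allowed = set(allow_list)
    return [d for d in detections if text[d.start:d.end] not in allowed]
```

Called with `allow_list=["::"]`, the function removes a detection covering only "::" and leaves a detection covering a longer address such as 2001:db8::1 in place.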