philterd / phileas

The open source PII and PHI redaction engine
https://www.philterd.ai
Apache License 2.0

Improve confidence estimation for credit card numbers #120

Open RobDickinson opened 1 month ago

RobDickinson commented 1 month ago

I'm using Phileas to redact logging data, and see two interesting patterns that result in false positives on credit cards.

{ "data": {"quote_token":"null", "time_processed":"1647725122146" }}
👆 fails LUHN check and is ignored by default (which is good!)

Result from System.currentTimeMillis masked as credit card: (confidence = 0.9)
{ "quote_token":"...", "time_processed":"1647725122227" }
{ "quote_token":"...", "time_processed":"*************" }

Portions of Java UUID masked as credit card: (confidence = 0.9)
{ "query":" { quote(account_token:\"47223179-9330-4259-b66c-f2db26efb20c\", amount_usd:\"62\", coin_type:\"BTC\" )}"}
{ "query":" { quote(account_token:\"******************-b66c-f2db26efb20c\", amount_usd:\"62\", coin_type:\"BTC\" )}"}

What is interesting is that LUHN checks (while certainly helpful) are not sufficient on their own to stop random numeric data from being flagged as a credit card. (~5% of UUID or timestamp fields may contain a digit sequence with a valid LUHN checksum)
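For reference, here is a minimal sketch of a Luhn check applied to the values quoted above. This is a standalone illustration, not Phileas's own code; the class and method names are made up for the example, and the UUID case is tested with its dashes removed.

```java
public class LuhnCheck {

    // Returns true when the digit string passes the Luhn checksum.
    static boolean passesLuhn(String digits) {
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            int d = digits.charAt(i) - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(passesLuhn("1647725122146"));    // false: first timestamp
        System.out.println(passesLuhn("1647725122227"));    // true: second timestamp
        System.out.println(passesLuhn("4722317993304259")); // true: UUID prefix, dashes removed
    }
}
```

Neither passing value is a card number, which is why the checksum alone cannot settle the question.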

The solution to the first case could be reducing credit card confidence if the matched value is in an expected range (like timestamps over the last year and 3 months into the future). I haven't done the math but seems like that's a small number of values with valid LUHN checksums to exclude if we're considering a reasonably small time range.
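A rough sketch of that range check, assuming the matched value is read as epoch milliseconds. The helper name, the 365-day and 90-day bounds, and the fixed clock are all assumptions for illustration; nothing here is an existing Phileas API.

```java
import java.time.Duration;
import java.time.Instant;

public class TimestampWindow {

    // Hypothetical helper: true when a 13-digit value, read as epoch milliseconds,
    // falls between 365 days before and 90 days after `now`.
    static boolean looksLikeRecentTimestamp(String digits, Instant now) {
        if (!digits.matches("[0-9]{13}")) {
            return false;
        }
        Instant candidate = Instant.ofEpochMilli(Long.parseLong(digits));
        Instant earliest = now.minus(Duration.ofDays(365));
        Instant latest = now.plus(Duration.ofDays(90));
        return !candidate.isBefore(earliest) && !candidate.isAfter(latest);
    }

    public static void main(String[] args) {
        // Fixed clock so the example is reproducible.
        Instant now = Instant.parse("2022-03-20T00:00:00Z");
        System.out.println(looksLikeRecentTimestamp("1647725122227", now)); // true
        System.out.println(looksLikeRecentTimestamp("4111111111111", now)); // false
    }
}
```

A filter could lower the confidence of a match when a check like this returns true, rather than dropping the span outright.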

The solution to the second case could be reducing credit card confidence when the match is found within the context of a larger string. Confidence in phone numbers is reduced if the phone number is embedded within a larger string, and we've found this extremely helpful in eliminating false positives. It would be very helpful if credit card filtering had a similar behavior.
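One way this could look, as a sketch only: treat a match as embedded when the character directly before or after it is a letter, digit, or dash, and scale the confidence down. The helper names and the 0.5 factor are hypothetical, and this is not how Phileas's phone number filter is known to work.

```java
public class EmbeddedMatch {

    // Hypothetical rule: characters that suggest the match is part of a longer token.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-';
    }

    // True when the match at [start, end) touches a token character on either side.
    static boolean isEmbedded(String text, int start, int end) {
        return (start > 0 && isTokenChar(text.charAt(start - 1)))
            || (end < text.length() && isTokenChar(text.charAt(end)));
    }

    // Hypothetical adjustment; the 0.5 factor is illustrative only.
    static double adjustConfidence(double confidence, String text, int start, int end) {
        return isEmbedded(text, start, end) ? confidence * 0.5 : confidence;
    }

    public static void main(String[] args) {
        String query = "account_token:\"47223179-9330-4259-b66c-f2db26efb20c\"";
        int start = query.indexOf("47223179");
        int end = start + "47223179-9330-4259".length();
        // The match is followed by "-b66c", so it counts as embedded.
        System.out.println(adjustConfidence(0.9, query, start, end));

        String plain = "card 4111111111111111 ok";
        int s = plain.indexOf("4111");
        // Surrounded by spaces, so the confidence is left unchanged.
        System.out.println(adjustConfidence(0.9, plain, s, s + 16));
    }
}
```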

Unfortunately there is no obvious/easy workaround, but seems like improved confidence estimation for credit cards would be generally useful (since detecting and redacting credit cards is a universal requirement for PII engines)

RobDickinson commented 1 month ago

The first case turns out to be easy to solve with ignoredPatterns, since Unix timestamps in milliseconds are currently 13 digits long and start with a predictable prefix.

CreditCard x = new CreditCard();
// ignore unix timestamps in milliseconds: 13 digits starting with 15 through 18
x.setIgnoredPatterns(List.of(new IgnoredPattern("1[5-8][0-9]{11}")));

With ignoredPatterns set, Phileas still identifies these spans but does not apply them:

String value = "{ \"valid_until_millis\":\"1647725122227\" }";
FilterResponse fr = r.filter(value);
expect(fr.explanation().appliedSpans().size()).toEqual(0);
expect(fr.explanation().identifiedSpans().size()).toEqual(1);
expect(fr.explanation().identifiedSpans().get(0).getConfidence()).toEqual(0.9);
expect(fr.explanation().identifiedSpans().get(0).getFilterType().toString()).toEqual("credit-card");
expect(fr.explanation().identifiedSpans().get(0).getText()).toEqual("1647725122227");
expect(fr.filteredText()).toEqual(value);

👆 no changes to Phileas required to solve this first part
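To see what that expression covers, here is a small standalone check using java.util.regex. It assumes the pattern is applied as a full match against the candidate text, which may differ from how Phileas evaluates ignored patterns; the class name is made up for the example.

```java
import java.time.Instant;
import java.util.regex.Pattern;

public class TimestampPattern {

    // Same expression as the ignored pattern above.
    static final Pattern UNIX_MILLIS = Pattern.compile("1[5-8][0-9]{11}");

    public static void main(String[] args) {
        System.out.println(UNIX_MILLIS.matcher("1647725122227").matches()); // true
        System.out.println(UNIX_MILLIS.matcher("4111111111111").matches()); // false

        // Smallest and largest values the pattern can match, read as epoch milliseconds.
        System.out.println(Instant.ofEpochMilli(1_500_000_000_000L)); // mid-July 2017
        System.out.println(Instant.ofEpochMilli(1_899_999_999_999L)); // mid-March 2030
    }
}
```

So the pattern covers millisecond timestamps from mid-2017 to early 2030 and would need widening after that.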

jzonthemtn commented 1 month ago

That is really awesome. Do you think it would be beneficial to include that ignored pattern as an option in the filter profile, just to keep the user from having to set it manually? There could be a boolean on CreditCard called ignoreUnixTimestamps, and when true it checks the credit card against that pattern.

RobDickinson commented 4 weeks ago

Well, I'm applying this ignoredPattern in multiple places already -- so if Phileas provided an option like that, I'd definitely use it. Beyond the reuse aspect, seems like a nice improvement to what Phileas understands about credit cards, for little new code 🤔

jzonthemtn commented 4 weeks ago

I agree. Wrote #130 to capture it separately from this issue.

RobDickinson commented 2 weeks ago

The changes proposed in 129-credit-card-dashes will wrap up the rest of this one

Sorry this turned out to be a multi-part issue, I'll try to keep things more atomic ⚛️