Open RobDickinson opened 1 month ago
The first case turns out to be easy to solve with ignoredPatterns, since Unix timestamps in milliseconds are currently 13 digits long and start with a predictable prefix.
CreditCard x = new CreditCard();
x.setIgnoredPatterns(List.of(new IgnoredPattern("1[5-8][0-9]{11}"))); // ignore unix timestamps
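To see what that expression covers, here is a minimal sketch using only java.util.regex and java.time. It assumes the ignored pattern has to match the whole candidate value, which I have not confirmed against Phileas; the class and method names are made up for illustration. The range works out to roughly mid-2017 through early 2030.

```java
import java.time.Instant;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Phileas.
final class TimestampPatternSketch {
    // Same expression as the ignored pattern above.
    static final Pattern MILLIS = Pattern.compile("1[5-8][0-9]{11}");

    // Assumption: the whole candidate must match, not just a substring.
    static boolean looksLikeMillisTimestamp(String candidate) {
        return MILLIS.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        // Smallest and largest values the expression can match, as instants.
        System.out.println(Instant.ofEpochMilli(1_500_000_000_000L));
        System.out.println(Instant.ofEpochMilli(1_899_999_999_999L));
        // A millisecond timestamp versus a 16-digit test card number.
        System.out.println(looksLikeMillisTimestamp("1647725122227"));
        System.out.println(looksLikeMillisTimestamp("4111111111111111"));
    }
}
```

Timestamps outside that range would need a wider expression, so the prefix is worth revisiting if older or far-future values show up in the logs.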
With ignoredPatterns set, Phileas still identifies these spans but does not apply them:
String value = "{ \"valid_until_millis\":\"1647725122227\" }";
FilterResponse fr = r.filter(value);
expect(fr.explanation().appliedSpans().size()).toEqual(0);
expect(fr.explanation().identifiedSpans().size()).toEqual(1);
expect(fr.explanation().identifiedSpans().get(0).getConfidence()).toEqual(0.9);
expect(fr.explanation().identifiedSpans().get(0).getFilterType().toString()).toEqual("credit-card");
expect(fr.explanation().identifiedSpans().get(0).getText()).toEqual("1647725122227");
expect(fr.filteredText()).toEqual(value);
👆 no changes to Phileas required to solve this first part
That is really awesome. Do you think it would be beneficial to include that ignored pattern as an option in the filter profile, just to keep the user from having to set it manually? There could be a boolean on CreditCard called ignoreUnixTimestamps, and when true it checks the credit card against that pattern.
Well, I'm applying this ignoredPattern in multiple places already, so if Phileas provided an option like that, I'd definitely use it. Beyond the reuse aspect, seems like a nice improvement to what Phileas understands about credit cards, for little new code 🤔
I agree. Wrote #130 to capture it separately from this issue.
The changes proposed in 129-credit-card-dashes will wrap up the rest of this one
Sorry this turned out to be a multi-part issue, I'll try to keep things more atomic ⚛️
I'm using Phileas to redact logging data, and see two interesting patterns that result in false positives on credit cards.
What is interesting is that Luhn checks, while certainly helpful, are not sufficient to keep random data from being flagged as credit cards (~5% of UUID or timestamp fields may contain values that pass a Luhn check).
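For reference, a minimal sketch of the standard Luhn check in plain Java, showing why a timestamp can pass it. This is the textbook algorithm, not Phileas's own implementation, and the class name is made up for illustration.

```java
// Hypothetical helper, not part of Phileas.
final class LuhnSketch {
    // Returns true when the digit string passes the Luhn checksum.
    static boolean passesLuhn(String digits) {
        if (digits.isEmpty()) {
            return false;
        }
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled
        for (int i = digits.length() - 1; i >= 0; i--) {
            char c = digits.charAt(i);
            if (c < '0' || c > '9') {
                return false; // reject anything that is not a plain digit
            }
            int d = c - '0';
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9; // same as adding the two digits of the product
                }
            }
            sum += d;
            doubleIt = !doubleIt;
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        // A millisecond timestamp and a well-known 16-digit test card number.
        System.out.println(passesLuhn("1647725122227"));
        System.out.println(passesLuhn("4111111111111111"));
    }
}
```

The checksum only looks at digits, so it cannot tell a 13-digit timestamp from a 13-digit card number.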
The solution to the first case could be reducing credit card confidence if the matched value is in an expected range (like timestamps over the last year and 3 months into the future). I haven't done the math but seems like that's a small number of values with valid LUHN checksums to exclude if we're considering a reasonably small time range.
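On the math: for any fixed run of leading digits, exactly one final digit makes the Luhn checksum come out right, so about one in ten consecutive values passes. A rough sketch that counts this over a single second of millisecond timestamps (plain Java, names made up for illustration, window chosen arbitrarily):

```java
// Hypothetical helper, not part of Phileas.
final class LuhnDensitySketch {
    // Standard Luhn checksum over a non-negative number's decimal digits.
    static boolean passesLuhn(long value) {
        int sum = 0;
        boolean doubleIt = false; // the rightmost digit is not doubled
        do {
            int d = (int) (value % 10);
            if (doubleIt) {
                d *= 2;
                if (d > 9) {
                    d -= 9;
                }
            }
            sum += d;
            doubleIt = !doubleIt;
            value /= 10;
        } while (value > 0);
        return sum % 10 == 0;
    }

    // Counts Luhn-valid values in the half-open range [from, to).
    static long countValid(long from, long to) {
        long count = 0;
        for (long v = from; v < to; v++) {
            if (passesLuhn(v)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // One second of millisecond timestamps, starting on a multiple of 10.
        System.out.println(countValid(1_647_725_122_000L, 1_647_725_123_000L));
    }
}
```

Scaled to a 15-month window that is still billions of candidate values, so excluding a time range acts as a range check on the value itself and not as a short list of known collisions.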
The solution to the second case could be reducing credit card confidence when the match is found within the context of a larger string. Confidence in phone numbers is reduced if the phone number is embedded within a larger string, and we've found this extremely helpful in eliminating false positives. It would be very helpful if credit card filtering had a similar behavior.
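A rough sketch of what that context check could look like, in plain Java. Everything here is hypothetical: the class, the method signature, the choice of which neighbouring characters count as "embedded", and the 0.5 penalty are assumptions for illustration and are not taken from Phileas or its phone number filter.

```java
// Hypothetical helper, not part of Phileas.
final class ContextConfidenceSketch {
    // Hypothetical penalty factor; not a value taken from Phileas.
    static final double EMBEDDED_PENALTY = 0.5;

    // Characters that suggest the match continues into a larger token.
    static boolean isTokenChar(char c) {
        return Character.isLetterOrDigit(c) || c == '-' || c == '_';
    }

    // Lowers confidence when the span [start, end) touches a token character
    // on either side; otherwise returns the confidence unchanged.
    static double adjust(String text, int start, int end, double confidence) {
        boolean joinedLeft = start > 0 && isTokenChar(text.charAt(start - 1));
        boolean joinedRight = end < text.length() && isTokenChar(text.charAt(end));
        return (joinedLeft || joinedRight) ? confidence * EMBEDDED_PENALTY : confidence;
    }

    public static void main(String[] args) {
        String embedded = "trace=9f3c-4111111111111111-ab12";
        int start = embedded.indexOf("4111");
        System.out.println(adjust(embedded, start, start + 16, 0.9));

        String standalone = "card 4111111111111111 on file";
        start = standalone.indexOf("4111");
        System.out.println(adjust(standalone, start, start + 16, 0.9));
    }
}
```

A quoted JSON value like the timestamp example has quote characters on both sides, so a check like this would not lower its confidence; that case still needs something like the ignored pattern.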
Unfortunately there is no obvious/easy workaround, but seems like improved confidence estimation for credit cards would be generally useful (since detecting and redacting credit cards is a universal requirement for PII engines)