solvvy / redact-pii

Remove personally identifiable information from text.
MIT License
187 stars 56 forks source link

Fix Google DLP redactor bug with overlapping and repeating tokens #31

Closed JaredBett closed 5 years ago

JaredBett commented 5 years ago

also update some dependencies to get security fixes

[ch31679] [ch29498] [ch27555]

geekflyer commented 5 years ago

just one more thought: I think the current code might not fully handle the edge case where finding B is fully contained in the range of finding A.

E.g. Imagine the hypothetical example John Francis Doe which may yield FULL_NAME as finding for the full range and MIDDLE_NAME for Francis.

JaredBett commented 5 years ago

That case should work fine. I just added a test to show. The ranges are sorted by start so the middle name finding would come after the full name. It will be considered an overlap because the start of the middle name finding would be less than the end of the full name. If the likelihoods are the same then it will just stick with the full name finding (because it had the lower starting index). Otherwise if the middle name has higher likelihood then it will keep that finding and remove the full name.

geekflyer commented 5 years ago

ok makes sense.

Your comment just gave rise to another minor concern: What if exactly this case above happens and the middle name has the higher likelyhood.

Will the end of the redaction result be: John MIDDLE_NAME Doe?

JaredBett commented 5 years ago

Correct, it would be John MIDDLE_NAME Doe which I think is probably the best. I added another test to show this. Basically if any 2 findings overlap it will always prefer the one with the higher likelihood.

geekflyer commented 5 years ago

Yeah well, my main concern is that if John Francis Doe get's redacted to John MIDDLE_NAME Francis the redaction technically is incomplete, i.e. there's PII left.

It's probably a very rare edge case so I'm fine merging this for now, but I wanted to just bring it up and see what you think

JaredBett commented 5 years ago

True, but that would only be if the middle name had a higher likelihood. In the overlapping name cases I saw the likelihoods were all the same for the different name patterns. Plus the overlapping tokens in general I think is rare other than maybe for this name situation which we appear to handle correctly for real-world scenarios.

I think we'd have to change how the token replacements work in order to maybe get complete coverage. Right now it just does a string replacement one finding at a time based on the original PII text. But once we start replacing tokens with overlapping locations, then the text changes and the original text (e.g. "John Francis Doe" would no longer be found). We'd have to implement a more sophisticated algorithm that uses the location start and end ranges instead of the text based search and replace. We also probably couldn't do it in-place or else the range indices wouldn't match anymore (maybe we could do the replacements in descending index order or something)

JaredBett commented 5 years ago

Or I guess the other option is that when there are overlapping ranges that we collapse into a wider range that encompasses both and pick one of the info types to use for the combined range (maybe the one with the higher likelihood). That would mean we don't miss any potential PII but there is the risk that it could be too aggressive of redaction. Although we do have control over a minLikelihood parameter which could help?

geekflyer commented 5 years ago

hmm, idk. collapsing stuff into a common type is also not that trivial. In the example above it might happen that the entire name get's collapsed into MIDDLE_NAME which is certainly not what you'd want. Honestly I think like once new DLP pricing is in place we should simply try to use the built in transformation functionality of DLP (which costs extra but should be worthwhile after their pricing change) instead of wrapping ourselves on reinventing the wheel ;-)

For now I'm okay with keeping the logic as is with the risk of missing few edge cases.