philterd / phileas

The open source PII and PHI redaction engine
https://www.philterd.ai
Apache License 2.0

Validate top-level domains for email addresses #131

Closed robfromboulder closed 2 months ago

robfromboulder commented 3 months ago

Would be cool to have an EmailAddress.setOnlyValidTLDs option that would filter out email addresses where the top-level domain isn't valid (according to https://www.icann.org/resources/pages/tlds-2012-02-25-en).

Benefit: Helps go beyond the regexes in detecting email addresses and brings more domain-specific knowledge about what valid email addresses look like. False positives will be reduced and it will be easier to bring attention to email addresses that are routable.

Similar features: This would be like the CreditCard.setOnlyValidNumbers option, where the Luhn check gives an extra sanity check on top of the regular expressions. EmailAddress.setOnlyValidTLDs would require parsing out the top-level domain and doing a dictionary lookup after the regular expression has matched, then dropping any matches where the TLD is unknown.
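To make the parse-then-lookup step concrete, here is a minimal sketch of what the check could look like. The class name `EmailTldCheck`, its method names, and the five-entry TLD set are all hypothetical and only for illustration. This is not the Phileas implementation, and a real check would load the full published TLD list.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Phileas: checks the TLD of an already-matched email address.
final class EmailTldCheck {

    // Tiny illustrative sample; a real list would be loaded from the full published TLD list.
    private static final Set<String> KNOWN_TLDS = Set.of("com", "org", "net", "edu", "io");

    // Returns the lower-cased text after the last dot of the domain part,
    // or null when there is no '@', no dot after the '@', or nothing after the dot.
    static String extractTld(String email) {
        int at = email.lastIndexOf('@');
        int dot = email.lastIndexOf('.');
        if (at < 0 || dot < at || dot == email.length() - 1) {
            return null;
        }
        return email.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // True only when a TLD could be extracted and it appears in the known set.
    static boolean hasKnownTld(String email) {
        String tld = extractTld(email);
        return tld != null && KNOWN_TLDS.contains(tld);
    }
}
```

A filter with the option enabled would presumably call something like `hasKnownTld` on each regex match and drop the match when it returns false.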

Compatibility: EmailAddress.setOnlyValidTLDs would be set false by default so that this new filtering isn't a breaking change for any client applications. It's also expected that the TLD check adds some CPU overhead, which is another reason to leave this disabled unless requested by the application.

Risks: I'm not sure how often the TLD list actually changes and what kind of maintenance burden that would represent. Is ICANN the best source or are multiple sources needed?

Alternate paths: The validity of the TLD could also be reflected in the confidence of a span. Email address matches currently always get a confidence of 0.9; the filter never reduces it. While the TLD check could be expressed via confidence, this would be less obvious and would not work as well as an orthogonal configuration option like onlyValidTLDs.
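For comparison, the confidence-based alternative might look roughly like the sketch below. The 0.9 base value is the one quoted above. The reduced value of 0.5, the class name `EmailConfidenceSketch`, and the three-entry TLD set are assumptions made up for this example.

```java
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch of the confidence-based alternative; not Phileas code.
final class EmailConfidenceSketch {

    static final double BASE_CONFIDENCE = 0.9;        // current value quoted in this issue
    static final double UNKNOWN_TLD_CONFIDENCE = 0.5; // assumed value, chosen only for illustration

    // Illustrative sample of known TLDs, not a complete list.
    private static final Set<String> KNOWN_TLDS = Set.of("com", "org", "net");

    // Keeps the base confidence for a known TLD and lowers it otherwise,
    // instead of dropping the match outright.
    static double confidenceFor(String email) {
        int dot = email.lastIndexOf('.');
        String tld = dot < 0 ? "" : email.substring(dot + 1).toLowerCase(Locale.ROOT);
        return KNOWN_TLDS.contains(tld) ? BASE_CONFIDENCE : UNKNOWN_TLD_CONFIDENCE;
    }
}
```

With this approach the client still has to pick a confidence threshold to get the filtering behavior, which is part of why a dedicated boolean option seems clearer.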

Workaround: If you're getting a lot of noise due to a few unroutable TLDs, you could specify an ignoredPattern to exclude these, but this puts the effort on the client application to maintain these lists.
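As a rough illustration of the workaround, the regular expression below is the kind of pattern a client might maintain for a handful of noisy TLDs. The TLDs listed (`invalid`, `local`, `test`) and the class name are assumptions for the example, and how such a pattern is actually wired into an ignoredPattern setting is not shown here.

```java
import java.util.regex.Pattern;

// Hypothetical client-side pattern for ignoring addresses with specific unroutable TLDs.
final class IgnoredTldPatternSketch {

    // Case-insensitive match on whole addresses ending in one of the listed TLDs.
    static final Pattern IGNORED =
            Pattern.compile("(?i)\\S+@\\S+\\.(invalid|local|test)");

    // matches() requires the entire input to match the pattern.
    static boolean isIgnored(String email) {
        return IGNORED.matcher(email).matches();
    }
}
```

Every new noisy TLD means editing this alternation by hand, which is the maintenance effort the workaround pushes onto the client.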

jzonthemtn commented 2 months ago

Great suggestion. Added in PR #135. The TLD check defaults to false. The list of TLDs is easy enough to periodically update.