It would be cool to have an `EmailAddress.setOnlyValidTLDs` option that filters out email addresses whose top-level domain isn't valid (according to https://www.icann.org/resources/pages/tlds-2012-02-25-en).
Benefit: Goes beyond the regexes in detecting email addresses and brings in more domain-specific knowledge about what valid email addresses look like. False positives would be reduced, and it would be easier to draw attention to email addresses that are routable.
Similar features: This would be like the `CreditCard.setOnlyValidNumbers` option, where the Luhn check adds a sanity check on top of the regular expressions. `EmailAddress.setOnlyValidTLDs` would require parsing out the top-level domain after the regular expression has matched, doing a dictionary lookup, and dropping any matches where the TLD is unknown.
Compatibility: `EmailAddress.setOnlyValidTLDs` would be false by default so that the new filtering isn't a breaking change for any client applications. The TLD check is also expected to add some CPU overhead, which is another reason to leave it disabled unless the application requests it.
Risks: I'm not sure how often the TLD list actually changes or what kind of maintenance burden that would represent. Is ICANN the best source, or are multiple sources needed?
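To make the proposal concrete, here is a rough Java sketch of the post-match step described under Similar features: extract the TLD, look it up in a set, and drop unknown ones, with the check off by default. Everything in it is an assumption for illustration: the class `TldFilterSketch`, the methods `extractTld` and `filterMatches`, and the five-entry `KNOWN_TLDS` sample are hypothetical and are not the library's actual API or data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch; none of these names come from the library itself. */
final class TldFilterSketch {

    // Tiny illustrative sample only. A real list would be loaded from a
    // maintained source (for example the ICANN list) and refreshed over time.
    static final Set<String> KNOWN_TLDS = Set.of("com", "org", "net", "edu", "io");

    /** Returns the lower-cased text after the last dot of the domain, or "" if there is none. */
    static String extractTld(String email) {
        int at = email.lastIndexOf('@');
        int dot = email.lastIndexOf('.');
        // No '@', no dot after the '@', or a trailing dot: there is no TLD to check.
        if (at < 0 || dot < at || dot == email.length() - 1) {
            return "";
        }
        return email.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Applies the optional TLD check to strings the email regex has already matched. */
    static List<String> filterMatches(List<String> regexMatches, boolean onlyValidTLDs) {
        if (!onlyValidTLDs) {
            return regexMatches; // default (false): existing behaviour is unchanged
        }
        List<String> kept = new ArrayList<>();
        for (String match : regexMatches) {
            if (KNOWN_TLDS.contains(extractTld(match))) {
                kept.add(match);
            }
        }
        return kept;
    }
}
```

The lookup is a single hash-set probe per match, so the extra CPU cost should be small, but keeping the option off by default still avoids changing results for existing clients.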
Alternate paths: The validity of the TLD could also be reflected in the confidence of a span. Email address matches are currently `0.9`, without any cases where the filter reduces confidence. While the TLD check could be expressed via confidence, this would be less obvious and would not work as well as an orthogonal configuration option like `onlyValidTLDs`.
Workaround: If you're getting a lot of noise from a few unroutable TLDs, you could specify an `ignoredPattern` to exclude them, but this puts the effort on the client application to maintain these lists.
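As a rough illustration of the workaround, the sketch below uses a plain `java.util.regex.Pattern` to drop matches ending in a handful of TLDs. It does not show how `ignoredPattern` is actually configured in the library; the class `IgnoredPatternSketch`, the method `dropIgnored`, and the three example TLDs are assumptions chosen for the example.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

/** Hypothetical sketch of a client-maintained exclusion pattern. */
final class IgnoredPatternSketch {

    // The client has to keep this alternation up to date by hand (example values).
    static final Pattern IGNORED =
            Pattern.compile(".*\\.(?:invalid|local|test)", Pattern.CASE_INSENSITIVE);

    /** Removes every match whose whole text fits the ignored pattern. */
    static List<String> dropIgnored(List<String> matches) {
        return matches.stream()
                .filter(m -> !IGNORED.matcher(m).matches())
                .collect(Collectors.toList());
    }
}
```

This is a deny-list, so it only removes the TLDs someone has thought to list, whereas the proposed option would work from an allow-list of known TLDs.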