Closed jayvdb closed 5 years ago
Two more partially obfuscated
//medium.com/@username
username@blogspot
Technically those are valid emails. Maybe we can have two extraction modes. One for technically correct emails by the RFC and one for sane emails. What do you think?
Ya, depends on what you call an 'email address'.
@blogspot
is not a registered domain and thus cannot be an email address, except on an intranet, according to RFC 5321 (SMTP), which states: "This makes the requirement, described in more detail below, that only fully-qualified domain names appear in SMTP transactions on the public Internet, particularly important where top-level domains are involved."
Note however that ..@-webkit-keyframes
is illegal because the domain part starts with a hyphen.
So are you suggesting we try to resolve domains? I was thinking we require a period in the host part and remove special characters like {
and !
for the user name in the "sane" extraction mode.
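To make the "sane" mode idea concrete, here is a minimal sketch in Python, assuming a conservative allow-list for the user name and a required period in the host part. The names `SANE_LOCAL` and `is_sane` are hypothetical and are not part of the library discussed here.

```python
import re

# Assumption: a "sane" user name uses only letters, digits, and . _ + -
# Characters such as { and ! are RFC-legal but rejected in this mode.
SANE_LOCAL = re.compile(r"^[A-Za-z0-9._+-]+$")


def is_sane(candidate: str) -> bool:
    """Return True if candidate looks like an everyday email address."""
    # Split on the last "@" so the host part is everything after it.
    local, sep, host = candidate.rpartition("@")
    if not sep or not local or not host:
        return False
    if not SANE_LOCAL.match(local):
        return False
    # Require a period inside the host part (not just a trailing dot).
    return "." in host.strip(".")
```

With this filter, `username@blogspot` and `//medium.com/@username` are both rejected, but so is any legal address whose user name contains `{` or `!`.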
Resolving domain names will have to use some known server like 8.8.8.8, otherwise the user's ISP might resolve unknown domains to their internal search engine.
The list of valid top level domains is easy to obtain, and then requiring a dot for anything else will eliminate most invalid results.
I wouldn't use a tool option which removed results which had special characters in them; they are legal, even though impractical, and in any large batch there will be one used - there is always one person who has pushed the boundaries of acceptability.
Validating the domain is an easier way to remove the problems in this issue. The domain validation logic should be in a separate library IMO, as that algorithm will likely be modified more frequently than this library.
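As a rough illustration of the domain check described above, here is a sketch in Python. It assumes a TLD list in the IANA text format (one TLD per line, `#` for comment lines). The function names and the three-entry `SAMPLE_TLDS` text are made up for the example; a real check would load the full list.

```python
# Hand-written sample in the IANA list format; NOT the real file contents.
SAMPLE_TLDS = "# illustrative sample only\nCOM\nORG\nNET\n"


def parse_tld_list(text: str) -> set:
    """Parse one-TLD-per-line text, skipping blank and '#' comment lines."""
    tlds = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            tlds.add(line.lower())
    return tlds


def is_plausible_domain(domain: str, tlds: set) -> bool:
    """Check shape only: dotted, no leading/trailing hyphens, known TLD."""
    labels = domain.lower().rstrip(".").split(".")
    if len(labels) < 2:
        return False  # no dot, e.g. "blogspot"
    for label in labels:
        if not label or len(label) > 63:
            return False
        if label.startswith("-") or label.endswith("-"):
            return False  # e.g. "-webkit-keyframes"
    return labels[-1] in tlds
```

This only checks the form of the domain against the list; it does not resolve anything, so an unregistered name under a valid TLD still passes.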
I pushed version 0.2, which should only match against TLDs from http://data.iana.org/TLD/tlds-alpha-by-domain.txt. Let me know how that works out for you.
In a batch of about 2000 emails embedded in HTML, here are the results which were invalid emails: