kichik / email-scraper

Simple Python library to scrape email addresses from HTML
MIT License

Incorrect matches #2

Closed. jayvdb closed this issue 5 years ago

jayvdb commented 6 years ago

In a batch of about 2000 emails embedded in HTML, the following results were not valid email addresses:

20px!important}}@media
//js.intercomcdn.com/images/messenger-close@2x.c1cb8613.png
ease!important}@-webkit-keyframes
italic}@font-face
0}}@keyframes

jayvdb commented 6 years ago

Two more, partially obfuscated:

//medium.com/@username
username@blogspot

kichik commented 6 years ago

Technically those are valid emails. Maybe we can have two extraction modes. One for technically correct emails by the RFC and one for sane emails. What do you think?

jayvdb commented 6 years ago

Ya, depends on what you call an 'email address'.

@blogspot is not a registered domain and thus cannot be an email address, except on an intranet, according to RFC 5321 (SMTP), which states: "This makes the requirement, described in more detail below, that only fully-qualified domain names appear in SMTP transactions on the public Internet, particularly important where top-level domains are involved."

Note, however, that ..@-webkit-keyframes is invalid because its domain part starts with a hyphen.
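The hyphen rule can be checked per label. The following is a minimal sketch, not code from this library: `has_valid_labels` is a hypothetical helper name, and it assumes the usual letters-digits-hyphen convention for host names (1 to 63 characters per label, no leading or trailing hyphen).

```python
import re

# Assumed convention: each label is 1-63 letters, digits, or hyphens,
# and may not begin or end with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def has_valid_labels(domain: str) -> bool:
    """Return True if every dot-separated label of `domain` is well formed.

    Hypothetical helper for illustration; it checks syntax only and does
    not tell you whether the domain is registered.
    """
    labels = domain.rstrip(".").split(".")
    return all(_LABEL.fullmatch(label) is not None for label in labels)
```

With this check, `-webkit-keyframes` is rejected, while `blogspot` still passes because it is syntactically fine and only fails the fully-qualified-domain requirement.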

kichik commented 6 years ago

So are you suggesting we try to resolve domains? I was thinking we require a period in the host part and remove special characters like { and ! from the user name in the "sane" extraction mode.
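Roughly what I have in mind, as a sketch only: `is_sane` and the allowed character set are assumptions for illustration, not an existing API in this library, and the sketch rejects candidates with special characters rather than stripping them.

```python
import re

# Assumed "sane" user-name alphabet: letters, digits, dot, underscore,
# plus, and hyphen. Anything else (such as { } ! /) fails the check.
_SANE_LOCAL = re.compile(r"[A-Za-z0-9._+-]+")


def is_sane(candidate: str) -> bool:
    """Hypothetical 'sane mode' filter for an already-extracted candidate."""
    local, sep, host = candidate.rpartition("@")
    if not sep or not local or not host:
        return False
    # Require at least one period inside the host part.
    if "." not in host.strip("."):
        return False
    return _SANE_LOCAL.fullmatch(local) is not None
```

All seven strings reported above fail this filter: five because the host has no period, and the `messenger-close@2x...` one because of the slashes and colon-free path characters in the user name.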

Resolving domain names would have to use a known server such as 8.8.8.8; otherwise the user's ISP might resolve unknown domains to its internal search engine.

jayvdb commented 6 years ago

The list of valid top level domains is easy to obtain, and then requiring a dot for anything else will eliminate most invalid results.
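A minimal sketch of that check, assuming the TLD list has already been loaded into a set. `has_known_tld` and `SAMPLE_TLDS` are hypothetical names, and the three-entry set is only a stand-in for the real list.

```python
from typing import AbstractSet

# Tiny stand-in for illustration; a real run would load the full list.
SAMPLE_TLDS = frozenset({"com", "org", "net"})


def has_known_tld(domain: str, tlds: AbstractSet[str]) -> bool:
    """Return True if `domain` has a dot and its last label is a known TLD."""
    labels = domain.rstrip(".").lower().split(".")
    return len(labels) >= 2 and labels[-1] in tlds
```

Against the examples in this issue, `2x.c1cb8613.png` fails because `png` is not in the set, and `media`, `font-face`, `keyframes`, and `blogspot` fail because they contain no dot.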

I wouldn't use a tool option that removed results containing special characters; they are legal, even though impractical, and in any large batch there will be one in use. There is always one person who has pushed the boundaries of acceptability.

Validating the domain is an easier way to remove the problems in this issue. The domain validation logic should be in a separate library IMO, as that algorithm will likely be modified more frequently than this library.

kichik commented 6 years ago

I pushed version 0.2, which should only match against TLDs from http://data.iana.org/TLD/tlds-alpha-by-domain.txt. Let me know how that works out for you.
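For anyone who wants to inspect that list directly, here is a sketch of reading it. This is not necessarily how version 0.2 does it; `parse_tld_list` and `fetch_tld_list` are hypothetical names. It relies on the file's format: a leading `#` comment line followed by one upper-case TLD per line.

```python
from urllib.request import urlopen

TLD_LIST_URL = "http://data.iana.org/TLD/tlds-alpha-by-domain.txt"


def parse_tld_list(text: str) -> frozenset:
    """Parse the IANA list: skip '#' comments and blanks, lower-case the rest."""
    return frozenset(
        line.strip().lower()
        for line in text.splitlines()
        if line.strip() and not line.startswith("#")
    )


def fetch_tld_list(url: str = TLD_LIST_URL, timeout: float = 10.0) -> frozenset:
    """Download and parse the list (needs network access)."""
    with urlopen(url, timeout=timeout) as response:
        return parse_tld_list(response.read().decode("ascii"))
```

The parsing step is kept separate from the download so it can be run against a cached copy of the file.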