Open mitchwander opened 11 years ago
Yeah, that regexp could certainly be improved to reject sequences such as "...foo".
If someone else wants to update the patterns, the current ones are listed below for future reference (they live in KeywordSearchListsAbstract.java):
//email
"[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}"
//phone number
"[(]{0,1}\\d\\d\\d[)]{0,1}[\\.-]\\d\\d\\d[\\.-]\\d\\d\\d\\d"
//IP address
"(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
//URL
"((((ht|f)tp(s?))\\://)|www\\.)[a-zA-Z0-9\\-\\.]+\\.([a-zA-Z]{2,5})(\\:[0-9]+)*(/($|[a-zA-Z0-9\\.\\,\\;\\?\\'\\\\+&%\\$#\\=~_\\-]+))*"
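As a rough sketch of one way the email pattern might be tightened (not a tested replacement for what is in KeywordSearchListsAbstract.java): the version below only allows dots between non-empty runs of characters, which rules out the "...foo" style hits. The class name EmailHitFilter, the method names, the CASE_INSENSITIVE flag, and the tiny SAMPLE_TLDS set are all assumptions made for illustration. The sample strings are the ones reported in this issue.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the actual codebase.
class EmailHitFilter {

    // Email pattern as quoted in this thread. Applying it case-insensitively
    // is an assumption about how the keyword search uses it.
    static final Pattern CURRENT = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // Candidate pattern: same character classes, but a dot may only appear
    // between non-empty atoms (local part) or labels (domain).
    static final Pattern STRICTER = Pattern.compile(
            "[A-Z0-9_%-]+(\\.[A-Z0-9_%-]+)*"
            + "@[A-Z0-9]([A-Z0-9-]*[A-Z0-9])?"
            + "(\\.[A-Z0-9]([A-Z0-9-]*[A-Z0-9])?)*"
            + "\\.[A-Z]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // Illustrative subset only; a real check would need a maintained TLD list.
    static final Set<String> SAMPLE_TLDS =
            Set.of("com", "org", "net", "edu", "gov", "mil");

    static boolean matchesCurrent(String s) {
        return CURRENT.matcher(s).matches();
    }

    static boolean matchesStricter(String s) {
        return STRICTER.matcher(s).matches();
    }

    // Stricter syntax check plus a lookup of the final label in SAMPLE_TLDS.
    static boolean isPlausible(String s) {
        if (!matchesStricter(s)) {
            return false;
        }
        String tld = s.substring(s.lastIndexOf('.') + 1).toLowerCase(Locale.ROOT);
        return SAMPLE_TLDS.contains(tld);
    }

    public static void main(String[] args) {
        List<String> samples = List.of(
                "m0m@mPm.iP", "W32.Yaha.K@mm.enc",
                "........@....Dht", "iVq0xhg@p.yRg",
                "webmaster@example.org");
        for (String s : samples) {
            System.out.printf("%-22s current=%b stricter=%b plausible=%b%n",
                    s, matchesCurrent(s), matchesStricter(s), isPlausible(s));
        }
    }
}
```

The stricter pattern on its own only removes the dot-run hit ("........@....Dht"). The other three reported strings are syntactically well-formed, so rejecting them needs some extra signal such as the top-level-domain lookup in isPlausible, with the trade-off that any address whose TLD is missing from the list is dropped.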
I'm wondering if it is possible to refine the regex used to identify "Email Addresses" search term hits.
I ran the "Email Addresses" search under "Search Term Hits" against this publicly available case from NPS: http://digitalcorpora.org/corp/nps/scenarios/2011-nps-1weapondeletion/nps-2011-scenario1.E01
Many of the results were values that are unlikely to be legitimate email addresses. For example:
m0m@mPm.iP
W32.Yaha.K@mm.enc
........@....Dht
iVq0xhg@p.yRg
As a second suggestion, though probably harder to solve: Email Addresses currently reports every email address it finds, including those on web pages that the user visited. Addresses that appear in webmail, an email client, or a contacts list are probably of greater interest than ones scraped from a public website (such as the info or webmaster addresses).