Refining "Email Addresses" search term hits #303

sleuthkit / autopsy

Autopsy® is a digital forensics platform and graphical interface to The Sleuth Kit® and other digital forensics tools. It can be used by law enforcement, military, and corporate examiners to investigate what happened on a computer. You can even use it to recover photos from your camera's memory card.

I'm wondering whether the regex used to identify "Email Addresses" search term hits could be refined.

I ran the "Email Addresses" search under "Search Term Hits" against this publicly available case from NPS: http://digitalcorpora.org/corp/nps/scenarios/2011-nps-1weapondeletion/nps-2011-scenario1.E01

Many of the results were values that are unlikely to be legitimate email addresses. For example:

m0m@mPm.iP
W32.Yaha.K@mm.enc
........@....Dht
iVq0xhg@p.yRg

As a second suggestion, though probably harder to solve: "Email Addresses" reports every email address it finds, including those on web pages that the user visited. Addresses used in webmail, an email client, or a contacts list are probably of greater interest than addresses scraped from a public website (such as info or webmaster addresses).

//email
"[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}"
//phone number
"[(]{0,1}\\d\\d\\d[)]{0,1}[\\.-]\\d\\d\\d[\\.-]\\d\\d\\d\\d"
//IP address
"(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"
//URL
"((((ht|f)tp(s?))\\://)|www\\.)[a-zA-Z0-9\\-\\.]+\\.([a-zA-Z]{2,5})(\\:[0-9]+)*(/($|[a-zA-Z0-9\\.\\,\\;\\?\\'\\\\+&%\\$#\\=~_\\-]+))*"
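To illustrate why the email pattern above accepts the four example hits, and what one possible refinement might look like, here is a rough Java sketch. It is not Autopsy code. The class name `EmailHitFilter`, the `STRICTER` pattern, and the `SAMPLE_TLDS` list are all hypothetical, and case-insensitive matching is an assumption based on the mixed-case hits. The stricter pattern only rejects candidates with leading, trailing, or repeated dots, so the sketch also checks the top-level domain against a tiny illustrative list. A real filter would need a complete, maintained TLD list.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailHitFilter {
    // The email pattern quoted above. Case-insensitive matching is an assumption.
    static final Pattern ORIGINAL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}", Pattern.CASE_INSENSITIVE);

    // Hypothetical stricter pattern: every dot-separated part of the local part
    // and every domain label must start and end with a letter or digit, so dots
    // cannot lead, trail, or repeat. Group 1 captures the top-level domain.
    static final Pattern STRICTER = Pattern.compile(
            "[A-Z0-9](?:[A-Z0-9_%-]*[A-Z0-9])?(?:\\.[A-Z0-9](?:[A-Z0-9_%-]*[A-Z0-9])?)*"
            + "@(?:[A-Z0-9](?:[A-Z0-9-]*[A-Z0-9])?\\.)+([A-Z]{2,})",
            Pattern.CASE_INSENSITIVE);

    // Illustrative TLD list only (an assumption); far from complete.
    static final Set<String> SAMPLE_TLDS = Set.of("com", "org", "net", "edu", "gov", "mil");

    // True if the whole candidate matches the pattern quoted in this issue.
    static boolean matchesOriginal(String candidate) {
        return ORIGINAL.matcher(candidate).matches();
    }

    // True if the candidate passes the stricter syntax check and its TLD is listed.
    static boolean isPlausible(String candidate) {
        Matcher m = STRICTER.matcher(candidate);
        return m.matches() && SAMPLE_TLDS.contains(m.group(1).toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        String[] hits = {
            "m0m@mPm.iP", "W32.Yaha.K@mm.enc", "........@....Dht",
            "iVq0xhg@p.yRg", "webmaster@example.com"
        };
        for (String hit : hits) {
            System.out.println(hit + " original=" + matchesOriginal(hit)
                    + " plausible=" + isPlausible(hit));
        }
    }
}
```

Three of the four example hits are syntactically reasonable addresses, so tightening the regex alone would not remove them. Some check outside the pattern, such as the TLD lookup sketched here, seems to be needed for those.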