sleuthkit / autopsy

Autopsy® is a digital forensics platform and graphical interface to The Sleuth Kit® and other digital forensics tools. It can be used by law enforcement, military, and corporate examiners to investigate what happened on a computer. You can even use it to recover photos from your camera's memory card.
http://www.sleuthkit.org/autopsy/

Refining "Email Addresses" search term hits #303

Open mitchwander opened 11 years ago

mitchwander commented 11 years ago

I'm wondering if it is possible to refine the regex used to identify "Email Addresses" search term hits.

I ran the "Email Addresses" function in the "Search Term Hits" using this publicly available case from NPS: http://digitalcorpora.org/corp/nps/scenarios/2011-nps-1weapondeletion/nps-2011-scenario1.E01

Many of the results were values that are unlikely to be legitimate email addresses. For example:

m0m@mPm.iP
W32.Yaha.K@mm.enc
........@....Dht
iVq0xhg@p.yRg

As a second suggestion, though probably harder to solve: "Email Addresses" reports literally every email address it finds, including those on web pages that the user visited. Addresses used in webmail, an email client, or a contacts list are probably of greater interest than ones scraped from a public website (such as the info or webmaster addresses).

bcarrier commented 11 years ago

Yeah, that regexp could certainly be improved a bit to reject runs of dots such as "...foo".

If someone else wants to update the built-in regular expressions, they are listed below (they live in KeywordSearchListsAbstract.java, for future reference):

//email
"[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}"

//phone number
"[(]{0,1}\\d\\d\\d[)]{0,1}[\\.-]\\d\\d\\d[\\.-]\\d\\d\\d\\d"

//IP address
"(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])"

//URL
"((((ht|f)tp(s?))\\://)|www\\.)[a-zA-Z0-9\\-\\.]+\\.([a-zA-Z]{2,5})(\\:[0-9]+)*(/($|[a-zA-Z0-9\\.\\,\\;\\?\\'\\\\+&%\\$#\\=~_\\-]+))*"
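
As a starting point for the email case, here is a minimal sketch of one possible tightening. It assumes the pattern is applied case-insensitively (the sample hits contain lowercase letters) and uses java.util.regex purely for illustration; the engine Autopsy actually runs these expressions through may behave differently. The class name, the REFINED pattern, and the helper methods are hypothetical and are not taken from the Autopsy source. The only change is that dots must sit between non-empty labels, so runs of dots no longer match.

```java
import java.util.regex.Pattern;

final class EmailRegexSketch {
    // Pattern quoted in the thread; case-insensitive matching is an assumption.
    static final Pattern ORIGINAL = Pattern.compile(
            "[A-Z0-9._%-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical refinement: a dot is only allowed between two non-empty
    // labels, in both the local part and the domain, so "...." cannot match.
    static final Pattern REFINED = Pattern.compile(
            "[A-Z0-9_%-]+(\\.[A-Z0-9_%-]+)*@[A-Z0-9-]+(\\.[A-Z0-9-]+)*\\.[A-Z]{2,4}",
            Pattern.CASE_INSENSITIVE);

    // True when the whole string is accepted by the original pattern.
    static boolean matchesOriginal(String s) {
        return ORIGINAL.matcher(s).matches();
    }

    // True when the whole string is accepted by the tightened pattern.
    static boolean matchesRefined(String s) {
        return REFINED.matcher(s).matches();
    }

    public static void main(String[] args) {
        // Sample hits reported in this issue, plus one ordinary address.
        String[] samples = {
            "m0m@mPm.iP",
            "W32.Yaha.K@mm.enc",
            "........@....Dht",
            "iVq0xhg@p.yRg",
            "user@example.com"
        };
        for (String s : samples) {
            System.out.printf("%-20s original=%b refined=%b%n",
                    s, matchesOriginal(s), matchesRefined(s));
        }
    }
}
```

Note that this only removes the dot-run hit. The other three reported values are syntactically well-formed addresses, so they still match; filtering those out would need something beyond a syntax check.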