sleuthkit / autopsy

Autopsy® is a digital forensics platform and graphical interface to The Sleuth Kit® and other digital forensics tools. It can be used by law enforcement, military, and corporate examiners to investigate what happened on a computer. You can even use it to recover photos from your camera's memory card.
http://www.sleuthkit.org/autopsy/

Keyword search regex not working in 4.5.0 #3238

Open CarlosLannister opened 6 years ago

CarlosLannister commented 6 years ago

I just tried the keyword search module with two regexes:

^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$

^[5KL][1-9A-HJ-NP-Za-km-z]{50,51}$

Autopsy 4.1.1 shows results, but in 4.5.0 it is not working, using the same evidence of course.

CarlosLannister commented 6 years ago

Version 4.3 works too. Versions 4.4 and above fail.

rcordovano commented 6 years ago

@CarlosLannister, changes in how we do keyword search mean that Java regular expressions need to be used. See https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html for one explanation of the syntax. I personally am not expert enough in the syntax to tell at a glance whether your expressions should work, but I wanted to let you know about the change, at the very least.

rcordovano commented 6 years ago

@CarlosLannister, I had things backward in my previous comment. We have switched from Java regex syntax to the syntax that is supported by Lucene.

Elastic has a nice tutorial: https://www.elastic.co/guide/en/elasticsearch/reference/1.6/query-dsl-regexp-query.html.

CarlosLannister commented 6 years ago

How is an Elasticsearch tutorial related to Apache Lucene? I just tried escaping the special characters with \, but it is not working.

^\[5KL\]\[1-9A-HJ-NP-Za-km-z\]\{50,51\}$

^\[13\]\[a-km-zA-HJ-NP-Z1-9\]\{25,34\}$

The project gets freeze.
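A short sketch of why the escaped versions cannot find addresses: a backslash turns each metacharacter into a literal, so the pattern looks for the text "[13][a-km-z...]{25,34}" itself. This uses Python's re module purely for illustration; Autopsy uses Lucene's regex syntax, and the assumption here is only that backslash escaping has the same effect there.

```python
import re

# The user's first pattern without anchors, and the escaped variant from above.
unescaped = r"[13][a-km-zA-HJ-NP-Z1-9]{25,34}"
escaped = r"\[13\]\[a-km-zA-HJ-NP-Z1-9\]\{25,34\}"

address = "1F1tAaz5x1HUXrCNLbtMDqcw6o5GNn4xqX"
literal_text = "[13][a-km-zA-HJ-NP-Z1-9]{25,34}"

# The unescaped pattern describes the address.
print(re.fullmatch(unescaped, address) is not None)     # True

# The escaped pattern no longer describes an address at all...
print(re.fullmatch(escaped, address) is not None)       # False

# ...it only matches the pattern text written out literally.
print(re.fullmatch(escaped, literal_text) is not None)  # True
```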

rcordovano commented 6 years ago

@CarlosLannister, the linked Elastic document provides guidance on Lucene regex syntax; both Solr and Elastic are built on top of Lucene.

I interpreted "not working" to mean that your searches are returning no results. "The project gets freeze" suggests other problems. Are you saying that the GUI is frozen? That the search takes a long time to complete, or seems to never complete?

Are you searching during ingest, or doing an ad hoc search using the drop downs in the upper right hand corner?

bcarrier commented 6 years ago

What if you remove the ^ and $? I modified your search to be less restrictive and got fewer results than I had expected. I then removed the ^ and $ and got far more.

Did you include the ^ and $ because you wanted to find files where these items were the only text on the line?

bcarrier commented 6 years ago

@CarlosLannister, we looked into this. Autopsy 4.5 changed how we index and search for regular expressions. At a high level, we used to apply regular expressions to each term/word, and now we apply them to one long "string" that is the text of the document. So, the notion of anchoring (i.e. ^ and $) has changed.
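The difference between the two models can be sketched roughly as follows. This is an illustration in Python's re module, not Autopsy or Lucene code; the function names are hypothetical, and splitting on whitespace is an assumed stand-in for however the index actually breaks text into terms.

```python
import re

# The bitcoin-address pattern from this thread, without anchors.
PATTERN = r"[13][a-km-zA-HJ-NP-Z1-9]{25,34}"

def match_per_term(text):
    """Old model (sketch): the regex must match an entire term."""
    return [term for term in text.split() if re.fullmatch(PATTERN, term)]

def match_whole_text(text):
    """New model (sketch): the regex must match the entire document text."""
    return re.fullmatch(PATTERN, text) is not None

doc = "wallet 1F1tAaz5x1HUXrCNLbtMDqcw6o5GNn4xqX backup"

# Per term, the address is found because it is a term on its own.
print(match_per_term(doc))

# Against the whole text, the same pattern fails unless the document
# contains nothing but the address.
print(match_whole_text(doc))
print(match_whole_text("1F1tAaz5x1HUXrCNLbtMDqcw6o5GNn4xqX"))
```

Under the per-term model each term already starts and ends at a word boundary, which is why ^ and $ could behave like word boundaries there.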

Behind the scenes, we are surrounding the regular expression terms that you enter with ".*" (unless you also specified .*). So, your search turned into .*^[terms]$.*. I'm not sure that is a valid regexp.
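A minimal sketch of the wrapping step described here, assuming the wrapper is ".*" (as named later in this thread). The helper wrap_query and its skip_when_anchored flag are hypothetical, not Autopsy's code; the flag mimics the later change of not adding ".*" when ^ or $ is present, and the check for a leading ^ or trailing $ is a guess at how that might be detected.

```python
def wrap_query(regex: str, skip_when_anchored: bool = False) -> str:
    """Hypothetical helper: surround a user regex with ".*" for substring search."""
    anchored = regex.startswith("^") or regex.endswith("$")
    if skip_when_anchored and anchored:
        # Later behavior mentioned in the thread: leave anchored queries alone.
        return regex
    # Do not add ".*" on a side where the user already supplied it.
    prefix = "" if regex.startswith(".*") else ".*"
    suffix = "" if regex.endswith(".*") else ".*"
    return prefix + regex + suffix

query = "^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$"

print(wrap_query(query))
# .*^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$.*

print(wrap_query(query, skip_when_anchored=True))
# ^[13][a-km-zA-HJ-NP-Z1-9]{25,34}$
```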

What was the intention of your search? Were you looking for words that meet your criteria and you were using the anchors to define word boundaries?

Our actions from this are:

CarlosLannister commented 6 years ago

Hi, sorry for the delay in answering. My intention was to search for bitcoin addresses in files, like this one: 1F1tAaz5x1HUXrCNLbtMDqcw6o5GNn4xqX.

I included ^ and $ because I knew of a file whose text contains only a bitcoin address; it was just a test to find that file.

In future searches, I am going to try without ^ and $ and see if it works.

Thanks for all.

infoman60 commented 6 years ago

Hi, I'm having similar regex problems trying to specify word boundaries. I have a working regular expression for UK National Insurance numbers, but I need to specify surrounding word boundaries to reduce false positives. Everything I try in order to add a word boundary breaks the regex. Many thanks

APriestman commented 6 years ago

As @bcarrier said above, the way we do indexing changed in the more recent versions of Autopsy which is why ^ and $ no longer work as word boundaries. At the moment my best suggestion is to manually put a word boundary around your regex (a small version would be [ \.\-](regex)[ \.\-] - note that the predefined character classes do not work in Autopsy), although this may have issues when the search string is at the beginning or end of a file. We did stop putting the ".*" on when ^ or $ is present, but the boundary characters still aren't working as expected.
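The manual boundary suggestion, and its caveat at the start or end of a file, can be sketched like this. Python's re module is used only to illustrate the idea; Autopsy evaluates the expression with Lucene's syntax, so behavior there may differ. The helper name with_manual_boundaries is hypothetical, and the bitcoin-address pattern from earlier in the thread stands in for "regex".

```python
import re

# Boundary class from the suggestion above: space, dot, or hyphen.
BOUNDARY = r"[ \.\-]"

def with_manual_boundaries(regex: str) -> str:
    """Hypothetical helper: wrap a regex in explicit boundary characters."""
    return BOUNDARY + "(" + regex + ")" + BOUNDARY

ADDRESS = r"[13][a-km-zA-HJ-NP-Z1-9]{25,34}"
bounded = re.compile(with_manual_boundaries(ADDRESS))

middle = "paid to 1F1tAaz5x1HUXrCNLbtMDqcw6o5GNn4xqX today"
at_start = "1F1tAaz5x1HUXrCNLbtMDqcw6o5GNn4xqX today"

# With a boundary character on both sides, the address is found.
hit = bounded.search(middle)
print(hit.group(1) if hit else None)

# At the very start of the text there is no leading boundary character,
# so the same address is missed.
print(bounded.search(at_start))
```

The second case is the beginning-of-file issue mentioned above: a required boundary character has nothing to match when the hit sits at the first or last position of the text.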