google / codesearch

Fast, indexed regexp search over large file trees
http://swtch.com/~rsc/regexp/regexp4.html
BSD 3-Clause "New" or "Revised" License

prefix/suffix lists and question-marks don't play nicely together #18

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

While debugging the trigram-question-mark issue, I got stuck on something that 
may actually be a fundamental design limitation (or feature): moving to 
prefix/suffix lists can cause the list of trigrams in the query to shrink 
considerably.

bash$ ./csearch -verbose 'foo_(bar)?zot' >/dev/null
2012/03/07 22:18:48 query: "foo" "oo_" "zot" ("_zo" "o_z")|("arz" "rzo")
2012/03/07 22:18:48 post query identified 0 possible files
bash$ ./csearch -verbose 'foo_(bar_)?zot' >/dev/null
2012/03/07 22:18:53 query: "foo" "oo_" "zot"
2012/03/07 22:18:53 post query identified 0 possible files

In the first case, "bar" is only three characters, so it stays as an exact 
trigram and is used to construct the arz/rzo entries.  In the second case, 
adding the underscore brings it to four characters and it becomes a 
prefix/suffix list.  It then no longer provides any trigram info, because the 
optional group also contributes the empty string, and every other entry in the 
prefix and suffix lists is dropped as "redundant" with it.  ("" is a prefix of 
"ba").
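To make the redundancy argument concrete, here is a minimal sketch in Go. It assumes the pruning rule is "drop any prefix that starts with another prefix already in the set"; `dropRedundantPrefixes` is a hypothetical helper written for illustration, not csearch's actual code.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dropRedundantPrefixes is a hypothetical helper (not taken from csearch).
// It keeps a prefix only if no shorter prefix in the set already covers it.
func dropRedundantPrefixes(set []string) []string {
	sorted := append([]string(nil), set...)
	sort.Strings(sorted)
	var kept []string
	for _, s := range sorted {
		// After sorting, any prefix that covers s is the last kept entry.
		if len(kept) > 0 && strings.HasPrefix(s, kept[len(kept)-1]) {
			continue
		}
		kept = append(kept, s)
	}
	return kept
}

func main() {
	// Without the empty string, both prefixes survive.
	fmt.Printf("%q\n", dropRedundantPrefixes([]string{"ba", "fo"}))
	// With the empty string, everything else counts as redundant.
	fmt.Printf("%q\n", dropRedundantPrefixes([]string{"", "ba"}))
}
```

Under that assumption, a set containing "" always collapses to just "", which carries no trigram information.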

I'm not sure whether this is a bug.  I.e., _should_ we be able to transform 
prefix/suffix lists into AND/OR sets of trigrams in this case?
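One way to read the question: the regexp foo_(bar_)?zot matches only two literal strings, so a query could in principle OR together the AND of each string's trigrams. The sketch below, in Go, just lists those trigrams. The `trigrams` helper is hypothetical and is not how csearch builds its query.

```go
package main

import "fmt"

// trigrams returns the distinct 3-byte substrings of s in order of first
// appearance. Hypothetical helper for illustration only.
func trigrams(s string) []string {
	seen := make(map[string]bool)
	var out []string
	for i := 0; i+3 <= len(s); i++ {
		t := s[i : i+3]
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	// The two literal strings matched by foo_(bar_)?zot.
	for _, alt := range []string{"foo_zot", "foo_bar_zot"} {
		fmt.Printf("%s: %q\n", alt, trigrams(alt))
	}
}
```

Each alternative yields trigrams beyond "foo", "oo_" and "zot" (for example "o_z" for the short form and "bar" for the long one), which is the information the second query above no longer contains.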

Original issue reported on code.google.com by dgryski on 7 Mar 2012 at 10:08