google / codesearch

Fast, indexed regexp search over large file trees
http://swtch.com/~rsc/regexp/regexp4.html
BSD 3-Clause "New" or "Revised" License
3.66k stars 375 forks source link

Adding more trigrams makes post query worse #16

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
bash$ csearch -i -verbose '%%0.*sprintf'
2012/02/29 16:02:11 query: "%%0" 
("SPR"|"SPr"|"SpR"|"Spr"|"sPR"|"sPr"|"spR"|"spr")|("\xbfPR" "ſP")|("\xbfPr" 
"ſP")|("\xbfpR" "ſp")|("\xbfpr" "ſp") 
("PRI"|"PRi"|"PrI"|"Pri"|"pRI"|"pRi"|"prI"|"pri") 
("RIN"|"RIn"|"RiN"|"Rin"|"rIN"|"rIn"|"riN"|"rin") 
("INT"|"INt"|"InT"|"Int"|"iNT"|"iNt"|"inT"|"int") 
("NTF"|"NTf"|"NtF"|"Ntf"|"nTF"|"nTf"|"ntF"|"ntf")
2012/02/29 16:02:11 post query identified 3295 possible files
bash$ csearch -i -verbose '%%0' >/dev/null
2012/02/29 16:02:24 query: "%%0"
2012/02/29 16:02:24 post query identified 9 possible files

Since only 9 files match '%%0', I would expect no more than 9 possible files 
from the post query when adding '.*sprintf' to the query string.  It looks like 
something is ignoring the 'restrict' list somewhere.

I'm still working on a minimal test case.  I'll look into this more tonight or 
so.

Original issue reported on code.google.com by dgryski on 29 Feb 2012 at 3:10

GoogleCodeExporter commented 9 years ago
postReader init() was not initializing restrict correctly, so all r.next() 
calls returned the full list list and not the restricted set.

Patch attached.

Original comment by dgryski on 29 Feb 2012 at 3:43

Attachments:

GoogleCodeExporter commented 9 years ago
With this fix, the 599 files matching 'hello world' in linux-3.1.3 example ( 
http://swtch.com/~rsc/regexp/regexp4.html ) is reduced to 32:

/tmp dgryski$ CSEARCHINDEX=/tmp/.csearch-linux csearch -verbose -i -c 'hello 
world' 
2012/03/07 11:45:20 query: ("HEL"|"HEl"|"HeL"|"Hel"|"hEL"|"hEl"|"heL"|"hel") 
("ELL"|"ELl"|"ElL"|"Ell"|"eLL"|"eLl"|"elL"|"ell") 
("LLO"|"LLo"|"LlO"|"Llo"|"lLO"|"lLo"|"llO"|"llo") ("LO "|"Lo "|"lO "|"lo ") ("O 
W"|"O w"|"o W"|"o w") (" WO"|" Wo"|" wO"|" wo") 
("WOR"|"WOr"|"WoR"|"Wor"|"wOR"|"wOr"|"woR"|"wor") 
("ORL"|"ORl"|"OrL"|"Orl"|"oRL"|"oRl"|"orL"|"orl") 
("RLD"|"RLd"|"RlD"|"Rld"|"rLD"|"rLd"|"rlD"|"rld")
2012/03/07 11:45:20 post query identified 32 possible files
/tmp/linux-3.1.3/Documentation/filesystems/ramfs-rootfs-initramfs.txt: 3
/tmp/linux-3.1.3/Documentation/java.txt: 1
/tmp/linux-3.1.3/Documentation/s390/Debugging390.txt: 3
/tmp/linux-3.1.3/arch/blackfin/kernel/kgdb_test.c: 1
/tmp/linux-3.1.3/arch/frv/kernel/gdb-stub.c: 1
/tmp/linux-3.1.3/arch/mn10300/kernel/gdb-stub.c: 1
/tmp/linux-3.1.3/arch/powerpc/platforms/powermac/udbg_scc.c: 1
/tmp/linux-3.1.3/drivers/media/video/msp3400-driver.c: 1
/tmp/linux-3.1.3/drivers/net/sfc/selftest.c: 1
/tmp/linux-3.1.3/samples/kdb/kdb_hello.c: 2

Original comment by dgryski on 7 Mar 2012 at 10:48

GoogleCodeExporter commented 9 years ago
This issue was closed by revision dbeef1612f35.

Original comment by dgryski on 2 May 2012 at 8:02