google / zoekt

Fast trigram based code search

Missing files and results #52

Closed nikhilkalige closed 6 years ago

nikhilkalige commented 6 years ago

I am unable to see all search results for a string that I was trying to search. If I just look at the index file and grep it, I get more results than what I am seeing in the webpage.

I also found that certain files are never indexed. File search for these shows zero results, and these files do not show up in the index file either.

hanwen commented 6 years ago

it's hard to give specific answers without specific data.

files that are never indexed are usually too large or binary; see also https://github.com/google/zoekt/blob/2f0c63016fa7950a2c913acc0722b09c0435b22e/cmd/zoekt-git-index/main.go#L31

hanwen commented 6 years ago

which version are you using?

do the stats (bottom of page) indicate that data was skipped?

can you reduce the example to something smaller?

nikhilkalige commented 6 years ago

Stats

 Used 10M mem for 16953 documents (58M) from 1 repositories.

The file is a .c file with size 259K, which is greater than 128K.. That may indicate the problem.. However, it's hard to pass options into git_index_flags; I can't find a way to pass more than one flag into it.

Thanks..

hanwen commented 6 years ago

I think you can do -git_index_flags="-flag1 flag2"

https://github.com/google/zoekt/blob/2f0c63016fa7950a2c913acc0722b09c0435b22e/cmd/zoekt-indexserver/main.go#L153

if you want to clarify the help string there, that would be great.

nikhilkalige commented 6 years ago

I tried "-branches=master,develop sizeMax=1048576", "-branches=master,develop -sizeMax=1048576", and "'-branches=master,develop sizeMax=1048576'"

nikhilkalige commented 6 years ago

Maybe we could do strings.Split() and add - as a prefix, so that you could pass "flag1=data flag2=data"

hanwen commented 6 years ago

yeah, good idea. Send me a change.

hanwen commented 6 years ago

any further comments on "If I just look at the index file and grep it, I get more results than what I am seeing in the webpage."? Is this a cutoff by number of matches, or does it really not show up? (Try restricting to the file you know it should be in.)

nikhilkalige commented 6 years ago

I think increasing the size fixed it.. Let me investigate more and see if I can get more info if it recurs..

nikhilkalige commented 6 years ago

I still seem to get this problem, cat reponame.zoekt | grep stringval gives me 19 values, while the search gives me only 4. The stringval also shows up in files that are not as big as the one I mentioned in prior comments.

The files that contain stringval are indexed properly, as I can get good results for other values from these files. Does the length (42 characters) of the searched string matter?

hanwen commented 6 years ago

if you do search for stringval, and restrict the search to a file that you know contains it (using "f:path/to/file"), does that return the data?

(I'd also be happy to debug the shard directly, if you are able to share it privately with me.)

nikhilkalige commented 6 years ago

Yup, using f:path works..

Sorry :(, can't really share the data..

hanwen commented 6 years ago

can you check that for incomplete results, the following condition triggers? https://github.com/google/zoekt/blob/2f0c63016fa7950a2c913acc0722b09c0435b22e/eval.go#L159

nikhilkalige commented 6 years ago

I think this is related to the web server. If I run ./zoekt -index_dir /var/data/index/ "stringval" | wc -l, then I get 19 results.

hanwen commented 6 years ago

If that is true, the webserver should show a "Show more" link next to the results.

nikhilkalige commented 6 years ago

oops.. crap.. It does... My mistake.. sorry about bothering you.. I did not expect that..

hanwen commented 6 years ago

did you get many matches for "stringval" that were symbol definitions?

How large is the corpus (number of files, number of bytes)? You can query "r:"

I got bitten by this today as well. I think we should make this more visible.

nikhilkalige commented 6 years ago

Found 1 repositories (17517 files, 82Mb content)

I would say 11/19 are valid code and the remaining 8 are comments. If I try sym:stringval, I get 4 results. I considered every occurrence of stringval inside a piece of code as a symbol; maybe that's wrong?

The other problem seems to be numerous files that show up tagged Duplicate result, but with the same path.

hanwen commented 6 years ago

the sym: operator looks for symbol definitions, eg

class Blabla { .. }

in c++.

Looks like your files are tiny (~ 500 bytes each), which throws off some coarse heuristics for matchcount that I introduced.

Re: duplicate results, are you indexing multiple branches? Does your project use submodules? Which branches do the duplicate results come from?

nikhilkalige commented 6 years ago

I am trying to index three branches.. The result I get is something like

MdEmbed.c [branch1]
stringval

MdEmbed.c [branch1] DuplicateResult
MdEmbed.c [branch1] DuplicateResult

MdEmbed.c [branch2]
stringval

MdEmbed.c [branch2] DuplicateResult
MdEmbed.c [branch2] DuplicateResult

MdEmbed.c [branch3]
stringval

MdEmbed.c [branch3] DuplicateResult
MdEmbed.c [branch3] DuplicateResult

hanwen commented 6 years ago

that is weird. Each (branch, filename) combo should be there just once. How many files does a single branch have, and how many distinct (filename, filecontent) pairs should you have roughly?
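
One way to check the "just once" expectation against a list of results is to count how often each (branch, filename) pair appears. The Go sketch below is only an illustration: the result type and duplicateCounts are hypothetical and do not reflect zoekt's internal data structures, and the sample data mirrors the listing above.

```go
package main

import "fmt"

// result is a hypothetical stand-in for one search hit.
type result struct {
	Branch   string
	FileName string
}

// duplicateCounts returns every (branch, filename) pair that occurs more
// than once, together with the number of times it occurs.
func duplicateCounts(results []result) map[result]int {
	seen := make(map[result]int)
	for _, r := range results {
		seen[r]++
	}
	dups := make(map[result]int)
	for key, n := range seen {
		if n > 1 {
			dups[key] = n
		}
	}
	return dups
}

func main() {
	// Three hits per branch, as in the listing above.
	var hits []result
	for _, b := range []string{"branch1", "branch2", "branch3"} {
		for i := 0; i < 3; i++ {
			hits = append(hits, result{Branch: b, FileName: "MdEmbed.c"})
		}
	}
	for key, n := range duplicateCounts(hits) {
		fmt.Printf("%s [%s] appears %d times\n", key.FileName, key.Branch, n)
	}
}
```

If every pair were unique, the returned map would be empty.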

nikhilkalige commented 6 years ago

The 3 branches are almost the same, they are usually merged back and forth every 2-3 days.

hanwen commented 6 years ago

how many files does each branch have?

hanwen commented 6 years ago

see https://github.com/google/zoekt/issues/55

hanwen commented 6 years ago

can you try the latest version and see if it improved?

nikhilkalige commented 6 years ago

16977 files.. Awesome.. that was perfect