google / zoekt

Fast trigram based code search

Ability to configure maxTrigramCount #99

Closed (ngirard closed 4 years ago)

ngirard commented 4 years ago

I wanted to see how Zoekt behaves with web pages, especially those containing source code blocks within <code> tags.

I downloaded and indexed a sample web page, and couldn't find it within Zoekt's query results.

Steps to reproduce:

  1. mkdir -p ~/sandboxes/www/blog.burntsushi.net
  2. Save https://blog.burntsushi.net/transducers/ as ~/sandboxes/www/blog.burntsushi.net/transducers.html using Firefox
  3. $GOPATH/bin/zoekt-index ~/sandboxes/www
  4. Visit http://localhost:6070/search?q=set.fst&num=50

I expected to see transducers.html in the results, as the page does contain set.fst, but the query returned nothing.

Querying other terms gave the same result.

hanwen commented 4 years ago

When you search for transducer and click the result, you'll see:

NOT-INDEXED: document size 198439 larger than limit 131072

There is a flag to control the max file size.

ngirard commented 4 years ago

Oh, good catch, thanks!

Unfortunately, doing

$GOPATH/bin/zoekt-index -file_limit 5242880 ~/sandboxes/www

leads to another error:

NOT-INDEXED: number of trigrams exceeds 20000

If I understand correctly, maxTrigramCount is declared as a const and cannot be changed from the command line.

hanwen commented 4 years ago

https://gerrit-review.googlesource.com/c/zoekt/+/233136

hanwen commented 4 years ago

fixed in 8a675eb1298df7f61916323717ab57c122678e09

ngirard commented 4 years ago

Great, thanks!