google / zoekt

Fast trigram based code search

ordering of search results is affected by Max Results #89

Open ijt opened 5 years ago

ijt commented 5 years ago

Increasing the max results can affect the ordering of the search results. Here is an example.

[Screenshot: Screen Shot 2019-07-03 at 12 02 19]

Having stable ordering of search results would be a useful property and less surprising for users.

ijt commented 5 years ago

I would be willing to work on this.

hanwen commented 5 years ago

How would you do it?

Basically, the search engine shows the best result on top. If you search over a larger corpus, you can find better matches, which displace other results and change the ordering.

ijt commented 5 years ago

One possibility would be to present the results in the order they occur within the posting lists. Items in posting lists could have an additional weight field corresponding to their estimated general relevance. The posting lists could be sorted according to the weights, and re-sorted as necessary.

That would degrade the ordering though. The question needs some more thought.

hanwen commented 5 years ago

@ijt - If you mean "index shard" when you say "posting list", this is exactly how it works already.

Within a shard, files are ordered by importance (important files first), so, all else being equal, you get matches from non-test files before matches from test files.

Then the shards themselves are ordered by a "quality" score, which is mainly driven by the GitHub star count. So matches in github.com/google/guava get preference over matches in android.googlesource.com/platform/external/guava, even though the content is the same.
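A rough sketch of that two-level ordering, in case it helps to see it spelled out. This is not zoekt's actual code: the `repoShard` and `file` types, the `importance` heuristic, and the star counts are all made up for illustration.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// file and repoShard are hypothetical, simplified types (not zoekt's own).
type file struct{ Name string }

type repoShard struct {
	Repo  string
	Stars int // placeholder value, not a real star count
	Files []file
}

// importance is an assumed toy heuristic: test files rank below other files.
func importance(f file) int {
	if strings.Contains(strings.ToLower(f.Name), "test") {
		return 0
	}
	return 1
}

// orderShards sorts shards by star count (highest first) and, within each
// shard, sorts files by importance (most important first).
func orderShards(shards []repoShard) {
	sort.SliceStable(shards, func(i, j int) bool {
		return shards[i].Stars > shards[j].Stars
	})
	for _, s := range shards {
		files := s.Files // shares the backing array, so the sort is in place
		sort.SliceStable(files, func(i, j int) bool {
			return importance(files[i]) > importance(files[j])
		})
	}
}

// demoShards returns deliberately unsorted example data.
func demoShards() []repoShard {
	return []repoShard{
		{
			Repo:  "android.googlesource.com/platform/external/guava",
			Stars: 0,
			Files: []file{{"test/ListsTest.java"}, {"src/Lists.java"}},
		},
		{
			Repo:  "github.com/google/guava",
			Stars: 100,
			Files: []file{{"test/ListsTest.java"}, {"src/Lists.java"}},
		},
	}
}

func main() {
	shards := demoShards()
	orderShards(shards)
	for _, s := range shards {
		for _, f := range s.Files {
			fmt.Println(s.Repo, f.Name)
		}
	}
}
```

Walking the shards and files in this order yields matches from the higher-ranked repository first, and non-test files before test files within each repository.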

The problem is that matches have quality. If you are looking for "idiot", then the word "idiot" in an unimportant shard is a better match than the identifier "bidiOther" in an important shard.

If you increase the result count to include the unimportant shard, inevitably, this will upset the ordering.
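To make the effect concrete, here is a toy model of it. It is only a sketch under stated assumptions: the `match` and `shard` types, the scores, and the `search` function are hypothetical, and the early stop at `maxResults` stands in for whatever limits the real search applies.

```go
package main

import (
	"fmt"
	"sort"
)

// match and shard are hypothetical, simplified types (not zoekt's own).
type match struct {
	Text  string
	Score float64 // assumed per-match quality: a whole word beats a substring
}

type shard struct {
	Name    string
	Matches []match
}

// search visits shards in the given order (most important first), stops
// collecting once maxResults matches are found, then sorts the collected
// matches by score, best first.
func search(shards []shard, maxResults int) []match {
	var out []match
collect:
	for _, s := range shards {
		for _, m := range s.Matches {
			if len(out) == maxResults {
				break collect
			}
			out = append(out, m)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Score > out[j].Score
	})
	return out
}

// demoShards models a query for "idiot": substring hits in an important
// shard, and a whole-word hit in an unimportant one.
func demoShards() []shard {
	return []shard{
		{Name: "important", Matches: []match{
			{Text: "bidiOther (a.go)", Score: 1},
			{Text: "bidiOther (b.go)", Score: 1},
		}},
		{Name: "unimportant", Matches: []match{
			{Text: "idiot", Score: 2},
		}},
	}
}

func main() {
	for _, limit := range []int{2, 3} {
		fmt.Println("max results:", limit)
		for _, m := range search(demoShards(), limit) {
			fmt.Println("  ", m.Text)
		}
	}
}
```

With a limit of 2 the search never reaches the unimportant shard, so a "bidiOther" hit is on top. With a limit of 3 the whole-word "idiot" hit is found and moves to the top, pushing the earlier results down.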

One way out of this is to have a cheaper way to find quality matches. For example, currently we have one section for file names and one for file contents. If you added a separate corpus section for symbols (which would be smaller than file contents), you could search more shards with the same amount of CPU.