ranking: add IDF to BM25 score calculation

stefanhengl commented 5 months ago

So far, we didn't include IDF in our BM25 score function. Zoekt uses a trigram index and hence doesn't compute document frequency during indexing. We could add this information to the index, but it is not immediately obvious how to tokenize code in a way that is compatible with tokens from a natural language query.

Here we calulate the document frequency at query time under the assumption that we visit all documents containing any of the query terms.

Notes: Also fixed an off-by-1 bug with how we count documents.

Test plan:

Updated unit test
Context evaluation results are slightly worse with a decrease from 64/89 to 63/89

stefanhengl commented 5 months ago

golden queries evals

before

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      12/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      27/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 64/89

after

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      11/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      26/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 63/89

jtibshirani commented 5 months ago

This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement.

I'd love to see these evals in particular:

Compare to latest golden queries snapshot
Compare to golden queries without my latest change to identify and search symbol definitions
Run on CodeSearchNet too

stefanhengl commented 5 months ago

This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement.

I'd love to see these evals in particular:

Compare to latest golden queries snapshot

see above

Compare to golden queries without my latest change to identify and search symbol definitions

golden queries evals

This PR with Sourcegraph@4f465c5, which is the parent commit of your latest change to identify and search symbol definitions, see here

Breakdown by class:
Find symbol     9/10
Find string     2/2
Explain file    2/2
Explain concept 4/5
Check dependency        1/2
Find logic      31/43
Gather information      11/16
Changelog       0/2
Ownership       2/2
How-to  1/1
Foreign language        0/2
Long request    0/2

Breakdown by file type:
TSX     4/5
Golang  17/24
Typescript      26/38
C++     1/3
Markdown        7/10
Graphql 1/1
JSON    1/1
Go      3/4
Codeowners      2/2
Python  1/1

Combined recall 63/89

stefanhengl commented 5 months ago

CodeSearchNet

Sourcegraph@4f465c5

Before

Recall (files)  91/99
Recall (chunks) 75/99
Average chunk overlap   0.89

After

Recall (files)  89/99
Recall (chunks) 77/99
Average chunk overlap   0.88

sourcegraph / zoekt

ranking: add IDF to BM25 score calculation #788