Closed stefanhengl closed 5 months ago
golden queries evals
before
Breakdown by class:
Find symbol 9/10
Find string 2/2
Explain file 2/2
Explain concept 4/5
Check dependency 1/2
Find logic 31/43
Gather information 12/16
Changelog 0/2
Ownership 2/2
How-to 1/1
Foreign language 0/2
Long request 0/2
Breakdown by file type:
TSX 4/5
Golang 17/24
Typescript 27/38
C++ 1/3
Markdown 7/10
Graphql 1/1
JSON 1/1
Go 3/4
Codeowners 2/2
Python 1/1
Combined recall 64/89
after
Breakdown by class:
Find symbol 9/10
Find string 2/2
Explain file 2/2
Explain concept 4/5
Check dependency 1/2
Find logic 31/43
Gather information 11/16
Changelog 0/2
Ownership 2/2
How-to 1/1
Foreign language 0/2
Long request 0/2
Breakdown by file type:
TSX 4/5
Golang 17/24
Typescript 26/38
C++ 1/3
Markdown 7/10
Graphql 1/1
JSON 1/1
Go 3/4
Codeowners 2/2
Python 1/1
Combined recall 63/89
This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement.
I'd love to see these evals in particular:
This looks good! Even if it doesn't improve the results over our current baseline, I feel good about this since we should really be implementing BM25 properly. Also we've definitely overfit to the golden queries by now, so I'm not too worried if we don't see a clear improvement.
I'd love to see these evals in particular:
- Compare to latest golden queries snapshot
see above
- Compare to golden queries without my latest change to identify and search symbol definitions
golden queries evals
This PR with Sourcegraph@4f465c5
, which is the parent commit of your latest change to identify and search symbol definitions, see here
Breakdown by class:
Find symbol 9/10
Find string 2/2
Explain file 2/2
Explain concept 4/5
Check dependency 1/2
Find logic 31/43
Gather information 11/16
Changelog 0/2
Ownership 2/2
How-to 1/1
Foreign language 0/2
Long request 0/2
Breakdown by file type:
TSX 4/5
Golang 17/24
Typescript 26/38
C++ 1/3
Markdown 7/10
Graphql 1/1
JSON 1/1
Go 3/4
Codeowners 2/2
Python 1/1
Combined recall 63/89
CodeSearchNet
Sourcegraph@4f465c5
Before
Recall (files) 91/99
Recall (chunks) 75/99
Average chunk overlap 0.89
After
Recall (files) 89/99
Recall (chunks) 77/99
Average chunk overlap 0.88
So far, we didn't include IDF in our BM25 score function. Zoekt uses a trigram index and hence doesn't compute document frequency during indexing. We could add this information to the index, but it is not immediately obvious how to tokenize code in a way that is compatible with tokens from a natural language query.
Here we calulate the document frequency at query time under the assumption that we visit all documents containing any of the query terms.
Notes: Also fixed an off-by-1 bug with how we count documents.
Test plan: