ubiquity-os-marketplace / generate-vector-embeddings


Codebase semantic search #10


0x4007 commented 1 week ago

For onboarding new developers, or just to keep existing developers in sync, we should be generating embeddings of our codebase. We can embed per file or per function, depending on which yields better retrieval performance.
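A rough sketch of what the per-file path could look like (a per-function variant would split each file with a parser before embedding). `embed` and `upsertEmbedding` are hypothetical placeholders for whichever embedding model and vector store we end up choosing:

```typescript
// Sketch only: walk the repo, embed each source file, and store the vectors.
import { readdir, readFile } from "node:fs/promises";
import path from "node:path";

// Hypothetical helpers: model call and vector-store write.
declare function embed(text: string): Promise<number[]>;
declare function upsertEmbedding(id: string, vector: number[]): Promise<void>;

async function* walk(dir: string): AsyncGenerator<string> {
  for (const entry of await readdir(dir, { withFileTypes: true })) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) yield* walk(full);
    else yield full;
  }
}

export async function embedRepository(root: string): Promise<void> {
  for await (const filePath of walk(root)) {
    if (!filePath.endsWith(".ts")) continue; // restrict to source files for now
    const source = await readFile(filePath, "utf8");
    // Per-file granularity: one vector per file, keyed by its path.
    // A per-function variant would parse `source` and embed each function body.
    await upsertEmbedding(filePath, await embed(source));
  }
}
```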

I asked GPT for the best way to sort the embedding leaderboard.

Later it clarified that we should be sorting by "retrieval with instructions" for best results.

FollowIR-7B is the highest ranked.

We should be doing this in two steps:

  1. Primarily, we should generate embeddings for the changed files whenever a pull request is merged.
  2. We should have a tool that we can run manually (manual dispatch on GitHub Actions is fine) to regenerate all the embeddings. This would be used when commits are pushed directly to the default branch rather than merged via pull request (see the sketch after this list).
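A minimal sketch of step 1 (incremental embedding on PR merge), assuming an authenticated Octokit instance and a hypothetical `embedFile` helper that reads the file at the merge commit, embeds it, and upserts the vector. Step 2 would simply call the full `embedRepository` routine above from a `workflow_dispatch`-triggered job:

```typescript
// Sketch: re-embed only the files touched by a merged pull request.
import { Octokit } from "@octokit/rest";

// Hypothetical helper: fetch file contents at the merge commit, embed, upsert.
declare function embedFile(owner: string, repo: string, filePath: string): Promise<void>;

export async function embedMergedPullRequest(
  octokit: Octokit,
  owner: string,
  repo: string,
  pullNumber: number
): Promise<void> {
  // Only the files the merged PR touched need to be (re)embedded.
  const files = await octokit.paginate(octokit.rest.pulls.listFiles, {
    owner,
    repo,
    pull_number: pullNumber,
  });
  for (const file of files) {
    if (file.status === "removed") continue; // deleted files need no embedding
    await embedFile(owner, repo, file.filename);
  }
}
```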

Additional context

For this use case (codebase question-answering), your focus should be on retrieval tasks. Based on the criteria provided, here’s how I would prioritize the metrics:

  1. Retrieval Average (15 datasets): This should be your highest priority. The better the retrieval capability of the model, the more relevant code chunks it will retrieve when you ask a question about the codebase.

  2. Embedding Dimensions: A higher embedding dimension may provide more nuanced representations of code and questions, improving retrieval accuracy. However, this needs to be balanced with memory usage and model size.

  3. Model Size (Million Parameters): Larger models tend to perform better in generating high-quality embeddings, but at the cost of memory and speed. Consider how much memory and computational power you can afford.

  4. Max Tokens: A higher max token limit is useful for code because some functions or files can be quite large. You'll want a model that can handle bigger chunks of code.

  5. Classification Average (12 datasets): Code-related tasks sometimes involve classification (e.g., determining the type of question or identifying sections of code). A higher classification score can help in such scenarios.

  6. STS Average (10 datasets): Semantic Textual Similarity (STS) is also important as it measures how well the embeddings capture semantic meaning, which is useful for understanding the context and retrieving the right code section.

Sorting Criteria:

Sort by Retrieval Average first, then consider Max Tokens and Embedding Dimensions for practical handling of code and performance optimization.

Other metrics, such as Classification, Clustering, and Reranking, are less critical for this specific use case, but they can help refine quality if you have a secondary need for such tasks.
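Putting the retrieval-first advice into practice, the query side could look roughly like this: embed the question, score it against the stored vectors with cosine similarity, and return the top matches. `embed` is the same hypothetical model call as in the indexing sketch above; a real deployment would likely delegate the scoring to a vector database instead of scanning in memory.

```typescript
// Sketch of retrieval over the stored embeddings.
declare function embed(text: string): Promise<number[]>;

interface StoredEmbedding {
  path: string;     // file (or function) the vector was generated from
  vector: number[]; // embedding produced at index time
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

export async function searchCodebase(
  question: string,
  index: StoredEmbedding[],
  topK = 5
): Promise<StoredEmbedding[]> {
  const queryVector = await embed(question);
  // Rank every stored chunk by similarity to the question and keep the best K.
  return [...index]
    .sort(
      (a, b) =>
        cosineSimilarity(queryVector, b.vector) -
        cosineSimilarity(queryVector, a.vector)
    )
    .slice(0, topK);
}
```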