Closed by jjeffwarner 8 months ago
Introduction
Currently, Zoekt’s indexing process works like this:
Periodically poll Sourcegraph’s API for a list of repositories that we should be indexing, along with a list of branches to index for each one.
Some time later, check the current on-disk index state for that repository (sourcegraph/sourcegraph in this case) to see if it matches the state that the API said we should be at (same list of branches, same commits for those branches, etc.). If the on-disk state is different, kick off a new indexing job. This consists of two main steps:
Fetching the git data for all the branches
Adding every file from the version of the repository at a given commit to the shard builder, generating new shards, and deleting the old shards
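The decision step above can be sketched roughly as follows. This is an illustrative toy, not Zoekt's actual types or API; `RepoState` and `needsReindex` are made-up names:

```go
package main

import "fmt"

// RepoState describes which commit each branch should be indexed at.
// The type is illustrative; Zoekt's real metadata is richer than this.
type RepoState map[string]string // branch name -> commit SHA

// needsReindex compares the state reported by Sourcegraph's API with
// the state recorded in the on-disk shards.
func needsReindex(want, have RepoState) bool {
	if len(want) != len(have) {
		return true // a branch was added or removed
	}
	for branch, commit := range want {
		if have[branch] != commit {
			return true // a branch moved to a new commit
		}
	}
	return false
}

func main() {
	want := RepoState{"main": "abc123", "release": "def456"}
	have := RepoState{"main": "abc122", "release": "def456"}
	// "main" is at a different commit on disk, so a rebuild is needed.
	fmt.Println(needsReindex(want, have))
}
```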
Commentary
This existing workflow is simple, but aspects of it aren’t ideal for monorepo indexing performance:
One can imagine that large monorepos exacerbate both of these issues.
Delta builds
Our idea is to see if we can leverage previously computed index state so that we only do an amount of work proportional to the number of files that changed in the latest commit.
To that end, we’re introducing the concept of delta builds. Unlike normal builds that throw away the previous state and re-index everything from scratch, delta builds work by stacking changes on top of the existing state.
Delta builds produce delta shards, which only contain index data from the changed files in the latest commit.
Delta builds also update the metadata in older shards to indicate that a particular version of a file is “outdated”. This is done with a marker called a file tombstone, which indicates that any search results from a shard that involve this file should be ignored.
Our belief is that indexing only the changed files in a commit should be much less expensive than re-indexing the entire repository over and over again. As a result, delta builds should make search results from the latest commits available much more quickly than normal builds. However, this scheme introduces a number of sub-problems that need to be solved.
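The core idea can be sketched as follows. The `Shard` type and `deltaBuild` helper are invented for illustration and stand in for Zoekt's real shard format and builder:

```go
package main

import "fmt"

// Shard is a toy stand-in for an on-disk index shard.
type Shard struct {
	Docs       map[string]string // file path -> indexed content
	Tombstones map[string]bool   // file paths whose results must be ignored
}

// deltaBuild indexes only the files changed in the new commit into a fresh
// "delta" shard, and tombstones those same paths in every older shard so
// that stale versions of the files stop appearing in search results.
func deltaBuild(older []*Shard, changed map[string]string) *Shard {
	delta := &Shard{Docs: map[string]string{}, Tombstones: map[string]bool{}}
	for path, content := range changed {
		delta.Docs[path] = content
		for _, s := range older {
			if _, ok := s.Docs[path]; ok {
				s.Tombstones[path] = true
			}
		}
	}
	return delta
}

func main() {
	old := &Shard{Docs: map[string]string{"a.go": "v1", "b.go": "v1"}, Tombstones: map[string]bool{}}
	delta := deltaBuild([]*Shard{old}, map[string]string{"a.go": "v2"})
	// "a.go" is now tombstoned in the old shard; only that one file was re-indexed.
	fmt.Println(old.Tombstones["a.go"], len(delta.Docs))
}
```

Note that the old shards are only touched to update metadata; none of their posting lists are rebuilt, which is where the cost savings come from.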
Sub-problems
1. Making Zoekt’s searcher aware of FileTombstones
Zoekt’s searcher needs to be aware of a new field, FileTombstones, and understand how to interpret it and skip over files that are tombstoned.
Status
This was implemented in https://github.com/sourcegraph/zoekt/pull/273. There are aspects of this that will probably need to be made performant for the initial release (see https://github.com/sourcegraph/zoekt/pull/273/files#diff-07fd8e51cdf14130405a961100cecaf6bdc47e746a4d9ea1de1c06422fc7e1fdR248-R251), but this isn’t necessary for the prototype.
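Conceptually, the searcher-side change is a filter over candidate results. The sketch below is a simplified illustration with made-up types, not the code from the PR above:

```go
package main

import "fmt"

// FileMatch is a toy result type: a file that matched a query in some shard.
type FileMatch struct{ Path string }

// filterTombstoned drops matches for files that the shard's metadata has
// marked as outdated via a file tombstone.
func filterTombstoned(matches []FileMatch, tombstones map[string]bool) []FileMatch {
	kept := matches[:0] // reuse the backing array; no extra allocation
	for _, m := range matches {
		if !tombstones[m.Path] {
			kept = append(kept, m)
		}
	}
	return kept
}

func main() {
	matches := []FileMatch{{Path: "a.go"}, {Path: "b.go"}}
	live := filterTombstoned(matches, map[string]bool{"a.go": true})
	fmt.Println(len(live), live[0].Path) // prints: 1 b.go
}
```

As the PR discussion notes, a naive per-result lookup like this may need to be made more performant for the initial release, but it captures the semantics.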
2. Making Zoekt’s shard builder properly manipulate the index state for delta builds
Zoekt’s shard builder now needs to know how to do the following for delta builds:
Update the FileTombstones fields on all the older shards (and only the older shards)
Status
This was implemented in https://github.com/sourcegraph/zoekt/pull/274.
3. Instrumenting the callers of Zoekt’s shard builder for delta builds
For delta builds, Zoekt now needs to:
In addition, Zoekt needs to know under what circumstances it’ll need to “fall back” to a normal build. Examples:
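A fallback check might look something like the sketch below. The specific trigger conditions shown are assumptions for illustration (no prior shards to stack on, a changed branch set, and the shard-count threshold described in the "Compaction" section); the real list of conditions lives in the caller:

```go
package main

import "fmt"

// BuildOptions is a toy stand-in for the knobs a caller passes to the builder.
type BuildOptions struct {
	ExistingShards int      // delta + normal shards already on disk
	MaxShards      int      // threshold after which we compact via a full build
	WantBranches   []string // branches the API says we should index
	HaveBranches   []string // branches recorded in the existing shards
}

// useFullBuild reports whether we must fall back to a normal (full) build,
// and why. The conditions here are illustrative, not Zoekt's actual logic.
func useFullBuild(o BuildOptions) (bool, string) {
	if o.ExistingShards == 0 {
		return true, "no existing shards to build a delta on top of"
	}
	// Comparing via fmt.Sprint is a shortcut for this sketch; real code
	// would compare the branch lists element by element.
	if fmt.Sprint(o.WantBranches) != fmt.Sprint(o.HaveBranches) {
		return true, "the set of indexed branches changed"
	}
	if o.ExistingShards >= o.MaxShards {
		return true, "shard count hit the compaction threshold"
	}
	return false, ""
}

func main() {
	full, why := useFullBuild(BuildOptions{ExistingShards: 150, MaxShards: 150,
		WantBranches: []string{"main"}, HaveBranches: []string{"main"}})
	fmt.Println(full, why)
}
```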
Status
This is complete.
4. Error recovery
Zoekt’s normal builds delete the old state on every indexing job. This meant that if something went wrong (the build failed partway through, Zoekt crashed, etc.), Zoekt would eventually recover, since the old state would eventually be cleared out.
Delta builds can’t delete older shards: their fundamental operation relies on stacking new changes on top of the existing state. Any errors won’t be passively cleared out, so we’ll need some mechanism to ensure that we never enter a broken state in the first place.
Status
A stopgap solution has been implemented. See the "Compaction" section for more details; the stopgap solution for that sub-problem also addresses this one.
5. Compaction
Each new shard adds a similar overhead in memory usage and search time, so search latency can degrade as we keep adding shards. Since delta builds add a new shard for every commit, this could be a problem for repositories with high commit frequency. We’ll need to introduce a process to merge delta shards into a single shard.
Status
A stopgap solution has been implemented: after every 150 shards (the number is configurable), we fall back to a full normal build, which deletes all of the older shards, including the delta ones.
This solution has a number of drawbacks (the fallback might happen at an inopportune time, larger repositories are disproportionately affected, etc.). However, this solution is easy to understand, gives us a free error handling mechanism (since we delete all the older shards), limits the amount of memory pressure that the delta shards can introduce, and clearly still has better overall throughput than the old strategy.
This simple strategy is good enough that customers could still try delta builds and benefit from them. However, we might need a more robust solution to the compaction problem depending on their feedback.
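The stopgap behaves like a simple counter, which can be sketched as follows (150 is the configurable default mentioned above; `indexCommit` is a made-up name for the per-commit decision):

```go
package main

import "fmt"

// indexCommit simulates the stopgap policy: each delta build stacks one new
// shard, and once the shard count reaches the (configurable) threshold we
// run a full build, which replaces everything with a single fresh shard.
func indexCommit(shards, threshold int) (newShards int, fullBuild bool) {
	if shards >= threshold {
		return 1, true // full build: delete all old shards, write one new one
	}
	return shards + 1, false // delta build: stack one more shard
}

func main() {
	shards := 1
	fullBuilds := 0
	for commit := 0; commit < 300; commit++ {
		var full bool
		shards, full = indexCommit(shards, 150)
		if full {
			fullBuilds++
		}
	}
	// Shard count is bounded by the threshold, at the price of an
	// occasional expensive full build.
	fmt.Println(shards, fullBuilds)
}
```

This makes the trade-off concrete: memory pressure from shard accumulation is capped, and each full build doubles as the error-recovery mechanism described in sub-problem 4.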
6. Git fetch costs
The delta build process described above doesn’t address the network cost of fetching all the git data required to construct the repository at a given commit from scratch. (In fact, as implemented it could fetch slightly more git data than before, since we need two commits in order to calculate the git diff.)
There are a few ways to address this, but the most attractive option is probably modifying gitserver to calculate the diff itself.
Status
Not implemented, and not necessary for the MVP.
We profiled the indexing costs for sgtest/megarepo, and network transfer time accounted for only one third of the total cost. From that, we decided that optimizing the shard build process is a better use of our time for now.
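For illustration, the changed-file set that a delta build consumes can be computed with a plain `git diff` between the two commits. The wiring below is hypothetical (the actual proposal is for gitserver to compute this server-side), but the git invocation itself is standard:

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// parseNameOnly turns `git diff --name-only` output into a slice of paths.
func parseNameOnly(out string) []string {
	var files []string
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		if line != "" {
			files = append(files, line)
		}
	}
	return files
}

// changedFiles lists the files that differ between two commits; this is
// exactly the input a delta build needs.
func changedFiles(repoDir, oldCommit, newCommit string) ([]string, error) {
	out, err := exec.Command("git", "-C", repoDir,
		"diff", "--name-only", oldCommit, newCommit).Output()
	if err != nil {
		return nil, err
	}
	return parseNameOnly(string(out)), nil
}

func main() {
	files, err := changedFiles(".", "HEAD~1", "HEAD")
	if err != nil {
		fmt.Println("git diff failed:", err)
		return
	}
	fmt.Println(files)
}
```

Moving this computation into gitserver would avoid transferring the full git data to the indexer just to diff two commits that gitserver already has locally.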