sourcegraph / sourcegraph-public-snapshot


Incremental indexing to Zoekt #29731

Closed jjeffwarner closed 8 months ago

jjeffwarner commented 2 years ago

Introduction

Currently, Zoekt’s indexing process works like this:

  1. Periodically poll Sourcegraph’s API for a list of repositories that we should be indexing, along with the list of branches to index for each repository

  2. Some time later, check the current on-disk index state for that repository to see if it matches the state that the API says we should be at (same list of branches, same commits for those branches, etc.)

  3. If the on-disk state is different, then kick off a new indexing job. This consists of two main steps:

    1. Fetching the git data for all the branches

      git fetch origin commitA commitB ...
      
    2. Adding every file from the version of the repository at a given commit to the shard builder, generating new shards, and deleting the old shards

      
      for oldShard in shardBuilder.ListShards():
          os.rm(oldShard)

      for commit in ["commitA", "commitB", ...]:
          for file in repository.checkout(commit).getAllFiles():
              shardBuilder.Add(file)

      shardBuilder.Run()
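The "compare on-disk state against the API's desired state" check from step 2 can be sketched as follows. This is illustrative only; the type and function names here are hypothetical, not Zoekt's actual API:

```go
package main

import "fmt"

// indexState maps a branch name to the commit that is (or should be) indexed
// for it. In reality, the desired state comes from Sourcegraph's API and the
// current state comes from the metadata of the shards on disk.
type indexState map[string]string

// needsReindex reports whether the on-disk state differs from the desired
// state: a different set of branches, or a different commit for any branch.
func needsReindex(desired, indexed indexState) bool {
	if len(desired) != len(indexed) {
		return true
	}
	for branch, commit := range desired {
		if indexed[branch] != commit {
			return true
		}
	}
	return false
}

func main() {
	indexed := indexState{"main": "commitA"}
	fmt.Println(needsReindex(indexState{"main": "commitA"}, indexed)) // false: up to date
	fmt.Println(needsReindex(indexState{"main": "commitB"}, indexed)) // true: new commit on main
}
```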

Commentary

This existing workflow is simple, but there are aspects of it that aren’t ideal for monorepo indexing performance:

  1. We fetch the repository’s git data from scratch for every single indexing job. Even if the latest commit only added a single file, we’d still have to fetch all the data required to construct the entire work tree.
  2. We index every file that’s present in the version of the repository at a given commit, even if that file wasn’t modified in the commit.

One can imagine that large monorepos exacerbate both of these issues.

Delta builds

Our idea is to see if we can leverage previously computed index state so that we only do an amount of work that’s proportional to the number of files that changed in the latest commit.

To that end, we’re introducing the concept of delta builds. Unlike normal builds that throw away the previous state and re-index everything from scratch, delta builds work by stacking changes on top of the existing state.

Delta builds produce delta shards, which only contain index data from the changed files in the latest commit.

Delta builds also update the metadata in older shards to indicate that a particular version of a file is “outdated”. This is done with a marker called a file tombstone, which indicates that any search results from a shard that involve this file should be ignored.

Our belief is that indexing only the changed files in a commit should be much less expensive than re-indexing the entire repository over and over again. As a result, delta builds should make search results from the latest commits available much more quickly than normal builds. However, this scheme introduces a number of sub-problems that need to be solved.
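At a high level, a delta build boils down to two bookkeeping operations. A minimal sketch, with entirely hypothetical names (Zoekt's real shard format is considerably more involved):

```go
package main

import "fmt"

// applyDelta records a delta build: the changed files go into a new delta
// shard, and the same paths are tombstoned in the pre-existing shards so
// that their stale copies stop matching at search time.
func applyDelta(changedFiles []string, existingTombstones map[string]bool) (deltaShard []string) {
	for _, path := range changedFiles {
		existingTombstones[path] = true // the old copy is now outdated
		deltaShard = append(deltaShard, path)
	}
	return deltaShard
}

func main() {
	tombstones := map[string]bool{}
	shard := applyDelta([]string{"cmd/main.go", "README.md"}, tombstones)
	fmt.Println(shard)                     // the new delta shard's files
	fmt.Println(tombstones["cmd/main.go"]) // true: hidden in older shards
}
```

The key property is that the cost of this step is proportional to the number of changed files, not to the total size of the repository.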

Sub-problems

1. Making Zoekt’s searcher aware of FileTombstones

Zoekt’s searcher needs to be aware of a new field, FileTombstones, understand how to interpret it, and skip over files if they’re tombstoned.
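What "skip over tombstoned files" means at search time can be illustrated with a toy shard. The real field lives in Zoekt's shard metadata (see the PR below); everything here is a simplified stand-in:

```go
package main

import (
	"fmt"
	"strings"
)

// shard pairs some indexed file contents with FileTombstones metadata.
// A tombstoned path means "a newer delta shard supersedes this file;
// ignore any matches for it from this shard".
type shard struct {
	files          map[string]string   // path -> contents (toy index)
	fileTombstones map[string]struct{} // paths to skip at search time
}

// search returns the paths in this shard whose contents contain the query,
// skipping any file that has been tombstoned.
func (s *shard) search(query string) []string {
	var hits []string
	for path, contents := range s.files {
		if _, dead := s.fileTombstones[path]; dead {
			continue // superseded by a later delta shard
		}
		if strings.Contains(contents, query) {
			hits = append(hits, path)
		}
	}
	return hits
}

func main() {
	s := &shard{
		files: map[string]string{
			"old.go": "func main() {}",
			"new.go": "func main() { run() }",
		},
		fileTombstones: map[string]struct{}{"old.go": {}},
	}
	fmt.Println(s.search("main")) // old.go matches but is skipped
}
```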

Status:

This was implemented in https://github.com/sourcegraph/zoekt/pull/273. There are aspects of it that will probably need to be optimized for the initial release (see https://github.com/sourcegraph/zoekt/pull/273/files#diff-07fd8e51cdf14130405a961100cecaf6bdc47e746a4d9ea1de1c06422fc7e1fdR248-R251), but this isn’t necessary for the prototype.

2. Making Zoekt’s shard builder properly manipulate the index state for delta builds

Zoekt’s shard builder now needs to know how to do the following for delta builds:

  1. Write new delta shards that contain only the changed files
  2. Update the metadata of older shards to add file tombstones for the outdated versions of those files

Status

This was implemented in https://github.com/sourcegraph/zoekt/pull/274.

3. Instrumenting the callers of Zoekt’s shard builder for delta builds

For delta builds, Zoekt now needs to:

  1. Determine the set of files that changed between the previously indexed commits and the new ones
  2. Pass only those changed files (along with the paths to tombstone) to the shard builder

In addition, Zoekt needs to know under what circumstances it’ll need to “fall back” to a normal build. Examples:

  1. The repository hasn’t been indexed before, so there is no prior state to build on top of
  2. The list of branches to index has changed
  3. Too many delta shards have accumulated (see the “Compaction” section)
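A sketch of the kind of fallback check involved, drawing only on conditions mentioned elsewhere in this issue (the 150-shard threshold comes from the "Compaction" section; the function and parameter names are hypothetical):

```go
package main

import "fmt"

// maxDeltaShards mirrors the configurable threshold described in the
// "Compaction" section below.
const maxDeltaShards = 150

// shouldFallback reports whether a full (normal) build must run instead of
// a delta build. The conditions are illustrative, not Zoekt's exact logic.
func shouldFallback(haveExistingShards, branchesChanged bool, shardCount int) bool {
	switch {
	case !haveExistingShards: // nothing to stack a delta on top of
		return true
	case branchesChanged: // the indexed branch set no longer matches
		return true
	case shardCount >= maxDeltaShards: // too many shards accumulated
		return true
	}
	return false
}

func main() {
	fmt.Println(shouldFallback(true, false, 3))   // false: delta build is fine
	fmt.Println(shouldFallback(true, false, 200)) // true: compaction threshold hit
}
```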

Status

This is complete.

4. Error recovery

Zoekt’s normal builds delete the old state for every indexing job. This meant that if something went wrong (an error during the build process itself, a Zoekt crash, etc.), Zoekt would eventually recover, since the old state would eventually be cleared out.

Delta builds can’t delete older shards: their fundamental operation relies on stacking new changes on top of the existing state. Any corrupted state left behind by an error won’t be passively cleared out, so we’ll need some sort of mechanism to ensure that we never enter such a state in the first place.

Status

A stop-gap solution has been implemented. See the “Compaction” section for more details; the stop-gap solution for that sub-problem also addresses this one.

5. Compaction

Each new shard introduces a similar overhead in memory usage and search time, so search latency can degrade as shards accumulate. Since delta builds add new shards for each commit, this might be a problem for repositories with high commit frequency. We’ll need to introduce a process to merge delta shards into a single shard.
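Conceptually, merging would collapse a chain of delta shards into one, letting the newest version of each file win and dropping tombstoned files that were deleted outright. A toy sketch under those assumptions (a "shard" here is just a path-to-contents map, nothing like Zoekt's real shard format):

```go
package main

import "fmt"

// shard is a toy delta shard: the file versions it introduced, plus the
// paths whose older versions it superseded or deleted.
type shard struct {
	files      map[string]string // path -> contents in this delta
	tombstones []string          // paths this delta tombstoned in older shards
}

// mergeShards collapses an ordered chain of shards (oldest first) into one:
// each shard first removes the file versions it tombstoned, then contributes
// its own file versions.
func mergeShards(chain []shard) map[string]string {
	merged := map[string]string{}
	for _, s := range chain {
		for _, path := range s.tombstones {
			delete(merged, path) // drop the stale copy from older shards
		}
		for path, contents := range s.files {
			merged[path] = contents
		}
	}
	return merged
}

func main() {
	base := shard{files: map[string]string{"a.go": "v1", "b.go": "v1"}}
	delta := shard{
		files:      map[string]string{"a.go": "v2"}, // a.go changed
		tombstones: []string{"a.go", "b.go"},        // b.go was deleted
	}
	fmt.Println(mergeShards([]shard{base, delta})) // only a.go v2 survives
}
```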

Status

A stop-gap solution has been implemented. After every 150 shards (the number is configurable), we fall back to a full normal build, which deletes all of the older shards, including the delta ones.

This solution has a number of drawbacks (the fallback might happen at an inopportune time, larger repositories are disproportionately affected, etc.). However, this solution is easy to understand, gives us a free error handling mechanism (since we delete all the older shards), limits the amount of memory pressure that the delta shards can introduce, and clearly still has better overall throughput than the old strategy.

This simple strategy is in a good enough place that customers could still try this and benefit from it. However, we might need a more robust solution to the compaction problem depending on their feedback.

6. Git fetch costs

The delta build process described above doesn’t address the network cost of fetching all the git data required to construct the repository at a given commit from scratch. (In fact, as implemented, it could fetch slightly more git data than before, since we need two commits in order to calculate the git diff.)

There are a few ways to address this, but the most attractive option is probably modifying gitserver to calculate the diff itself.
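Whichever component ends up computing it, the changed-file set can be derived from `git diff --name-status <old> <new>` output. A sketch of parsing that format into "files to index" and "files to tombstone", assuming rename detection is disabled so each line is `<status>\t<path>`:

```go
package main

import (
	"fmt"
	"strings"
)

// parseNameStatus parses `git diff --name-status old new` output (with
// rename detection disabled) into the files to (re)index and the files to
// tombstone. Status letters: A = added, M = modified, D = deleted.
func parseNameStatus(out string) (index, tombstone []string) {
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		parts := strings.SplitN(line, "\t", 2)
		if len(parts) != 2 {
			continue // skip malformed or empty lines
		}
		status, path := parts[0], parts[1]
		switch status {
		case "A": // new file: index it, nothing to tombstone
			index = append(index, path)
		case "M": // changed file: tombstone the old copy, index the new one
			index = append(index, path)
			tombstone = append(tombstone, path)
		case "D": // deleted file: tombstone only
			tombstone = append(tombstone, path)
		}
	}
	return index, tombstone
}

func main() {
	out := "M\tcmd/main.go\nA\tdocs/new.md\nD\told.go\n"
	index, tombstone := parseNameStatus(out)
	fmt.Println(index)     // [cmd/main.go docs/new.md]
	fmt.Println(tombstone) // [cmd/main.go old.go]
}
```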

Status

Not implemented, and not necessary for the MVP.

We profiled the indexing costs for sgtest/megarepo, and the network transfer time only accounted for 1/3 of the total cost. From that, we decided that optimizing the shard build process is a better use of our time for now.

github-actions[bot] commented 2 years ago

Heads up @jjeffwarner - the "team/search-core" label was applied to this issue.

stefanhengl commented 8 months ago

This issue has been inactive for a long time. To reopen the ticket, please let us know how to reproduce the issue on latest main. For feature requests, please let us know what is still missing.