sourcegraph / sourcegraph-public-snapshot


Incremental indexing to Zoekt #29731

Closed jjeffwarner closed 8 months ago

jjeffwarner commented 2 years ago

Introduction

Currently, Zoekt’s indexing process works like this:

  1. Periodically poll Sourcegraph’s API for a list of repositories that we should be indexing, along with the list of branches to index for each repository

  2. Some time later, check the current on-disk index state for that repository to see if it matches the state that the API says we should be at (same list of branches, same commits for those branches, etc.)

  3. If the on-disk state is different, then kick off a new indexing job. This consists of two main steps:

    1. Fetching the git data for all the branches

      git fetch origin commitA commitB ...
      
    2. Adding every file from the version of the repository at a given commit to the shard builder, generating new shards, and deleting the old shards

      
      for oldShard in shardBuilder.ListShards():
          os.rm(oldShard)

      for commit in ["commitA", "commitB", ...]:
          for file in repository.checkout(commit).getAllFiles():
              shardBuilder.Add(file)

      shardBuilder.Run()
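The "compare on-disk state against the API's desired state" check from step 2 can be sketched as follows. This is illustrative only; the type and function names here are hypothetical, not Zoekt's actual API:

```go
package main

import "fmt"

// indexState maps a branch name to the commit that is (or should be) indexed
// for it. In reality, the desired state comes from Sourcegraph's API and the
// current state comes from the metadata of the shards on disk.
type indexState map[string]string

// needsReindex reports whether the on-disk state differs from the desired
// state: a different set of branches, or a different commit for any branch.
func needsReindex(desired, indexed indexState) bool {
	if len(desired) != len(indexed) {
		return true
	}
	for branch, commit := range desired {
		if indexed[branch] != commit {
			return true
		}
	}
	return false
}

func main() {
	indexed := indexState{"main": "commitA"}
	fmt.Println(needsReindex(indexState{"main": "commitA"}, indexed)) // false: up to date
	fmt.Println(needsReindex(indexState{"main": "commitB"}, indexed)) // true: new commit on main
}
```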

Commentary

This existing workflow is simple, but there are aspects of it that aren’t ideal for monorepo indexing performance:

  1. We fetch the repository’s git data from scratch for every single indexing job. Even if the latest commit only added a single file, we’d still have to fetch all the data required to construct the entire work tree.
  2. We index every file that’s present in the version of the repository at a given commit, even if that file wasn’t modified in the commit.

One can imagine that large monorepos exacerbate both of these issues.

Delta builds

Our idea is to see if we can leverage previously computed index state so that we only do an amount of work that’s proportional to the number of files that changed in the latest commit.

To that end, we’re introducing the concept of delta builds. Unlike normal builds that throw away the previous state and re-index everything from scratch, delta builds work by stacking changes on top of the existing state.

Delta builds produce delta shards, which only contain index data from the changed files in the latest commit.

Delta builds also update the metadata in older shards to indicate that a particular version of a file is “outdated”. This is done with a marker called a file tombstone, which indicates that any search results from a shard that involve this file should be ignored.

Our belief is that indexing only the changed files in a commit should be much less expensive than re-indexing the entire repository over and over again. As a result, delta builds should make search results from the latest commits available much more quickly than normal builds. However, this scheme introduces a number of sub-problems that need to be solved.
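At a high level, a delta build boils down to two bookkeeping operations. A minimal sketch, with entirely hypothetical names (Zoekt's real shard format is considerably more involved):

```go
package main

import "fmt"

// applyDelta records a delta build: the changed files go into a new delta
// shard, and the same paths are tombstoned in the pre-existing shards so
// that their stale copies stop matching at search time.
func applyDelta(changedFiles []string, existingTombstones map[string]bool) (deltaShard []string) {
	for _, path := range changedFiles {
		existingTombstones[path] = true // the old copy is now outdated
		deltaShard = append(deltaShard, path)
	}
	return deltaShard
}

func main() {
	tombstones := map[string]bool{}
	shard := applyDelta([]string{"cmd/main.go", "README.md"}, tombstones)
	fmt.Println(shard)                     // the new delta shard's files
	fmt.Println(tombstones["cmd/main.go"]) // true: hidden in older shards
}
```

The key property is that the cost of this step is proportional to the number of changed files, not to the total size of the repository.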

Sub-problems

1. Making Zoekt’s searcher aware of FileTombstones

Zoekt’s searcher needs to be aware of a new field, FileTombstones, understand how to interpret it, and skip over files if they’re tombstoned.
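What "skip over tombstoned files" means at search time can be illustrated with a toy shard. The real field lives in Zoekt's shard metadata (see the PR below); everything here is a simplified stand-in:

```go
package main

import (
	"fmt"
	"strings"
)

// shard pairs some indexed file contents with FileTombstones metadata.
// A tombstoned path means "a newer delta shard supersedes this file;
// ignore any matches for it from this shard".
type shard struct {
	files          map[string]string   // path -> contents (toy index)
	fileTombstones map[string]struct{} // paths to skip at search time
}

// search returns the paths in this shard whose contents contain the query,
// skipping any file that has been tombstoned.
func (s *shard) search(query string) []string {
	var hits []string
	for path, contents := range s.files {
		if _, dead := s.fileTombstones[path]; dead {
			continue // superseded by a later delta shard
		}
		if strings.Contains(contents, query) {
			hits = append(hits, path)
		}
	}
	return hits
}

func main() {
	s := &shard{
		files: map[string]string{
			"old.go": "func main() {}",
			"new.go": "func main() { run() }",
		},
		fileTombstones: map[string]struct{}{"old.go": {}},
	}
	fmt.Println(s.search("main")) // old.go matches but is skipped
}
```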

Status:

This was implemented in https://github.com/sourcegraph/zoekt/pull/273. There are aspects of it that will probably need to be optimized for the initial release (see https://github.com/sourcegraph/zoekt/pull/273/files#diff-07fd8e51cdf14130405a961100cecaf6bdc47e746a4d9ea1de1c06422fc7e1fdR248-R251), but this isn’t necessary for the prototype.

2. Making Zoekt’s shard builder properly manipulate the index state for delta builds

Zoekt’s shard builder now needs to know how to do the following for delta builds:

  1. Write new delta shards that contain only the changed files
  2. Update the metadata of older shards to add file tombstones for the outdated versions of those files

Status

This was implemented in https://github.com/sourcegraph/zoekt/pull/274.

3. Instrumenting the callers of Zoekt’s shard builder for delta builds

For delta builds, Zoekt now needs to:

  1. Determine the set of files that changed between the previously indexed commits and the new ones
  2. Pass only those changed files (along with the paths to tombstone) to the shard builder

In addition, Zoekt needs to know under what circumstances it’ll need to “fall back” to a normal build. Examples:

  1. The repository hasn’t been indexed before, so there is no prior state to build on top of
  2. The list of branches to index has changed
  3. Too many delta shards have accumulated (see the “Compaction” section)
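A sketch of the kind of fallback check involved, drawing only on conditions mentioned elsewhere in this issue (the 150-shard threshold comes from the "Compaction" section; the function and parameter names are hypothetical):

```go
package main

import "fmt"

// maxDeltaShards mirrors the configurable threshold described in the
// "Compaction" section below.
const maxDeltaShards = 150

// shouldFallback reports whether a full (normal) build must run instead of
// a delta build. The conditions are illustrative, not Zoekt's exact logic.
func shouldFallback(haveExistingShards, branchesChanged bool, shardCount int) bool {
	switch {
	case !haveExistingShards: // nothing to stack a delta on top of
		return true
	case branchesChanged: // the indexed branch set no longer matches
		return true
	case shardCount >= maxDeltaShards: // too many shards accumulated
		return true
	}
	return false
}

func main() {
	fmt.Println(shouldFallback(true, false, 3))   // false: delta build is fine
	fmt.Println(shouldFallback(true, false, 200)) // true: compaction threshold hit
}
```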

Status

This is complete.

4. Error recovery

Zoekt’s normal builds delete the old state for every indexing job. This meant that if something went wrong (an error during the build process itself, a Zoekt crash, etc.), Zoekt would eventually recover, since the old state would eventually be cleared out.

Delta builds can’t delete older shards: their fundamental operation relies on stacking new changes on top of the existing state. Any corrupted state left behind by an error won’t be passively cleared out, so we’ll need some sort of mechanism to ensure that we never enter such a state in the first place.

Status

A stop-gap solution has been implemented. See the “Compaction” section for more details; the stop-gap solution for that sub-problem also addresses this one.

5. Compaction

Each new shard introduces a similar overhead in memory usage and search time, so search latency can degrade as shards accumulate. Since delta builds add new shards for each commit, this might be a problem for repositories with high commit frequency. We’ll need to introduce a process to merge delta shards into a single shard.
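Conceptually, merging would collapse a chain of delta shards into one, letting the newest version of each file win and dropping tombstoned files that were deleted outright. A toy sketch under those assumptions (a "shard" here is just a path-to-contents map, nothing like Zoekt's real shard format):

```go
package main

import "fmt"

// shard is a toy delta shard: the file versions it introduced, plus the
// paths whose older versions it superseded or deleted.
type shard struct {
	files      map[string]string // path -> contents in this delta
	tombstones []string          // paths this delta tombstoned in older shards
}

// mergeShards collapses an ordered chain of shards (oldest first) into one:
// each shard first removes the file versions it tombstoned, then contributes
// its own file versions.
func mergeShards(chain []shard) map[string]string {
	merged := map[string]string{}
	for _, s := range chain {
		for _, path := range s.tombstones {
			delete(merged, path) // drop the stale copy from older shards
		}
		for path, contents := range s.files {
			merged[path] = contents
		}
	}
	return merged
}

func main() {
	base := shard{files: map[string]string{"a.go": "v1", "b.go": "v1"}}
	delta := shard{
		files:      map[string]string{"a.go": "v2"}, // a.go changed
		tombstones: []string{"a.go", "b.go"},        // b.go was deleted
	}
	fmt.Println(mergeShards([]shard{base, delta})) // only a.go v2 survives
}
```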

Status

A stop-gap solution has been implemented. After every 150 shards (the number is configurable), we fall back to a full normal build, which deletes all of the older shards, including the delta ones.

This solution has a number of drawbacks (the fallback might happen at an inopportune time, larger repositories are disproportionately affected, etc.). However, this solution is easy to understand, gives us a free error handling mechanism (since we delete all the older shards), limits the amount of memory pressure that the delta shards can introduce, and clearly still has better overall throughput than the old strategy.

This simple strategy is in a good enough place that customers could still try this and benefit from it. However, we might need a more robust solution to the compaction problem depending on their feedback.

6. Git fetch costs

The delta build process described above doesn’t address the network cost of fetching all the git data required to construct the repository at a given commit from scratch. (In fact, as implemented, it could fetch slightly more git data than before, since we need two commits in order to calculate the git diff.)

There are a few ways to address this, but the most attractive option is probably modifying gitserver to calculate the diff itself.
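Whichever component ends up computing it, the changed-file set can be derived from `git diff --name-status <old> <new>` output. A sketch of parsing that format into "files to index" and "files to tombstone", assuming rename detection is disabled so each line is `<status>\t<path>`:

```go
package main

import (
	"fmt"
	"strings"
)

// parseNameStatus parses `git diff --name-status old new` output (with
// rename detection disabled) into the files to (re)index and the files to
// tombstone. Status letters: A = added, M = modified, D = deleted.
func parseNameStatus(out string) (index, tombstone []string) {
	for _, line := range strings.Split(strings.TrimSpace(out), "\n") {
		parts := strings.SplitN(line, "\t", 2)
		if len(parts) != 2 {
			continue // skip malformed or empty lines
		}
		status, path := parts[0], parts[1]
		switch status {
		case "A": // new file: index it, nothing to tombstone
			index = append(index, path)
		case "M": // changed file: tombstone the old copy, index the new one
			index = append(index, path)
			tombstone = append(tombstone, path)
		case "D": // deleted file: tombstone only
			tombstone = append(tombstone, path)
		}
	}
	return index, tombstone
}

func main() {
	out := "M\tcmd/main.go\nA\tdocs/new.md\nD\told.go\n"
	index, tombstone := parseNameStatus(out)
	fmt.Println(index)     // [cmd/main.go docs/new.md]
	fmt.Println(tombstone) // [cmd/main.go old.go]
}
```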

Status

Not implemented, and not necessary for the MVP.

We profiled the indexing costs for sgtest/megarepo, and the network transfer time only accounted for 1/3 of the total cost. From that, we decided that optimizing the shard build process is a better use of our time for now.

github-actions[bot] commented 2 years ago

Heads up @jjeffwarner - the "team/search-core" label was applied to this issue.

stefanhengl commented 8 months ago

This issue has been inactive for a long time. To reopen the ticket, please let us know how to reproduce the issue on latest main. For feature requests, please let us know what is still missing.