sourcegraph / sourcegraph-public-snapshot


Core Services: 3.6 Tracking Issue #4598

Closed: keegancsmith closed this issue 5 years ago

keegancsmith commented 5 years ago

Goal

99th percentile latency of the search query “print” at 42req/s with 20k repos is under 2s.

Availability

Dependencies

We will want to avoid conflicting with the work the Search team is doing in https://github.com/sourcegraph/sourcegraph/issues/4582.

Completed

Our main focus in this iteration was to load test the large-scale instance at k8s.sgdev.org and to analyze profiling data to spot bottlenecks and possible improvements.

The first major theme we found was large, needless allocations in the GraphQL layer, caused by multiple parts of the code relying on a list of all repositories. This led to out-of-memory errors and GC pressure, and it increased request latency.

Solving this issue by reducing the number of allocations and changing what we allocate took up a large part of this iteration.

After that, we focused on the next component, Zoekt, by profiling it and trying to understand its behavior under load.

We now have multiple ideas for improving Zoekt's performance, including horizontal scaling.

Won't do

Backlogged

Moved to 3.7

tsenart commented 5 years ago

Jun 26

mrnugget commented 5 years ago

July 12

Next steps after sync today:

mrnugget commented 5 years ago

I just ran benchstat on two benchmark outputs, comparing ed8eb447b46, which introduced BenchmarkIntegrationSearchResults and 008661b03b, the current state of master:

benchstat ~/ed8eb447b46.txt ~/008661b03b.txt
name                        old time/op    new time/op    delta
IntegrationSearchResults-8    27.5ms ± 6%    22.8ms ±14%  -17.22%  (p=0.016 n=5+5)

name                        old alloc/op   new alloc/op   delta
IntegrationSearchResults-8    8.13MB ± 0%    6.27MB ± 0%  -22.85%  (p=0.008 n=5+5)

name                        old allocs/op  new allocs/op  delta
IntegrationSearchResults-8      166k ± 0%      131k ± 0%  -21.08%  (p=0.008 n=5+5)

(Keep in mind, though, that this benchmark doesn't reflect some of the performance work done on the ACL layer and in #4667 & #4688, which were merged before.)
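For intuition, benchstat's delta column is (roughly) the relative change of the new mean versus the old. A minimal sketch using the rounded means printed above — benchstat itself computes the delta from the raw per-run samples, which is why it reports -17.22% rather than the -17.1% this gives:

```python
def benchstat_delta(old, new):
    """Relative change of new vs. old, as a percentage (benchstat's delta column)."""
    return (new - old) / old * 100

# Rounded means from the time/op row above: 27.5ms -> 22.8ms.
print(f"{benchstat_delta(27.5, 22.8):+.1f}%")  # prints -17.1%
```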


Edit:

If I go back to ed8eb447b46 and manually revert the changes made in #4667 and #4688 as best as I can (imperfectly: I think my manual revert leads to stack-only allocations of the removed repoResolvers, which were previously allocated on the heap), then I get this:

λ benchstat ~/ed8eb447b46-with-pointers-and-repoResolvers-reverted.txt ~/008661b03b.txt
name                        old time/op    new time/op    delta
IntegrationSearchResults-8    30.5ms ±11%    22.8ms ±14%  -25.30%  (p=0.008 n=5+5)

name                        old alloc/op   new alloc/op   delta
IntegrationSearchResults-8    9.52MB ± 0%    6.27MB ± 0%  -34.17%  (p=0.008 n=5+5)

name                        old allocs/op  new allocs/op  delta
IntegrationSearchResults-8      181k ± 0%      131k ± 0%  -27.60%  (p=0.016 n=4+5)
tsenart commented 5 years ago

This doesn't account for authzFilter because the benchmark doesn't go through that codepath. The difference would be larger if so! EDIT: Noticed you mentioned this as well :-)

tsenart commented 5 years ago

I did a 10 minute load test at 75 QPS ("print index:only") locally with all the github.com/sourcegraph org repos. Things are looking far less spiky in v3.6.0-rc.2, with a 95th latency percentile reduction of about 78% (the 99th percentile dropped by roughly 68%). So, when Zoekt isn't a bottleneck, things got much better :-)

$ vegeta report v3.5.2.json
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.031200469s, 9m59.986700822s, 44.499647ms
Latencies     [mean, 50, 95, 99, max]  188.629834ms, 68.529277ms, 693.263509ms, 1.257987087s, 1.363996658s
Bytes In      [total, mean]            1543521571, 34300.48
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:

$ vegeta report v3.6.0-rc.2.json
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.023251358s, 9m59.987078086s, 36.173272ms
Latencies     [mean, 50, 95, 99, max]  65.447395ms, 50.070781ms, 150.973882ms, 406.154921ms, 554.35399ms
Bytes In      [total, mean]            1531149906, 34025.55
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:
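The percentile improvements implied by the two reports above can be checked with a quick calculation (values copied from the Latencies lines):

```python
# Latency percentiles (ms) copied from the two vegeta reports above.
p95_before, p95_after = 693.263509, 150.973882
p99_before, p99_after = 1257.987087, 406.154921

for name, before, after in [("p95", p95_before, p95_after),
                            ("p99", p99_before, p99_after)]:
    reduction = 1 - after / before
    print(f"{name}: {reduction:.1%} lower")
# prints:
# p95: 78.2% lower
# p99: 67.7% lower
```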

[image: vegeta plot of request latencies over time]

tsenart commented 5 years ago

[image: search latency distribution]

tsenart commented 5 years ago

We ran a 10 minute load test against 3.5.2 and 3.6.0-rc.2 in a large VM with 64 cores and 120GB of RAM. Sourcegraph was run as a single Docker image and loaded with all repos from the Microsoft and Sourcegraph orgs on GitHub. An additional Bitbucket Server external service was added with authorization on, so that the authzFilter code path would be included.

After all repos finished indexing (which takes a while), we created a non-admin user and got an API token. From a separate VM, we ran two load tests, one against each version of Sourcegraph. body.json contains the GraphQL query JSON extracted from the Chrome inspector after doing a search via the UI.

echo "POST http://load-test-machine/.api/graphql" | vegeta attack -name=3.5.2 -rate=50 -duration=10m -header 'Authorization: token YOUR_TOKEN' -body=body.json | tee 3.5.2.bin | vegeta report --every=1s

After both load tests finish (requiring changing which version of Sourcegraph is running in between), we can inspect the results with vegeta report.

vegeta report v3.6.0.bin
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.023251358s, 9m59.987078086s, 36.173272ms
Latencies     [mean, 50, 95, 99, max]  65.447395ms, 50.070781ms, 150.973882ms, 406.154921ms, 554.35399ms
Bytes In      [total, mean]            1531149906, 34025.55
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:

It's also good to verify whether there are any outliers or peaks with vegeta plot.

To create that histogram, we use Python and a bit of data pre-processing.

vegeta encode -to json 3.5.2.bin | jq -c '.latency' > before.csv
vegeta encode -to json 3.6.0.bin | jq -c '.latency' > after.csv

Then, in a Python script:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)
sns.set(rc={'figure.figsize': (20, 10)})
sns.set_style("whitegrid")
sns.axes_style(style="whitegrid")

before = pd.read_csv("before.csv", header=None).divide(1e6)  # no header row; ns -> ms
after  = pd.read_csv("after.csv", header=None).divide(1e6)

sns.distplot(
    before,
    color="red",
    norm_hist=True,  # normalize both plots so the two distributions are comparable
    rug=True,
    label="3.5.2",
    axlabel="Latency (ms)"
)

sns.distplot(
    after,
    color="skyblue",
    norm_hist=True,
    rug=True,
    label="3.6.0",
    axlabel="Latency (ms)"
)

plt.legend()
plt.title("Sourcegraph indexed search latency distribution at 50 req/s with 3k repos\n", fontdict = {'fontsize' : 25})
plt.savefig("hist.png")
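As a numeric complement to the histogram, the same per-request latencies can be summarized with pandas quantiles. A self-contained sketch with synthetic values standing in for the decoded vegeta latencies (in the real workflow you'd read before.csv/after.csv as above):

```python
import pandas as pd

# Synthetic latencies in nanoseconds, one per request, standing in for
# the values decoded from a vegeta result file with `vegeta encode` + jq.
latencies_ns = pd.Series([44_000_000, 50_000_000, 150_000_000, 400_000_000])

latencies_ms = latencies_ns.divide(1e6)  # ns -> ms, matching the plot script
print(latencies_ms.quantile([0.50, 0.95, 0.99]))
```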