Closed. keegancsmith closed this issue 5 years ago.

- indexed-search: POD count from 2 to 8.
- `ShardMaxMatchCount` in `*zoekt.SearchOptions` is actually 100000 when we specified `count:10000` (10k vs 100k!) in our search query (code here).
- `binary.Varint` encoding is currently being used.
Next steps after sync today:

- `Simplify` method for repo-sets: instead of simplifying in each shard, only call the shards that are in the reposet.

I just ran benchstat
on two benchmark outputs, comparing `ed8eb447b46`, which introduced `BenchmarkIntegrationSearchResults`, and `008661b03b`, the current state of master:
```
benchstat ~/ed8eb447b46.txt ~/008661b03b.txt
name                        old time/op    new time/op    delta
IntegrationSearchResults-8    27.5ms ± 6%    22.8ms ±14%  -17.22%  (p=0.016 n=5+5)

name                        old alloc/op   new alloc/op   delta
IntegrationSearchResults-8    8.13MB ± 0%    6.27MB ± 0%  -22.85%  (p=0.008 n=5+5)

name                        old allocs/op  new allocs/op  delta
IntegrationSearchResults-8      166k ± 0%      131k ± 0%  -21.08%  (p=0.008 n=5+5)
```
(Keep in mind, though, that this benchmark doesn't reflect some of the performance work done on the ACL layer and in #4667 & #4688, which were merged earlier.)
Edit: If I go back to `ed8eb447b46` and manually revert the changes made in #4667 and #4688 as best as I can (imperfectly, because I think my manual revert leads to stack-only allocations of the removed `repoResolvers`, which were previously allocated on the heap), then I get this:
```
λ benchstat ~/ed8eb447b46-with-pointers-and-repoResolvers-reverted.txt ~/008661b03b.txt
name                        old time/op    new time/op    delta
IntegrationSearchResults-8    30.5ms ±11%    22.8ms ±14%  -25.30%  (p=0.008 n=5+5)

name                        old alloc/op   new alloc/op   delta
IntegrationSearchResults-8    9.52MB ± 0%    6.27MB ± 0%  -34.17%  (p=0.008 n=5+5)

name                        old allocs/op  new allocs/op  delta
IntegrationSearchResults-8      181k ± 0%      131k ± 0%  -27.60%  (p=0.016 n=4+5)
```
This doesn't account for `authzFilter`, because the benchmark doesn't go through that code path; the difference would be even larger if it did. EDIT: Noticed you mentioned this as well :-)
I did a 10-minute load test at 75 QPS ("print index:only") locally with all the github.com/sourcegraph org repos. Things look far less spiky in v3.6.0-rc.2, with a 99th-percentile latency reduction of about 78%. So, when Zoekt isn't the bottleneck, things got much better :-)
```
$ vegeta report v3.5.2.json
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.031200469s, 9m59.986700822s, 44.499647ms
Latencies     [mean, 50, 95, 99, max]  188.629834ms, 68.529277ms, 693.263509ms, 1.257987087s, 1.363996658s
Bytes In      [total, mean]            1543521571, 34300.48
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:
```
```
$ vegeta report v3.6.0-rc.2.json
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.023251358s, 9m59.987078086s, 36.173272ms
Latencies     [mean, 50, 95, 99, max]  65.447395ms, 50.070781ms, 150.973882ms, 406.154921ms, 554.35399ms
Bytes In      [total, mean]            1531149906, 34025.55
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:
```
We ran a 10-minute load test against 3.5.2 and 3.6.0-rc.2 on a large VM with 64 cores and 120GB of RAM. Sourcegraph was run as a single Docker image and loaded with all repos from the Microsoft and Sourcegraph orgs on GitHub. An additional Bitbucket Server external service was added with authorization enabled so that the `authzFilter` code path would be included.

After all repos finished indexing (which takes a while), we created a non-admin user and got an API token. From a separate VM, we ran two load tests, one against each version of Sourcegraph. `body.json` contains the GraphQL query JSON extracted from the Chrome inspector after doing a search via the UI.
```shell
echo "POST http://load-test-machine/.api/graphql" | vegeta attack -name=3.5.2 -rate=50 -duration=10m -header 'Authorization: token YOUR_TOKEN' -body=body.json | tee 3.5.2.bin | vegeta report --every=1s
```
After both load tests finish (changing which version of Sourcegraph is running in between), we can inspect the results with `vegeta report`:
```
vegeta report v3.6.0.bin
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.023251358s, 9m59.987078086s, 36.173272ms
Latencies     [mean, 50, 95, 99, max]  65.447395ms, 50.070781ms, 150.973882ms, 406.154921ms, 554.35399ms
Bytes In      [total, mean]            1531149906, 34025.55
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:
```
It's also good to check for outliers and peaks with `vegeta plot`.

To create that histogram, we use Python and a bit of pre-processing of the data:
```shell
vegeta encode -to json 3.5.2.bin | jq -c '.latency' > before.csv
vegeta encode -to json 3.6.0.bin | jq -c '.latency' > after.csv
```
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)
sns.set(rc={'figure.figsize': (20, 10)})
sns.set_style("whitegrid")

# The files contain one latency per line in nanoseconds and no header row,
# so read them headerless and convert to milliseconds.
before = pd.read_csv("before.csv", header=None)[0].divide(1e6)
after = pd.read_csv("after.csv", header=None)[0].divide(1e6)

# Normalize both histograms so the two runs are comparable
# even if their sample counts differ.
sns.distplot(
    before,
    color="red",
    norm_hist=True,
    rug=True,
    label="3.5.2",
    axlabel="Latency (ms)"
)
sns.distplot(
    after,
    color="skyblue",
    norm_hist=True,
    rug=True,
    label="3.6.0",
    axlabel="Latency (ms)"
)

plt.legend()
plt.title("Sourcegraph indexed search latency distribution at 50 req/s with 3k repos\n", fontdict={'fontsize': 25})
plt.savefig("hist.png")
```
### Goal

99th-percentile latency of the search query “print” at 42 req/s with 20k repos is under 2s.

### Availability

### Dependencies

We will want to avoid conflicting with the work the Search team is doing: https://github.com/sourcegraph/sourcegraph/issues/4582
### Completed
Our main focus in this iteration was to load test the large-scale instance at k8s.sgdev.org and to analyze profiling data to spot bottlenecks and possible improvements.
The first major theme we found was large and needless allocations in the GraphQL layer, due to multiple parts of the code relying on a list of all repositories. That caused out-of-memory errors and GC pressure, and resulted in increased request latency.
Solving this issue by reducing the number of allocations and changing what we allocate took up a large part of this iteration.
After that, we focused on the next component, Zoekt, by profiling it and trying to understand its behavior under load.
We now have multiple ideas for improving Zoekt's performance, including horizontal scaling.
- `(r *searchResolver) doResults` method in the GraphQL layer.
- `*searchbackend.Zoekt.DisableCache = true` in `BenchmarkSearchResults`.
- `backend.ListRepoIDs` method to be used in `resolveRepositories`. This could use a roaring bitset to encode the IDs in a compressed way. Update after investigation on June 28th: we cannot use only the repository IDs, because a lot of places in the search code path use the repository name, including `gitserver`.
- In `doResults` the repository IDs need to be turned into repositories ("hydrated"). This requires a `backend.LoadByIDs`. Update after investigation on June 28th: the easier and "dumb" thing to do (because it includes N+1 queries) would be to load "minimal repositories" with only the `ID`, `Name` and `ExternalRepo` fields. Whenever a method on `*repositoryResolver` that's not ID, Name, etc. is called, we load the information from the database and fill up `*repositoryResolver.Repo`.
- Remove the `repoupdater` migrations now that they've been in two versions of Sourcegraph (#4886).

### Won't do
- `zoekt/query.Q`

### Backlogged

- `gitserver.Client` (see comment)
- `repoupdater.Client.do` method to use the same interface as the `bitbucketserver.Client.send` method (see comment here)

### Moved to 3.7