sourcegraph / sourcegraph-public-snapshot


Core Services: 3.6 Tracking Issue #4598

Closed: keegancsmith closed this issue 5 years ago

keegancsmith commented 5 years ago

Goal

99th percentile latency of the search query “print” at 42req/s with 20k repos is under 2s.

Availability

Dependencies

We will want to avoid conflicting with the work the Search team is doing in https://github.com/sourcegraph/sourcegraph/issues/4582.

Completed

Our main focus in this iteration was to load test the large-scale instance at k8s.sgdev.org and to analyze profiling data to spot bottlenecks and possible improvements.

The first major theme we found was large, needless allocations in the GraphQL layer, caused by multiple parts of the code relying on a list of all repositories. This led to out-of-memory errors and GC pressure, and it increased request latency.

Solving this issue by reducing the number of allocations and changing what we allocate took up a large part of this iteration.

After that, we focused on the next component, Zoekt, by profiling it and trying to understand its behavior under load.

We now have multiple ideas for improving Zoekt's performance, including horizontal scaling.

Won't do

Backlogged

Moved to 3.7

tsenart commented 5 years ago

Jun 26

mrnugget commented 5 years ago

July 12

Next steps after sync today:

mrnugget commented 5 years ago

I just ran benchstat on two benchmark outputs, comparing ed8eb447b46, which introduced BenchmarkIntegrationSearchResults and 008661b03b, the current state of master:

benchstat ~/ed8eb447b46.txt ~/008661b03b.txt
name                        old time/op    new time/op    delta
IntegrationSearchResults-8    27.5ms ± 6%    22.8ms ±14%  -17.22%  (p=0.016 n=5+5)

name                        old alloc/op   new alloc/op   delta
IntegrationSearchResults-8    8.13MB ± 0%    6.27MB ± 0%  -22.85%  (p=0.008 n=5+5)

name                        old allocs/op  new allocs/op  delta
IntegrationSearchResults-8      166k ± 0%      131k ± 0%  -21.08%  (p=0.008 n=5+5)

(Keep in mind, though, that this benchmark doesn't reflect some of the performance work done on the ACL layer and in #4667 & #4688, which were merged before.)
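For intuition, benchstat's delta column is (roughly) the relative change of the new mean versus the old. A minimal sketch using the rounded means printed above — benchstat itself computes the delta from the raw per-run samples, which is why it reports -17.22% rather than the -17.1% this gives:

```python
def benchstat_delta(old, new):
    """Relative change of new vs. old, as a percentage (benchstat's delta column)."""
    return (new - old) / old * 100

# Rounded means from the time/op row above: 27.5ms -> 22.8ms.
print(f"{benchstat_delta(27.5, 22.8):+.1f}%")  # prints -17.1%
```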


Edit:

If I go back to ed8eb447b46 and manually revert the changes made in #4667 and #4688 as best as I can (imperfectly: I think my manual revert leads to stack-only allocations of the removed repoResolvers, which were previously allocated on the heap), then I get this:

λ benchstat ~/ed8eb447b46-with-pointers-and-repoResolvers-reverted.txt ~/008661b03b.txt
name                        old time/op    new time/op    delta
IntegrationSearchResults-8    30.5ms ±11%    22.8ms ±14%  -25.30%  (p=0.008 n=5+5)

name                        old alloc/op   new alloc/op   delta
IntegrationSearchResults-8    9.52MB ± 0%    6.27MB ± 0%  -34.17%  (p=0.008 n=5+5)

name                        old allocs/op  new allocs/op  delta
IntegrationSearchResults-8      181k ± 0%      131k ± 0%  -27.60%  (p=0.016 n=4+5)
tsenart commented 5 years ago

This doesn't account for authzFilter because the benchmark doesn't go through that codepath. The difference would be larger if so! EDIT: Noticed you mentioned this as well :-)

tsenart commented 5 years ago

I did a 10 minute load test at 75 QPS ("print index:only") locally with all the github.com/sourcegraph org repos. Things are looking far less spiky in v3.6.0-rc.2, with a 95th latency percentile reduction of about 78% (the 99th percentile dropped by roughly 68%). So, when Zoekt isn't a bottleneck, things got much better :-)

$ vegeta report v3.5.2.json
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.031200469s, 9m59.986700822s, 44.499647ms
Latencies     [mean, 50, 95, 99, max]  188.629834ms, 68.529277ms, 693.263509ms, 1.257987087s, 1.363996658s
Bytes In      [total, mean]            1543521571, 34300.48
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:

$ vegeta report v3.6.0-rc.2.json
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.023251358s, 9m59.987078086s, 36.173272ms
Latencies     [mean, 50, 95, 99, max]  65.447395ms, 50.070781ms, 150.973882ms, 406.154921ms, 554.35399ms
Bytes In      [total, mean]            1531149906, 34025.55
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:
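The percentile improvements implied by the two reports above can be checked with a quick calculation (values copied from the Latencies lines):

```python
# Latency percentiles (ms) copied from the two vegeta reports above.
p95_before, p95_after = 693.263509, 150.973882
p99_before, p99_after = 1257.987087, 406.154921

for name, before, after in [("p95", p95_before, p95_after),
                            ("p99", p99_before, p99_after)]:
    reduction = 1 - after / before
    print(f"{name}: {reduction:.1%} lower")
# prints:
# p95: 78.2% lower
# p99: 67.7% lower
```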

[image: vegeta plot of request latencies over time]

tsenart commented 5 years ago

[image: search latency distribution]

tsenart commented 5 years ago

We ran a 10 minute load test against 3.5.2 and 3.6.0-rc.2 in a large VM with 64 cores and 120GB of RAM. Sourcegraph was run as a single Docker image and loaded with all repos from the Microsoft and Sourcegraph orgs on GitHub. An additional Bitbucket Server external service was added with authorization on, so that the authzFilter code path would be included.

After all repos finished indexing (which takes a while), we created a non-admin user and got an API token. From a separate VM, we ran two load tests, one against each version of Sourcegraph. body.json contains the GraphQL query JSON extracted from the Chrome inspector after doing a search via the UI.

echo "POST http://load-test-machine/.api/graphql" | vegeta attack -name=3.5.2 -rate=50 -duration=10m -header 'Authorization: token YOUR_TOKEN' -body=body.json | tee 3.5.2.bin | vegeta report --every=1s

After both load tests finish (requiring changing which version of Sourcegraph is running in between), we can inspect the results with vegeta report.

vegeta report v3.6.0.bin
Requests      [total, rate]            45000, 75.00
Duration      [total, attack, wait]    10m0.023251358s, 9m59.987078086s, 36.173272ms
Latencies     [mean, 50, 95, 99, max]  65.447395ms, 50.070781ms, 150.973882ms, 406.154921ms, 554.35399ms
Bytes In      [total, mean]            1531149906, 34025.55
Bytes Out     [total, mean]            174420000, 3876.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:45000
Error Set:

It's also good to verify whether there are any outliers or peaks with vegeta plot.

To create that histogram, we use Python and a bit of data pre-processing.

vegeta encode -to json 3.5.2.bin | jq -c '.latency' > before.csv
vegeta encode -to json 3.6.0.bin | jq -c '.latency' > after.csv

Then, in a Python script:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set(color_codes=True)
sns.set(rc={'figure.figsize': (20, 10)})
sns.set_style("whitegrid")
sns.axes_style(style="whitegrid")

before = pd.read_csv("before.csv", header=None).divide(1e6)  # no header row; ns -> ms
after  = pd.read_csv("after.csv", header=None).divide(1e6)

sns.distplot(
    before,
    color="red",
    norm_hist=True,  # normalize both plots so the two distributions are comparable
    rug=True,
    label="3.5.2",
    axlabel="Latency (ms)"
)

sns.distplot(
    after,
    color="skyblue",
    norm_hist=True,
    rug=True,
    label="3.6.0",
    axlabel="Latency (ms)"
)

plt.legend()
plt.title("Sourcegraph indexed search latency distribution at 50 req/s with 3k repos\n", fontdict = {'fontsize' : 25})
plt.savefig("hist.png")
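As a numeric complement to the histogram, the same per-request latencies can be summarized with pandas quantiles. A self-contained sketch with synthetic values standing in for the decoded vegeta latencies (in the real workflow you'd read before.csv/after.csv as above):

```python
import pandas as pd

# Synthetic latencies in nanoseconds, one per request, standing in for
# the values decoded from a vegeta result file with `vegeta encode` + jq.
latencies_ns = pd.Series([44_000_000, 50_000_000, 150_000_000, 400_000_000])

latencies_ms = latencies_ns.divide(1e6)  # ns -> ms, matching the plot script
print(latencies_ms.quantile([0.50, 0.95, 0.99]))
```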