oasisprotocol / nexus

Official indexer for the Oasis Network.

Speed up full reindex #179

Open mitjat opened 2 years ago

mitjat commented 2 years ago

Try to speed up reindexing of the whole chain by sharding blocks among multiple analyzers.

First, test if the node scales well to multiple clients; just run 2 or 3 analyzers locally (each with a different block range) and eyeball speed.

Not crucial right now, but needed for a functional prod deploy; we cannot afford for our response to a hypothetical needed hotfix to be "we'll be back online in 3-4 days".
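
As a starting point, a minimal sketch of the sharding itself (a hypothetical helper, not existing indexer code): split the target height range into contiguous sub-ranges, one per analyzer.

```go
package main

import "fmt"

// blockRange is a hypothetical inclusive [From, To] height range for one analyzer.
type blockRange struct {
	From, To int64
}

// shardRange splits [from, to] into n contiguous, roughly equal sub-ranges.
func shardRange(from, to, n int64) []blockRange {
	total := to - from + 1
	shards := make([]blockRange, 0, n)
	for i := int64(0); i < n; i++ {
		lo := from + i*total/n
		hi := from + (i+1)*total/n - 1
		shards = append(shards, blockRange{From: lo, To: hi})
	}
	return shards
}

func main() {
	// E.g. split 8_000_000..8_499_999 among 4 analyzers.
	fmt.Println(shardRange(8_000_000, 8_499_999, 4))
}
```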

mitjat commented 2 years ago

This DB update statement looks order-dependent and prevents us from processing blocks out of order. Maybe we can change the query so that the second operand to +/- is derived from the tx or event only?
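
For illustration, a hedged sketch of the distinction (hypothetical table and column names, not the actual statements in analyzer/queries.go): an update whose delta comes only from the event commutes across blocks, while one whose delta is derived from the row's current value does not.

```go
// Hypothetical illustration only; not the real queries from analyzer/queries.go.
package queries

const (
	// Order-independent: the delta ($2) comes straight from the event, so
	// additions commute and blocks could be applied in any order.
	AddEscrowBalanceUpsert = `
		INSERT INTO chain.accounts (address, escrow_balance_active)
			VALUES ($1, $2)
		ON CONFLICT (address) DO UPDATE
			SET escrow_balance_active = chain.accounts.escrow_balance_active + $2`

	// Order-dependent: the amount removed is derived from the row's *current*
	// value (here, a slash by fraction $2), so the result depends on which
	// blocks have already been applied. This shape blocks out-of-order processing.
	SlashEscrowBalanceProportional = `
		UPDATE chain.accounts
			SET escrow_balance_active = escrow_balance_active - FLOOR(escrow_balance_active * $2)
			WHERE address = $1`
)
```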

aefhm commented 2 years ago

order-dependent

By that, you don't mean within the same height right? https://github.com/oasisprotocol/oasis-indexer/blob/4aa5a529147586a1316c2c5f25dd7ccc56e1ac58/analyzer/queries.go#L209-L210

Maybe we can change the query so that the second operand to +/- is derived from the tx or event only?

I think that is definitely preferred when possible. Hmm, though a balance in general is, I believe, always a fractional ownership (via shares) of the total escrowed or debonding amount.
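
For context, a minimal sketch of share-pool accounting as I understand it (my own illustration, not indexer code): an account's escrow balance is its shares times the pool's balance divided by the pool's total shares, so it is "fractional" in that sense even though all quantities are integers.

```go
package main

import (
	"fmt"
	"math/big"
)

// balanceForShares returns the tokens corresponding to `shares` of a pool
// holding `poolBalance` base units backed by `poolShares` total shares.
// Integer (truncated) division, in line with Quantity being an integer type.
func balanceForShares(shares, poolBalance, poolShares *big.Int) *big.Int {
	if poolShares.Sign() == 0 {
		return new(big.Int) // empty pool -> zero balance
	}
	out := new(big.Int).Mul(shares, poolBalance)
	return out.Quo(out, poolShares)
}

func main() {
	// 100 shares of a pool with 1_000_000 base units and 400 total shares -> 250_000.
	fmt.Println(balanceForShares(big.NewInt(100), big.NewInt(1_000_000), big.NewInt(400)))
}
```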

mitjat commented 2 years ago

Nit: The initial value of escrow_balance_active (when inserting the row in the DB for the first time) comes from NewShares, which is a Quantity (= int), not something fractional (as in a decimal number).

I meant order-dependent in the sense of the order of blocks, yes.

The more I dig into this one query, the more I uncover. Forked the discussion off into #192 so this ticket stays focused on speeding up the reindex.

My current take is that we can probably ignore this one SQL query for now because we never(?)/rarely slash so the query doesn't really get used, and the query seems buggy in its current form anyway. So let's not block indexer parallelization on fixing this one query.

aefhm commented 1 year ago

My current take is that we can probably ignore this one SQL query for now because we never(?)/rarely slash so the query doesn't really get used, and the query seems buggy in its current form anyway. So let's not block indexer parallelization on fixing this one query.

Fair enough.

Nit: The initial value of escrow_balance_active (when inserting the row in the DB for the first time) comes from NewShares, which is a Quantity (= int), not something fractional (as in a decimal number).

Uh, I think it comes from the amount?

https://github.com/oasisprotocol/oasis-indexer/blob/fdd9f4105985ca5b6466dfa787ecd6ddb3da1942/analyzer/consensus/consensus.go#L715-L730

mitjat commented 1 year ago

Parallelization works to some extent, but less well than I hoped it would.

I configured the indexer to run multiple analyzers (took some hacky work; the assumption that there is only ever one analyzer of a given type is baked in in several places); here's what I got:

| analyzers | blocks per minute |
|---|---|
| 1 | 133 |
| 2 | 211 |
| 4 | 119 and 222 (two separate runs) |

All analyzers ran in the 8_000_000 to 8_500_000 block range, consensus only. Each analyzer was assigned a custom 100k range. This was tested on the node via a kubectl port forward; the latter might have brought additional IO issues. (Edit: Definitely. See below.)

There were two reasons that I saw for the slow-down when adding 4+ analyzers:

Fetching individual blocks from the node takes longer the more analyzers there are. The eyeballed median is 0.04s for one analyzer, 0.06s for two and four analyzers. But there is lots of variance; eyeballed p90=0.1s and p99=0.3s. (Note: We fetch a lot more than just the block data for each height)
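
For reference, a hedged sketch of how a per-height fetch can be timed (hypothetical fetcher interface; the real indexer and oasis-core client APIs differ) — roughly what the "block fetched in" timings quoted later in this thread measure:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// blockFetcher stands in for the consensus gRPC client the analyzer uses.
type blockFetcher interface {
	GetBlock(ctx context.Context, height int64) ([]byte, error)
}

// timedGetBlock fetches one block and reports how long the call took.
func timedGetBlock(ctx context.Context, c blockFetcher, height int64) ([]byte, error) {
	start := time.Now()
	blk, err := c.GetBlock(ctx, height)
	fmt.Printf("block fetched in %s (height %d)\n", time.Since(start), height)
	return blk, err
}

// fakeFetcher simulates a node that responds in ~40ms.
type fakeFetcher struct{}

func (fakeFetcher) GetBlock(_ context.Context, _ int64) ([]byte, error) {
	time.Sleep(40 * time.Millisecond)
	return []byte("block"), nil
}

func main() {
	_, _ = timedGetBlock(context.Background(), fakeFetcher{}, 8_000_000)
}
```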

Experiments to follow:

mitjat commented 1 year ago

The only problematic, not-analyzable-out-of-order event (TakeEscrowEvent) from the second comment above has been made analyzable out of order by adding an additional field to the Event in https://github.com/oasisprotocol/oasis-core/pull/5016. That field will not be present for already-generated Events though, so we cannot parallelize the full reindex for now :/

FWIW, here are the numbers on parallelization in a more realistic setup, i.e. in k8s:

| analyzers | blocks per minute |
|---|---|
| 1 | 948 |
| 2 | 1363 |
| 4 | 834 |
| 8 | 416 |

For posterity, the above was obtained with (technically, counts processed blocks in the last minute of logs):

    for n in 1 2 4 8; do
      f=par${n}.3
      # Timestamp one minute before the last log line (field 1 is an RFC-3339 timestamp).
      start="$(TZ=UTC date --date "$(tail "$f" -n1 | cut -d' ' -f1) -1min" --rfc-3339=ns | tr ' ' T)"
      # Count the log lines (= processed blocks) newer than that timestamp.
      cat "$f" | awk '$1 > "'"$start"'"' | wc -l
    done

Observation: oasis-node is not great at handling parallel connections:

| analyzers | time to fetch block data with GetBlock (s) |
|---|---|
| 1 | 0.0139342 |
| 2 | 0.0201103 |
| 4 | 0.0775929 |
| 8 | 0.078754 |

(GetBlock is just one of the gRPC calls; I didn't instrument the others.)

The above was obtained from the logs with (`avg` being a small helper that averages the numbers on stdin):

    less par1.3 | cut -d' ' -f2- | grep 'block fetched in' | cut -d' ' -f4 | avg

Next steps:

mitjat commented 1 year ago

https://app.clickup.com/t/3ufk8j6

mitjat commented 1 year ago

Ran some CPU profiling. Cropped CPU flamegraph: https://user-images.githubusercontent.com/629970/203448819-c5731a14-ebf1-4553-b55b-1907c7c7df24.png

No major surprises there:

- CPU is largely idle.
- Most of the time is spent decoding CBOR (and to a lesser extent, speaking the grpc/http/postgres protocols).
- Minor surprise: About 18% of the total (CPU) time and about 35% of block-analysis time is spent on parsing the registry data. That's a lot. If these percentages transfer over to wall time, it's worth considering the tradeoffs of pulling the registry data wholesale once every N rounds or seconds.

Inspect for yourself: [indexer-cpu.log](https://github.com/oasisprotocol/oasis-indexer/files/10072093/indexer-cpu.log) (see commands below on how to visualize)

Notes:

Code to run CPU profiling, added to the topmost main.go (needs imports "fmt", "os", "os/signal", "runtime", "runtime/pprof", "syscall"):

    // Start writing a CPU profile to a file.
    f, err := os.Create("/tmp/indexer-cpu.log")
    if err != nil {
        fmt.Fprintf(os.Stderr, "Cannot instantiate profiling: %+v", err)
    }
    runtime.SetCPUProfileRate(100)
    pprof.StartCPUProfile(f)
    defer pprof.StopCPUProfile() // never runs because the indexer never finishes; the code below stops profiling on ctrl+C instead

    // Stop the profile (flushing it to disk) when the process is interrupted.
    c := make(chan os.Signal, 2)
    signal.Notify(c, os.Interrupt, syscall.SIGQUIT) // subscribe to system signals
    onKill := func(c chan os.Signal) {
        <-c
        fmt.Fprintf(os.Stderr, "BANZAIIIIIIIIIIIII")
        pprof.StopCPUProfile()
        os.Exit(0)
    }
    go onKill(c)

Ran the indexer as usual, killed it with ctrl+C. Visualized the profiling info with `go tool pprof -http=":9999" ./oasis-indexer /tmp/indexer-cpu.log`.
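
A possible alternative to the signal-handling hack (a suggestion, not what was used here): expose the standard net/http/pprof endpoints and pull a fixed-length profile from the running indexer, e.g. with `go tool pprof -http=":9999" "http://localhost:6060/debug/pprof/profile?seconds=60"`.

```go
// Sketch: serve pprof endpoints from the indexer process (not current indexer code).
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on http.DefaultServeMux
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the indexer's normal main loop
}
```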

pro-wh commented 1 year ago
  1. why crop
  2. cbor not showing up anywhere in the screenshot

pro-wh commented 1 year ago

[uncropped CPU flamegraph from the same profile]

reran the above log to post an uncropped pic :skull:

mitjat commented 1 year ago

Each round spends about this much time in IO wait:

This is not wall time, and we parallelize some oasis-node requests, so there's some speedup there. Also, this is on my local laptop with a k8s tunnel to the prod node and a local postgres. In prod, I see postgres latencies closer to 25ms. Still, same ballpark.

If we have to keep pg writes sequential (because of deadlocks when updating e.g. balances), we're looking at a max possible speed of (1000 ms)/(25 ms) = 40 blocks per second. That's very roughly (10e6 blocks) / 40 / 3600 ≈ 70 h for a full reindex, and slightly slower if we include emerald parsing. But also, I hope we can avoid creating competing db writes during the fast full rescan and do much better than that.

Steps to trace each round:

```diff
--- a/analyzer/consensus/consensus.go
+++ b/analyzer/consensus/consensus.go
@@ -160,6 +161,11 @@ func (m *Main) Start() {
 		backoff.Wait()
 		m.logger.Info("attempting block", "height", height)
+		f, perr := os.Create(fmt.Sprintf("/tmp/consensus-%d.trace", height))
+		if perr != nil {
+			panic(perr)
+		}
+		trace.Start(f)
 		if err := m.processBlock(ctx, height); err != nil {
 			if err == analyzer.ErrOutOfRange {
 				m.logger.Info("no data available; will retry",
@@ -175,6 +181,8 @@ func (m *Main) Start() {
 			backoff.Failure()
 			continue
 		}
+		trace.Stop()
+		f.Close()
 		m.logger.Info("processed block", "height", height)
 		backoff.Success()
```

Run the analyzer as usual. View a trace with `go tool trace /tmp/consensus-8092108.trace`, which opens a web page. For network blocking times, see "Network blocking profile". I didn't find much use for the traditional main trace view (first link in the interface) because it's just a bunch of white space (= waiting).

aefhm commented 1 year ago

About 18% of the total (CPU) time and about 35% of block-analysis time is spent on parsing the registry data.

Yikes. Agree that this is probably not worth optimizing right now though.
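
If it ever does become worth it, a minimal sketch of the idea floated above (re-fetch/parse the registry data only once every N heights and serve a cached copy in between; hypothetical types, not indexer code):

```go
package main

import "fmt"

// registryData stands in for whatever parsed registry state the analyzer needs.
type registryData struct {
	Height int64
}

// registryCache refreshes the registry data only once every `every` heights.
type registryCache struct {
	every  int64
	cached *registryData
	asOf   int64
	fetch  func(height int64) (*registryData, error) // the expensive fetch/parse
}

func (c *registryCache) get(height int64) (*registryData, error) {
	if c.cached == nil || height-c.asOf >= c.every {
		d, err := c.fetch(height)
		if err != nil {
			return nil, err
		}
		c.cached, c.asOf = d, height
	}
	return c.cached, nil
}

func main() {
	calls := 0
	c := &registryCache{
		every: 100,
		fetch: func(h int64) (*registryData, error) { calls++; return &registryData{Height: h}, nil },
	}
	for h := int64(8_000_000); h < 8_000_300; h++ {
		if _, err := c.get(h); err != nil {
			panic(err)
		}
	}
	fmt.Println("expensive fetches for 300 heights:", calls) // 3 instead of 300
}
```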

mitjat commented 1 year ago

Plan:

pro-wh commented 1 year ago

For token support, I want to run paratime queries during analysis; this is to get the token symbol and other data.

Edit [Mitja]: Discussed in Slack, we're leaning towards fetching all this data in a separate, out-of-band "analyzer" / data source.

mitjat commented 1 year ago

Some more pointers to myself: During reindex: