status-im / nimbus-eth2

Nim implementation of the Ethereum Beacon Chain
https://nimbus.guide

Pruning state after finalization is too slow #2440

Closed by mratsim 3 years ago

mratsim commented 3 years ago

Moving pruning to slot end was implemented in #2384 and #2422, closing #2381.

On a high-end NVMe SSD (screenshot attached):

It takes 220 milliseconds to prune after finalization even though only very few states should have been accumulated.

tersec commented 3 years ago

delState now does two SQL deletes, to handle both full states and states without immutable validator information: https://github.com/status-im/nimbus-eth2/blob/baa43ae7f7f58807f2bf5308f8ceb420aae9d3fe/beacon_chain/beacon_chain_db.nim#L400-L403

It could conceptually be reduced to a single delete readily, though that would require refactoring some abstractions.

A better approach would involve explicitly handling transactions, rather than the implicit, per-SQL-statement (per-state-delete x2) transactions currently used.

It'd be interesting to see a sort of null-transaction benchmark: small k/v store objects deleted en masse, with and without separate per-SQL-statement (implicit) transactions. If that's not the problem, then neither of the above changes is worth pursuing for this issue's sake.
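
A rough sketch of what such a null-transaction benchmark could look like (hypothetical, standalone Python script; it builds its own scratch kvstore-style table with made-up small blobs rather than touching a real nbc database):

#!/usr/bin/env python3
# Hypothetical null-transaction benchmark: delete many small k/v rows,
# once relying on SQLite's implicit per-statement transactions and once
# inside a single explicit transaction, then compare wall-clock time.
import os, sqlite3, time

def populate(path, rows=10000, blob_size=4096):
    if os.path.exists(path):
        os.remove(path)
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE kvstore (key BLOB PRIMARY KEY, value BLOB)")
    with db:
        db.executemany("INSERT INTO kvstore VALUES (?, ?)",
                       ((i.to_bytes(8, 'big'), os.urandom(blob_size))
                        for i in range(rows)))
    db.close()

def delete_all(path, explicit_txn):
    # autocommit connection, so transaction boundaries are exactly what we issue
    db = sqlite3.connect(path, isolation_level=None)
    keys = [row[0] for row in db.execute("SELECT key FROM kvstore")]
    start = time.perf_counter()
    if explicit_txn:
        db.execute("BEGIN")
    for key in keys:
        db.execute("DELETE FROM kvstore WHERE key = ?", (key,))
    if explicit_txn:
        db.execute("COMMIT")
    elapsed = time.perf_counter() - start
    db.close()
    return elapsed

for explicit in (False, True):
    populate("null_bench.sqlite3")
    label = "explicit txn" if explicit else "implicit txns"
    print(label, "%.3fs" % delete_all("null_bench.sqlite3", explicit))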

arnetheduck commented 3 years ago

played with this script on a mainnet database copy:

echo -e 'select length(value) from kvstore limit 1;\n.timer on\nDELETE FROM kvstore WHERE key = (SELECT key FROM kvstore LIMIT 1 );' | sqlite3 nbc.sqlite3 >> timings.txt

it deletes rows one at a time and prints how long it took - in a for loop it looks like for i in {0..100000}; do sh test.sh ; done

then massage it into single rows:

sed 'N;s/\n/ /' timings.txt | sed "s/Run Time: real //" | sed "s/user //" | sed "s/sys //" > timings2.csv

then it can be opened in openoffice calc (I wanted to excel here!):

Screenshot from 2021-03-23 08-45-21

random big blobs (columns: blob size in bytes, then real / user / sys time in seconds):

9631747 0.328 0.042149 0.076036
8613696 0.074 0.022290 0.033821
10047540 0.084 0.019399 0.050170
3088841 0.222 0.027514 0.062269
6019100 0.323 0.027789 0.083814
8741445 0.078 0.018163 0.043314
3471708 0.192 0.020461 0.059802
8889780 0.512 0.041763 0.118680
6196252 0.272 0.031240 0.086068
8074031 0.067 0.016016 0.038331
10011870 0.291 0.029829 0.084682

random small blobs (same columns):

17663 0.012 0.000747 0.000467
3484 0.007 0.000948 0.000000
1912 0.008 0.001126 0.000021
15669 0.008 0.000007 0.001080
4663 0.008 0.000993 0.000000
2410 0.008 0.000000 0.000917
2104 0.008 0.000519 0.000534
5552 0.008 0.000436 0.000579
4789 0.008 0.000078 0.000964
4139 0.008 0.000564 0.000423
4180 0.009 0.000658 0.000494
6777 0.010 0.000825 0.000000
3977 0.008 0.000498 0.000000
1864 0.007 0.000231 0.000116
14452 0.007 0.000000 0.000438
4836 0.007 0.000000 0.000558
1638 0.008 0.000000 0.000954
4389 0.008 0.000670 0.000269
2727 0.008 0.000038 0.001026
4923 0.008 0.000731 0.000367
3920 0.009 0.000052 0.000946
2515 0.008 0.000000 0.000981
8640 0.008 0.001005 0.000167
5260 0.008 0.000485 0.000633
3982 0.007 0.000930 0.000000
16369 0.008 0.001102 0.000025
1969 0.007 0.000000 0.000918
2943 0.009 0.000000 0.001265
2271 0.009 0.000098 0.001015
3669 0.007 0.000000 0.000882
6915 0.007 0.000000 0.000967
14921 0.008 0.001046 0.000000
2246 0.007 0.000133 0.000790
2640 0.008 0.001119 0.000073

This shows that state size and the time it takes to delete it are only loosely correlated - instead, one can see how the deletion time sometimes just spikes, presumably due to btree rebalancing or whatever.

Notably though, the very small blobs don't suffer from this: in the bottom left of the chart there's a whole bunch of small deletes that are all fast.

tersec commented 3 years ago

That, together with my experiments using transactions to group the entire state pruning process, usefully excludes a couple of potential methods.

One question is whether doing all the deletions at once helps (via some delete-flag column, DELETE FROM foo WHERE key = foo OR key = bar OR key = baz, or some other method). It'd presumably hit the rebalancing spikes with near-certainty, but also only do so once, in a bounded way, keeping the whole state pruning operation time-bounded.
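
One hedged way to express that kind of batched delete (table layout as in kvstore above; the function and variable names here are purely illustrative) is a single statement with an IN list over all keys collected during a pruning pass:

#!/usr/bin/env python3
# Illustrative batched delete: remove a whole batch of pruned-state keys with
# one DELETE statement (one implicit transaction, one rebalancing pass) instead
# of issuing one DELETE per key.
import sqlite3

def delete_keys(db, keys):
    if not keys:
        return 0
    # for very large batches, SQLite's host-parameter limit would require chunking
    placeholders = ",".join("?" for _ in keys)
    cur = db.execute(
        "DELETE FROM kvstore WHERE key IN (%s)" % placeholders, keys)
    return cur.rowcount  # rows actually deleted

# usage sketch: collect keys while pruning, then delete them all at once
# db = sqlite3.connect("nbc.sqlite3")
# deleted = delete_keys(db, pruned_state_keys)
# db.commit()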

mratsim commented 3 years ago

SQLite has a detailed write up on their rebalancing algorithm: https://sqlite.org/btreemodule.html#btree_balancing_algorithm

It used to have an async mode, but note that it's now deprecated in favor of WAL mode: https://sqlite.org/asyncvfs.html

Regarding deleting in a single transaction, I guess we need to benchmark. Such a benchmark would also be useful for evaluating potential new DB options in the future.

arnetheduck commented 3 years ago

delState will not delete anything most of the time btw, since we only write doubled states every 64 epochs - most of the time one of the deletes is a noop, so transactions won't help too much

tersec commented 3 years ago

I used a patch something like test_transaction_patch.txt.

Really basic, just enough to pass BEGIN and COMMIT to SQLite (ROLLBACK was beyond the scope of this experiment), and too sloppy to actually merge into nbc as-is, but it could be cleaned up. But first I'd want evidence it helped, and in my measurements it just didn't.

The transactions were across the entire state delete operation though, not just a workaround for the double-state-delete due to two kinds of states being sometimes stored.
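
For reference, the experiment roughly amounted to the following shape at the SQL level (a sketch only, not the actual Nim patch; the table and names are illustrative):

#!/usr/bin/env python3
# Rough shape of the experiment: wrap an entire pruning pass in one explicit
# transaction so the per-statement implicit commits collapse into a single one.
# (ROLLBACK handling omitted, as in the experiment.)
import sqlite3

def prune_states(db_path, keys_to_delete):
    db = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
    db.execute("BEGIN")
    for key in keys_to_delete:
        db.execute("DELETE FROM kvstore WHERE key = ?", (key,))
    db.execute("COMMIT")
    db.close()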

tersec commented 3 years ago

I timed aggregate deletions of 48,000 k/v blobs from mainnet in 1 through 10-sized chunks, except 7 and 9 (not divisible). This is with small blobs in there too, so until I filter out those, this isn't going to be representative of improvement potential for state pruning, but it's indicative of at least some advantage:

test_script_delete_01.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real    2m19.255s
user    0m11.103s
sys 0m30.389s

test_script_delete_02.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real    2m8.748s
user    0m9.185s
sys     0m28.654s

test_script_delete_03.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real    1m57.218s
user    0m7.926s
sys     0m25.678s

test_script_delete_04.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real    1m52.099s
user    0m7.519s
sys     0m24.540s

test_script_delete_05.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real    1m48.823s                                                               
user    0m7.250s
sys     0m24.519s

test_script_delete_06.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real    1m44.653s
user    0m6.988s
sys     0m23.244s

test_script_delete_08.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real    1m43.271s
user    0m6.427s
sys     0m23.562s

test_script_delete_10.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real    1m41.181s
user    0m6.687s
sys     0m24.037s

test_script_delete_n.txt benchmarks deleting n blobs at a time, enough to delete exactly 48,000 blobs in total.

I don't know what a typical number of deleted states per pruning is, but it might also be the case that the large latency spikes only occur when there are quite a few states, so even advantages that require deleting 10+ states at once might be worthwhile.

These were all auto-generated:

#!/usr/bin/env python3
from sys import argv

total_deletes = 48000 # highly factorizable
kvs_per_delete = int(argv[1])
num_delete_statements = total_deletes // kvs_per_delete

# Using non-factors creates slightly different total deletes
assert total_deletes % kvs_per_delete == 0
assert kvs_per_delete * num_delete_statements == total_deletes

for _ in range(num_delete_statements):
    print('DELETE FROM kvstore WHERE key = (SELECT key FROM kvstore LIMIT %d);'%kvs_per_delete)

This also avoids focusing on the per-delete performance initially. If the version of this with only states and/or other large state-like blobs shows potential, the more detailed performance characteristics of that can be investigated.

https://sqlite.org/np1queryprob.html claims that it's not important to avoid lots of small queries, but all its examples are SELECTs, so I don't know how relevant that is here.

tersec commented 3 years ago

Repeating this with only blobs >= 2,000,000 bytes, using the approximate threshold from @arnetheduck's chart, I removed all the sub-2MB blobs from kvstore, leaving 397 total. 397 is, unfortunately, prime, so using 396 of those large blobs:

1 per DELETE: 101.2s
2 per DELETE: 48.4s
3 per DELETE: 30.9s
4 per DELETE: 23.4s
6 per DELETE: 15.8s
11 per DELETE: 10.7s

The total numbers of DELETE statements required for each is [396, 198, 132, 99, 66, 36] respectively.

Thus it averages roughly 3.9 DELETE statements per second (1 per delete), 4.1 (2 per delete), 4.3 (3), 4.2 (4), 4.2 (6), and 3.4 (11), i.e. about 0.24-0.30s per statement throughout. The total time scales almost solely with how many DELETE FROM statements run, not with how many objects get deleted.
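
For reference, a quick way to reproduce those per-statement figures from the totals and statement counts above (pure arithmetic on the numbers already listed, no database involved):

#!/usr/bin/env python3
# Reproduce the throughput figures: 396 large blobs deleted in total,
# total wall time divided over the number of DELETE statements issued.
totals = {1: 101.2, 2: 48.4, 3: 30.9, 4: 23.4, 6: 15.8, 11: 10.7}
for per_delete, seconds in totals.items():
    statements = 396 // per_delete
    print("%2d per DELETE: %3d statements, %.1f DELETEs/s, %.3f s/statement"
          % (per_delete, statements, statements / seconds, seconds / statements))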

This points to the best improvement being to collect the keys during state pruning, batch them, and then call exactly one DELETE FROM statement on the underlying database. If one wants to avoid memory risks and/or support a quasi-streaming mode, batching them at least helps.

That was a nearly ideal batch-btree-rebalancing order, so to test a more realistic scenario, I added ORDER BY RANDOM() to the queries:

#!/usr/bin/env python3
from sys import argv

total_deletes = 396 
kvs_per_delete = int(argv[1])
num_delete_statements = total_deletes // kvs_per_delete

# Using non-factors creates slightly different total deletes
assert total_deletes % kvs_per_delete == 0
assert kvs_per_delete * num_delete_statements == total_deletes

for _ in range(num_delete_statements):
    print('DELETE FROM kvstore WHERE key = (SELECT key FROM kvstore ORDER BY RANDOM() LIMIT %d);'%kvs_per_delete)

It still helps, though not quite as dramatically: the 1-per, 2-per, 3-per, 4-per, 6-per, and 11-per total times (in seconds) for all 396 items are:

[108.7, 82.1, 65.0, 55.6, 40.2, 25.7]

Which translates into (same [1-per, 2-per, 3-per, 4-per, 6-per, 11-per] ordering):

[0.274s, 0.415s, 0.492s, 0.562s, 0.609s, 0.714s]

per DELETE FROM, on average, which means the individual statements got slower. It still scales helpfully, so the cost per deleted blob drops significantly even with random key orders. This still seems like the most useful immediate amelioration, and worthwhile.

arnetheduck commented 3 years ago

A key thing to keep in mind though is that we don't really care that much about absolute time to delete all states - rather we want to avoid spikes where one delete takes a long time - so for the multi-deletes, it would be important to see what the max time of the query is.

tersec commented 3 years ago

https://github.com/status-im/nimbus-eth2/blob/a727fc8e52da6ea007187bd235bebdcbb528789f/beacon_chain/consensus_object_pools/blockchain_dag.nim#L936-L952

operates as a single, uninterrupted loop. The timing that matters here is the max of the sum of the times of all the delState()s in the loop, not the max individual time (at least for the single-delete case). If it deletes two states, it's better that one is instant and the other takes 0.1s than that both take 0.075s.

New methodology, to more precisely characterize this:

#!/usr/bin/env python3
from sys import argv

total_deletes = 240
kvs_per_delete = int(argv[1])
num_delete_statements = total_deletes // kvs_per_delete

# Using non-factors creates slightly different total deletes
assert total_deletes % kvs_per_delete == 0
assert kvs_per_delete * num_delete_statements == total_deletes

print('.timer on')
for _ in range(num_delete_statements):
    print('DELETE FROM kvstore WHERE key = (SELECT key FROM kvstore ORDER BY RANDOM() LIMIT %d);'%kvs_per_delete)

This adds .timer on and only deletes 240 blobs, which is both more divisible than 396 and avoids odd timing quirks that affect the mean when deleting the last few blobs, a case unrepresentative of actual state pruning.

Then run:

rm test_script_delete_*.txt -v && for i in 01 02 03 04 05 06 08 10 12; do ./gen_test_script.py $i > test_script_delete_$i.txt; done

to generate the test scripts.

Run

time for s in test_script_delete_*.txt; do echo $s && cp -v full_nbc_largeblobs.sqlite3 nbc_largeblobs.sqlite3 && time sqlite3 nbc_largeblobs.sqlite3 < $s | cut -d" " -f4 > timings_$s; done

to benchmark each multi-delete amount.

And finally, to get summary results of real time (seconds) spent per DELETE FROM query:

$ # columns are mean, median, p90, p95, max
$ for t in timings_test_script_delete_*.txt; do echo $t && datamash -R3 mean 1 perc:50 1 perc:90 1 perc:95 1 max 1 < $t && echo; done
timings_test_script_delete_01.txt
0.430   0.368   1.031   1.264   2.325

timings_test_script_delete_02.txt
0.576   0.498   1.133   1.370   1.785

timings_test_script_delete_03.txt
0.604   0.501   1.137   1.277   1.848

timings_test_script_delete_04.txt
0.730   0.610   1.420   1.460   2.061

timings_test_script_delete_05.txt
0.805   0.699   1.610   1.740   1.922

timings_test_script_delete_06.txt
0.735   0.614   1.415   1.509   1.815

timings_test_script_delete_08.txt
0.795   0.699   1.478   1.514   2.141

timings_test_script_delete_10.txt
0.883   0.887   1.404   1.457   2.101

timings_test_script_delete_12.txt
0.888   0.868   1.546   1.599   1.612

The median and mean do increase, because more of the cases hit whatever the threshold is for expensive deletes. However, the maximum latency spikes and the p90 and p95 tail latencies don't change in any obviously systematic way, but rather fluctuate with the randomness of the key deletion order between runs.
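
In case datamash isn't at hand, a small Python equivalent of that summary (hedged: its percentile interpolation may differ slightly from datamash's):

#!/usr/bin/env python3
# Summarize one timings file (one per-query real time per line) as
# mean, median, p90, p95, max - roughly matching the datamash call above.
import statistics, sys

times = [float(line.split()[0]) for line in open(sys.argv[1]) if line.strip()]
q = statistics.quantiles(times, n=100, method='inclusive')  # q[89]=p90, q[94]=p95
print("%.3f\t%.3f\t%.3f\t%.3f\t%.3f" % (
    statistics.mean(times), statistics.median(times), q[89], q[94], max(times)))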

In this set of results, the corresponding overall times to delete all 240 keys chosen are:

test_script_delete_01.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    1m43.185s
user    0m4.671s
sys 0m21.050s
test_script_delete_02.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    1m9.165s
user    0m2.482s
sys 0m12.448s
test_script_delete_03.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    0m48.419s
user    0m1.775s
sys 0m8.711s
test_script_delete_04.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    0m43.863s
user    0m1.490s
sys 0m7.732s
test_script_delete_05.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    0m38.665s
user    0m1.007s
sys 0m7.495s
test_script_delete_06.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    0m29.427s
user    0m0.990s
sys 0m4.991s
test_script_delete_08.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    0m23.860s
user    0m0.700s
sys 0m4.114s
test_script_delete_10.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    0m21.222s
user    0m0.563s
sys 0m3.752s
test_script_delete_12.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'

real    0m17.800s
user    0m0.523s
sys 0m3.183s
arnetheduck commented 3 years ago

State pruning timings on raspi4, courtesy of @jclapis

jcrtp-pruning-sqlite.txt

arnetheduck commented 3 years ago

Following up on @zah's lead on sqlite keeping separate btrees per table, here's what happens with large objects when they're put in a table of their own:

sqlite> create table large as select key, value from kvstore where length(value) > 65536;
sqlite> select count(*) from large;
1314

It actually looks really good - the times are much tighter for a pyrmont test - in the graph, there are the big old full states (~11 MB) and a bunch of new no-validators states (~2 MB), and both are drastically easier to get rid of - close to raw filesystem performance

Screenshot from 2021-04-06 12-56-18

Blue dots are removals of large blobs from the separate table; red dots are removals of the same blobs from the shared table.
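
A minimal sketch of how that split could be applied to an existing database (same 65536-byte cutoff as the experiment above; the actual schema and migration used in the nimbus-eth2 PRs may differ):

#!/usr/bin/env python3
# Sketch: move large blobs out of kvstore into their own table so that deleting
# them only has to touch that table's (much smaller) btree.
import sqlite3

THRESHOLD = 65536  # same cutoff used in the experiment above

db = sqlite3.connect("nbc.sqlite3")
with db:  # one transaction for the whole migration
    db.execute("CREATE TABLE IF NOT EXISTS large (key BLOB PRIMARY KEY, value BLOB)")
    db.execute("INSERT OR IGNORE INTO large SELECT key, value FROM kvstore"
               " WHERE length(value) > ?", (THRESHOLD,))
    db.execute("DELETE FROM kvstore WHERE length(value) > ?", (THRESHOLD,))
db.execute("VACUUM")  # optionally reclaim the freed pages (can be slow)
db.close()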

zah commented 3 years ago

With these results, I think the UX simplicity of managing a single database file, the ability to use transactions, and the lack of reliance on file system characteristics weigh in favor of using SQLite.

arnetheduck commented 3 years ago

Pyrmont, unstable small - averages around 21s:

{"lvl":"DBG","ts":"2021-04-05 14:02:59.839+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"10s690ms836us718ns","epochRefPruningDur":"3us800ns"}
{"lvl":"DBG","ts":"2021-04-05 14:09:01.003+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"23s62ms612us704ns","epochRefPruningDur":"3us290ns"}
{"lvl":"DBG","ts":"2021-04-05 14:15:31.969+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"24s65ms401us189ns","epochRefPruningDur":"3us210ns"}
{"lvl":"DBG","ts":"2021-04-05 14:21:37.969+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"20s290ms722us828ns","epochRefPruningDur":"2us850ns"}
{"lvl":"DBG","ts":"2021-04-05 14:28:15.526+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"22s322ms591us959ns","epochRefPruningDur":"3us60ns"}
{"lvl":"DBG","ts":"2021-04-05 14:34:19.612+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"20s239ms460us572ns","epochRefPruningDur":"2us330ns"}
{"lvl":"DBG","ts":"2021-04-05 14:40:35.292+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"4us790ns","epochRefPruningDur":"1us190ns"}
{"lvl":"DBG","ts":"2021-04-05 14:48:10.974+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"26s754ms229us548ns","epochRefPruningDur":"2us550ns"}
{"lvl":"DBG","ts":"2021-04-05 14:54:42.249+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"19s466ms272us376ns","epochRefPruningDur":"3us20ns"}
{"lvl":"DBG","ts":"2021-04-05 15:00:06.560+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"21s691ms90us719ns","epochRefPruningDur":"2us470ns"}

Post https://github.com/status-im/nimbus-eth2/pull/2460, using a separate table, the average is around 16s, which certainly is an improvement:

{"lvl":"DBG","ts":"2021-04-08 07:06:40.706+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"19s457ms836us357ns","epochRefPruningDur":"2us640ns"}
{"lvl":"DBG","ts":"2021-04-08 07:12:51.741+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"23s283ms204us190ns","epochRefPruningDur":"2us0ns"}
{"lvl":"DBG","ts":"2021-04-08 07:19:12.031+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"13s559ms195us501ns","epochRefPruningDur":"2us210ns"}
{"lvl":"DBG","ts":"2021-04-08 07:25:36.958+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"17s171ms297us438ns","epochRefPruningDur":"2us390ns"}
{"lvl":"DBG","ts":"2021-04-08 07:31:38.449+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"5us910ns","epochRefPruningDur":"1us60ns"}
{"lvl":"DBG","ts":"2021-04-08 07:38:20.214+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"17s729ms162us636ns","epochRefPruningDur":"2us110ns"}
{"lvl":"DBG","ts":"2021-04-08 07:45:04.142+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"16s29ms125us99ns","epochRefPruningDur":"2us110ns"}
{"lvl":"DBG","ts":"2021-04-08 07:52:30.557+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"21s426ms22us84ns","epochRefPruningDur":"2us890ns"}
{"lvl":"DBG","ts":"2021-04-08 07:57:28.643+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"7s989ms811us570ns","epochRefPruningDur":"2us540ns"}
zah commented 3 years ago

Still quite slow. Now that the files version is behind a flag, we can enable it on these servers for a while to gather statistics.

arnetheduck commented 3 years ago

Resolved with https://github.com/status-im/nimbus-eth2/pull/2560 - migrating existing data to the new layout will further improve performance