`delState` now does two SQL deletes, to handle both full states and states without immutable validator information:

https://github.com/status-im/nimbus-eth2/blob/baa43ae7f7f58807f2bf5308f8ceb420aae9d3fe/beacon_chain/beacon_chain_db.nim#L400-L403

It'd require refactoring some abstractions, but conceptually that could readily be reduced to a single delete.
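For illustration, the single-statement form could look something like this - a minimal sketch using Python's sqlite3, where the two key arguments are hypothetical stand-ins for however the full-state and no-immutable-validators keys are actually encoded:

```python
import sqlite3

def del_state(conn: sqlite3.Connection,
              full_state_key: bytes,
              partial_state_key: bytes) -> None:
    # One statement covering both state variants; whichever key is
    # absent simply matches nothing, as with the current two deletes.
    conn.execute(
        "DELETE FROM kvstore WHERE key IN (?, ?);",
        (full_state_key, partial_state_key),
    )
```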
A better approach would involve explicitly handling transactions, rather than the implicit per-SQL-statement (per-state-delete, ×2) transactions currently used.

It'd be interesting to see a sort of null-transaction benchmark: small k/v store objects simply being deleted en masse, with and without separate per-SQL-statement (implicit) transactions. If that's not the problem, then neither of the above issues is worth pursuing for this issue's sake.
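As a sketch of what such a null benchmark could look like (Python's sqlite3 against a throwaway copy of the database; the kvstore schema is assumed, and nothing here is a measurement from this issue):

```python
import sqlite3
import time

def time_deletes(db_path: str, keys: list[bytes], explicit_txn: bool) -> float:
    """Delete `keys` one statement at a time, either each in its own
    implicit transaction or all inside one explicit transaction."""
    conn = sqlite3.connect(db_path, isolation_level=None)  # autocommit mode
    start = time.perf_counter()
    if explicit_txn:
        conn.execute("BEGIN;")
    for key in keys:
        conn.execute("DELETE FROM kvstore WHERE key = ?;", (key,))
    if explicit_txn:
        conn.execute("COMMIT;")
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed
```

Comparing the two `explicit_txn` settings on identical database copies would show how much of the cost is per-statement transaction overhead versus the deletes themselves.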
Played with this script on a mainnet database copy:

```sh
echo -e 'select length(value) from kvstore limit 1;\n.timer on\nDELETE FROM kvstore WHERE key = (SELECT key FROM kvstore LIMIT 1);' | sqlite3 nbc.sqlite3 >> timings.txt
```

It deletes rows one at a time and prints how long it took - in a for loop it looks like `for i in {0..100000}; do sh test.sh; done`.

Then massage it into single rows:

```sh
sed 'N;s/\n/ /' timings.txt | sed "s/Run Time: real //" | sed "s/user //" | sed "s/sys //" > timings2.csv
```
Then it can be opened in OpenOffice Calc (I wanted to excel here!):
Random big blobs (columns: blob size in bytes, then real/user/sys seconds per delete):

```
bytes     real   user      sys
9631747   0.328  0.042149  0.076036
8613696   0.074  0.022290  0.033821
10047540  0.084  0.019399  0.050170
3088841   0.222  0.027514  0.062269
6019100   0.323  0.027789  0.083814
8741445   0.078  0.018163  0.043314
3471708   0.192  0.020461  0.059802
8889780   0.512  0.041763  0.118680
6196252   0.272  0.031240  0.086068
8074031   0.067  0.016016  0.038331
10011870  0.291  0.029829  0.084682
```
Random small blobs (same columns):

```
bytes  real   user      sys
17663  0.012  0.000747  0.000467
3484   0.007  0.000948  0.000000
1912   0.008  0.001126  0.000021
15669  0.008  0.000007  0.001080
4663   0.008  0.000993  0.000000
2410   0.008  0.000000  0.000917
2104   0.008  0.000519  0.000534
5552   0.008  0.000436  0.000579
4789   0.008  0.000078  0.000964
4139   0.008  0.000564  0.000423
4180   0.009  0.000658  0.000494
6777   0.010  0.000825  0.000000
3977   0.008  0.000498  0.000000
1864   0.007  0.000231  0.000116
14452  0.007  0.000000  0.000438
4836   0.007  0.000000  0.000558
1638   0.008  0.000000  0.000954
4389   0.008  0.000670  0.000269
2727   0.008  0.000038  0.001026
4923   0.008  0.000731  0.000367
3920   0.009  0.000052  0.000946
2515   0.008  0.000000  0.000981
8640   0.008  0.001005  0.000167
5260   0.008  0.000485  0.000633
3982   0.007  0.000930  0.000000
16369  0.008  0.001102  0.000025
1969   0.007  0.000000  0.000918
2943   0.009  0.000000  0.001265
2271   0.009  0.000098  0.001015
3669   0.007  0.000000  0.000882
6915   0.007  0.000000  0.000967
14921  0.008  0.001046  0.000000
2246   0.007  0.000133  0.000790
2640   0.008  0.001119  0.000073
```
This shows that state size and the time it takes to delete a state are only loosely correlated - instead, one can see how the time sometimes just spikes, presumably due to btree rebalancing or some such. Notably though, the very small blobs don't suffer from this: in the bottom left there's a whole bunch of small deletes that are all fast.

That, together with my experiments using transactions to group the entire state pruning process, usefully rules out a couple of potential approaches.
One question is whether doing all the deletions at once helps (via some delete-flag column, or just `DELETE FROM foo WHERE key = foo OR key = bar OR key = baz`, or some other method). It'd almost certainly still hit the rebalancing spikes, but would only do so once, in a bounded way, keeping the whole state pruning operation time-bounded.
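For illustration, the flag-column variant could look roughly like this (Python's sqlite3; the `deleted` column is hypothetical and would have to be added to the schema):

```python
import sqlite3

def mark_deleted(conn: sqlite3.Connection, key: bytes) -> None:
    # Cheap on the hot path: flag the row instead of removing it.
    conn.execute("UPDATE kvstore SET deleted = 1 WHERE key = ?;", (key,))

def sweep_deleted(conn: sqlite3.Connection) -> None:
    # One bulk delete, so any btree rebalancing happens once, in a
    # bounded way, rather than once per pruned state.
    conn.execute("DELETE FROM kvstore WHERE deleted = 1;")
```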
SQLite has a detailed write-up on its rebalancing algorithm: https://sqlite.org/btreemodule.html#btree_balancing_algorithm

It used to have an async mode, but note that it's now deprecated in favor of WAL mode: https://sqlite.org/asyncvfs.html
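For reference, WAL mode is enabled with a single pragma, and the setting is persistent per database file (shown here via Python's sqlite3):

```python
import sqlite3

conn = sqlite3.connect("nbc.sqlite3")
# Returns the resulting journal mode; ('wal',) on success.
print(conn.execute("PRAGMA journal_mode=WAL;").fetchone())
conn.close()
```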
Regarding deleting in a single transaction, I guess we need to benchmark. Such a benchmark would also be useful for evaluating potential new DB options in the future.
`delState` will not delete anything most of the time, btw, since we only write doubled states every 64 epochs or so - most of the time one of the deletes is a no-op, so transactions won't help too much.
I used a patch something like test_transaction_patch.txt. Really basic, just enough to pass `BEGIN` and `COMMIT` to SQLite (`ROLLBACK` was beyond the scope of this experiment), and too sloppy to actually merge into nbc as-is, but it could be cleaned up. First, though, I'd want evidence that it helped, and in my measurements it just didn't.

The transactions were across the entire state delete operation, though, not just a workaround for the double state delete caused by two kinds of states sometimes being stored.
I timed aggregate deletions of 48,000 k/v blobs from mainnet in chunks of 1 through 10, except 7 and 9 (they don't divide 48,000). This is with small blobs in there too, so until I filter those out this isn't going to be representative of the improvement potential for state pruning, but it's indicative of at least some advantage:
```
test_script_delete_01.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real 2m19.255s
user 0m11.103s
sys 0m30.389s

test_script_delete_02.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real 2m8.748s
user 0m9.185s
sys 0m28.654s

test_script_delete_03.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real 1m57.218s
user 0m7.926s
sys 0m25.678s

test_script_delete_04.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real 1m52.099s
user 0m7.519s
sys 0m24.540s

test_script_delete_05.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real 1m48.823s
user 0m7.250s
sys 0m24.519s

test_script_delete_06.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real 1m44.653s
user 0m6.988s
sys 0m23.244s

test_script_delete_08.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real 1m43.271s
user 0m6.427s
sys 0m23.562s

test_script_delete_10.txt
'full_nbc.sqlite3' -> 'nbc.sqlite3'
real 1m41.181s
user 0m6.687s
sys 0m24.037s
```
`test_script_delete_n.txt` benchmarks deleting `n` blobs at a time, with enough statements to delete exactly 48,000 blobs.
I don't know what a typical number of deleted states per pruning is, but it might also be the case that the large latency spikes only appear once there are quite a few states, so even advantages that require deleting 10+ states at once might be worthwhile.
These were all auto-generated:

```python
#!/usr/bin/env python3
from sys import argv

total_deletes = 48000  # highly factorizable
kvs_per_delete = int(argv[1])
num_delete_statements = total_deletes // kvs_per_delete

# Using non-factors creates slightly different total deletes
assert total_deletes % kvs_per_delete == 0
assert kvs_per_delete * num_delete_statements == total_deletes

for _ in range(num_delete_statements):
    print('DELETE FROM kvstore WHERE key = (SELECT key FROM kvstore LIMIT %d);' % kvs_per_delete)
```
This also avoids focusing on the per-delete performance initially. If the version of this with only states and/or other large state-like blobs shows potential, the more detailed performance characteristics can be investigated then.

https://sqlite.org/np1queryprob.html claims that it's not important to avoid lots of small queries, but all its examples are `SELECT`s, so I don't know how relevant that is here.
Repeating this with only blobs >= 2,000,000 bytes, using the approximate threshold from @arnetheduck's chart, I removed all the sub-2MB blobs from `kvstore`, leaving 397 total. 397 is, unfortunately, prime, so using 396 of those large blobs:

- 1 per `DELETE`: 101.2s
- 2 per `DELETE`: 48.4s
- 3 per `DELETE`: 30.9s
- 4 per `DELETE`: 23.4s
- 6 per `DELETE`: 15.8s
- 11 per `DELETE`: 10.7s

The total numbers of `DELETE` statements required are `[396, 198, 132, 99, 66, 36]`, respectively.
Thus it's averaging, respectively, 3.9 `DELETE` statements/s (delete 1), 4.1/s (delete 2), 4.3/s (delete 3), 4.2/s (delete 4), 4.2/s (delete 6), and 3.4/s (delete 11). It's almost solely scaling with how many `DELETE FROM` statements it runs, not how many objects get deleted.
This points to the best improvement being to collect the keys during state pruning and batch them, then issue exactly one `DELETE FROM` statement against the underlying database. If one wants to avoid memory risks and/or support a quasi-streaming mode, batching them at least helps.
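A minimal sketch of that batching, assuming Python's sqlite3 and a hypothetical `chunk` parameter to bound statement size for the quasi-streaming case (`chunk = len(keys)` recovers the single-statement form):

```python
import sqlite3

def prune_states(conn: sqlite3.Connection, keys: list[bytes], chunk: int = 64) -> None:
    # Collect keys first, then issue a few large DELETEs instead of
    # one DELETE per state, paying rebalancing costs per batch.
    for i in range(0, len(keys), chunk):
        batch = keys[i:i + chunk]
        placeholders = ",".join("?" * len(batch))
        conn.execute(
            f"DELETE FROM kvstore WHERE key IN ({placeholders});", batch)
```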
The benchmark above hit a nearly ideal batch-btree-rebalancing order, so to test a more realistic scenario, I added `ORDER BY RANDOM()` to the queries:

```python
#!/usr/bin/env python3
from sys import argv

total_deletes = 396
kvs_per_delete = int(argv[1])
num_delete_statements = total_deletes // kvs_per_delete

# Using non-factors creates slightly different total deletes
assert total_deletes % kvs_per_delete == 0
assert kvs_per_delete * num_delete_statements == total_deletes

for _ in range(num_delete_statements):
    print('DELETE FROM kvstore WHERE key = (SELECT key FROM kvstore ORDER BY RANDOM() LIMIT %d);' % kvs_per_delete)
```
It still helps, though not quite as dramatically. The 1-per, 2-per, 3-per, 4-per, 6-per, and 11-per total times in seconds for all 396 items are:

`[108.7, 82.1, 65.0, 55.6, 40.2, 25.7]`

which translates into (same `[1-per, 2-per, 3-per, 4-per, 6-per, 11-per]` ordering):

`[0.274s, 0.415s, 0.492s, 0.562s, 0.609s, 0.714s]`

per `DELETE FROM` on average, which means some were slower. It still scales helpfully, in that the cost per deleted blob drops significantly even with random key orders. This still seems like the most useful immediate amelioration, and worthwhile.
A key thing to keep in mind though is that we don't really care that much about absolute time to delete all states - rather we want to avoid spikes where one delete takes a long time - so for the multi-deletes, it would be important to see what the max time of the query is.
State pruning operates as a single, uninterrupted loop, so the timing that matters here is the max of the *sum* of the times of all the `delState()`s in the loop, not the max individual time - at least for the single-delete case. If it deletes two states, it's better that one is instant and the other takes 0.1s than that both take 0.075s.
New methodology, to more precisely characterize this:

```python
#!/usr/bin/env python3
from sys import argv

total_deletes = 240
kvs_per_delete = int(argv[1])
num_delete_statements = total_deletes // kvs_per_delete

# Using non-factors creates slightly different total deletes
assert total_deletes % kvs_per_delete == 0
assert kvs_per_delete * num_delete_statements == total_deletes

print('.timer on')
for _ in range(num_delete_statements):
    print('DELETE FROM kvstore WHERE key = (SELECT key FROM kvstore ORDER BY RANDOM() LIMIT %d);' % kvs_per_delete)
```

This adds `.timer on` and only deletes 240 blobs, which is both more divisible than 396 and avoids odd timing quirks that affect the mean when deleting the last few blobs - unrealistic/unrepresentative of actual state pruning.
Then, to generate the test scripts, run:

```sh
rm test_script_delete_*.txt -v && for i in 01 02 03 04 05 06 08 10 12; do ./gen_test_script.py $i > test_script_delete_$i.txt; done
```

and, to benchmark each multi-delete amount:

```sh
time for s in test_script_delete_*.txt; do echo $s && cp -v full_nbc_largeblobs.sqlite3 nbc_largeblobs.sqlite3 && time sqlite3 nbc_largeblobs.sqlite3 < $s | cut -d" " -f4 > timings_$s; done
```
And finally, to get summary results of the real time (in seconds) spent per `DELETE FROM` query:

```sh
$ # columns are mean, median, p90, p95, max
$ for t in timings_test_script_delete_*.txt; do echo $t && datamash -R3 mean 1 perc:50 1 perc:90 1 perc:95 1 max 1 < $t && echo; done
```
| blobs per `DELETE` | mean | median | p90 | p95 | max |
|---:|---:|---:|---:|---:|---:|
| 1 | 0.430 | 0.368 | 1.031 | 1.264 | 2.325 |
| 2 | 0.576 | 0.498 | 1.133 | 1.370 | 1.785 |
| 3 | 0.604 | 0.501 | 1.137 | 1.277 | 1.848 |
| 4 | 0.730 | 0.610 | 1.420 | 1.460 | 2.061 |
| 5 | 0.805 | 0.699 | 1.610 | 1.740 | 1.922 |
| 6 | 0.735 | 0.614 | 1.415 | 1.509 | 1.815 |
| 8 | 0.795 | 0.699 | 1.478 | 1.514 | 2.141 |
| 10 | 0.883 | 0.887 | 1.404 | 1.457 | 2.101 |
| 12 | 0.888 | 0.868 | 1.546 | 1.599 | 1.612 |

(all times in seconds per `DELETE FROM` query, from `timings_test_script_delete_01.txt` through `timings_test_script_delete_12.txt`)
The median and mean do increase, because more of the cases hit whatever the threshold is for expensive deletes. However, the maximum latency spikes and the p90/p95 tail latencies don't change in any obviously systematic way; they fluctuate with the randomness of the key deletion order between runs.
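For what it's worth, the same summary can be computed without datamash; a rough Python equivalent (nearest-rank percentiles, so values may differ slightly from datamash's interpolation), reading one real-time value per line on stdin:

```python
#!/usr/bin/env python3
import statistics
import sys

times = sorted(float(line) for line in sys.stdin if line.strip())

def perc(p: float) -> float:
    # Nearest-rank percentile over the sorted sample.
    return times[min(len(times) - 1, int(p / 100 * len(times)))]

# mean, median, p90, p95, max
print(statistics.mean(times), statistics.median(times), perc(90), perc(95), times[-1])
```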
In this set of results, the corresponding overall times to delete all 240 keys chosen are:
```
test_script_delete_01.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 1m43.185s
user 0m4.671s
sys 0m21.050s

test_script_delete_02.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 1m9.165s
user 0m2.482s
sys 0m12.448s

test_script_delete_03.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 0m48.419s
user 0m1.775s
sys 0m8.711s

test_script_delete_04.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 0m43.863s
user 0m1.490s
sys 0m7.732s

test_script_delete_05.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 0m38.665s
user 0m1.007s
sys 0m7.495s

test_script_delete_06.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 0m29.427s
user 0m0.990s
sys 0m4.991s

test_script_delete_08.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 0m23.860s
user 0m0.700s
sys 0m4.114s

test_script_delete_10.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 0m21.222s
user 0m0.563s
sys 0m3.752s

test_script_delete_12.txt
'full_nbc_largeblobs.sqlite3' -> 'nbc_largeblobs.sqlite3'
real 0m17.800s
user 0m0.523s
sys 0m3.183s
```
State pruning timings on raspi4, courtesy of @jclapis
Following up on @zah's lead about SQLite keeping separate btrees per table, here's what happens with large objects when they're put in a table of their own:

```
sqlite> create table large as select key, value from kvstore where length(value) > 65536;
sqlite> select count(*) from large;
1314
```
It actually looks really good - the times are much tighter for a Pyrmont test. In the graph, there are the big old full states (~11 MB) and a bunch of new no-validators states (~2 MB), and both are drastically easier to get rid of - close to raw filesystem performance.

Blue dots are removals of large blobs from the separate table; red dots are removals of the same blobs from the shared table.

With these results, I think the UX simplicity of managing a single database file, the ability to use transactions, and the lack of reliance on filesystem characteristics weigh in favor of using SQLite.
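A minimal sketch of how such size-based routing could look at the storage layer (Python's sqlite3; the table names and the 65536-byte cutoff follow the experiment above, and everything else is illustrative rather than the actual nimbus-eth2 schema):

```python
import sqlite3

LARGE_THRESHOLD = 65536  # cutoff used in the experiment above

def put(conn: sqlite3.Connection, key: bytes, value: bytes) -> None:
    # Route big blobs to their own table, so deleting them rebalances
    # a btree holding ~1,300 rows instead of millions.
    table = "large" if len(value) > LARGE_THRESHOLD else "kvstore"
    conn.execute(
        f"INSERT OR REPLACE INTO {table} (key, value) VALUES (?, ?);",
        (key, value))

def get(conn: sqlite3.Connection, key: bytes) -> bytes | None:
    # The caller doesn't know the size, so check both tables.
    for table in ("kvstore", "large"):
        row = conn.execute(
            f"SELECT value FROM {table} WHERE key = ?;", (key,)).fetchone()
        if row is not None:
            return row[0]
    return None
```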
Pyrmont, unstable small - averages around 21s:
{"lvl":"DBG","ts":"2021-04-05 14:02:59.839+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"10s690ms836us718ns","epochRefPruningDur":"3us800ns"}
{"lvl":"DBG","ts":"2021-04-05 14:09:01.003+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"23s62ms612us704ns","epochRefPruningDur":"3us290ns"}
{"lvl":"DBG","ts":"2021-04-05 14:15:31.969+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"24s65ms401us189ns","epochRefPruningDur":"3us210ns"}
{"lvl":"DBG","ts":"2021-04-05 14:21:37.969+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"20s290ms722us828ns","epochRefPruningDur":"2us850ns"}
{"lvl":"DBG","ts":"2021-04-05 14:28:15.526+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"22s322ms591us959ns","epochRefPruningDur":"3us60ns"}
{"lvl":"DBG","ts":"2021-04-05 14:34:19.612+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"20s239ms460us572ns","epochRefPruningDur":"2us330ns"}
{"lvl":"DBG","ts":"2021-04-05 14:40:35.292+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"4us790ns","epochRefPruningDur":"1us190ns"}
{"lvl":"DBG","ts":"2021-04-05 14:48:10.974+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"26s754ms229us548ns","epochRefPruningDur":"2us550ns"}
{"lvl":"DBG","ts":"2021-04-05 14:54:42.249+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"19s466ms272us376ns","epochRefPruningDur":"3us20ns"}
{"lvl":"DBG","ts":"2021-04-05 15:00:06.560+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"21s691ms90us719ns","epochRefPruningDur":"2us470ns"}
Post https://github.com/status-im/nimbus-eth2/pull/2460, using the separate table, the average is around 16s, which certainly is an improvement:
{"lvl":"DBG","ts":"2021-04-08 07:06:40.706+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"19s457ms836us357ns","epochRefPruningDur":"2us640ns"}
{"lvl":"DBG","ts":"2021-04-08 07:12:51.741+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"23s283ms204us190ns","epochRefPruningDur":"2us0ns"}
{"lvl":"DBG","ts":"2021-04-08 07:19:12.031+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"13s559ms195us501ns","epochRefPruningDur":"2us210ns"}
{"lvl":"DBG","ts":"2021-04-08 07:25:36.958+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"17s171ms297us438ns","epochRefPruningDur":"2us390ns"}
{"lvl":"DBG","ts":"2021-04-08 07:31:38.449+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"5us910ns","epochRefPruningDur":"1us60ns"}
{"lvl":"DBG","ts":"2021-04-08 07:38:20.214+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"17s729ms162us636ns","epochRefPruningDur":"2us110ns"}
{"lvl":"DBG","ts":"2021-04-08 07:45:04.142+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"16s29ms125us99ns","epochRefPruningDur":"2us110ns"}
{"lvl":"DBG","ts":"2021-04-08 07:52:30.557+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"21s426ms22us84ns","epochRefPruningDur":"2us890ns"}
{"lvl":"DBG","ts":"2021-04-08 07:57:28.643+00:00","msg":"Pruned the state checkpoints and DAG caches.","topics":"chaindag","tid":1,"file":"blockchain_dag.nim:964","statePruningDur":"7s989ms811us570ns","epochRefPruningDur":"2us540ns"}
Still quite slow. Now that the files version is behind a flag, we can enable it on these servers for a while to gather statistics.
Resolved with https://github.com/status-im/nimbus-eth2/pull/2560 - migrating existing data to the new layout will further improve performance
Moving pruning to slot end was implemented in #2384 and #2422, closing #2381.

On a high-end NVMe SSD, it takes 220 milliseconds to prune after finalization, even though only very few states should have been accumulated.