tersec closed this pull request 1 month ago

CI: 9 files ±0, 1 334 suites ±0, 28m 53s (-29s). 5 000 tests ±0: 4 652 passed, 348 skipped, 0 failed. 20 880 runs ±0: 20 476 passed, 404 skipped, 0 failed.
Results for commit 378d8b4d. Comparison against base commit 6a576978.
Around 25% faster on ordinary deposit processing, too. Comparing blocks on 2024-08-03 on linux-0x.ih-eu-mda1.nimbus.mainnet hosts, for blocks without deposits:

```sh
$ for i in $(seq 1 6); do ssh linux-0$i.ih-eu-mda1.nimbus.mainnet "zgrep -h Block.resolved ${NODE_LOGS}/service.20240803_{13,20}00.log.gz | grep ,.deposits_len.:0, | jq .stateVerifyDur.value"; done | datamash count 1 mean 1 q1 1 median 1 q3 1
3490 56564476.147278 54566626.5 55893542.5 57634373.75
```

for stable
and
```sh
$ for i in $(seq 1 6); do ssh linux-0$i.ih-eu-mda1.nimbus.mainnet "zgrep -h Block.resolved ${NODE_LOGS}/service.20240803_{13,20}00.log.gz | grep ,.deposits_len.:0, | jq .stateVerifyDur.value"; done | datamash count 1 mean 1 q1 1 median 1 q3 1
3488 57014482.569954 54880211.5 56296166.5 58097381.5
```
for unstable
Scanning just those two hours is quicker, that's also when the blocks with deposits fall, and the sample size suffices regardless.
These are effectively the same: no real change. By contrast, with deposits, stable (without these optimizations) has
```sh
$ for i in $(seq 1 6); do ssh linux-0$i.ih-eu-mda1.nimbus.mainnet "zgrep -h Block.resolved ${NODE_LOGS}/service.20240803_{13,20}00.log.gz | grep -v ,.deposits_len.:0, | jq .stateVerifyDur.value"; done | datamash count 1 mean 1 q1 1 median 1 q3 1
96 381539194.44792 342287028.25 371257674.5 415338661.75
```
whereas unstable brings that down to
```sh
$ for i in $(seq 1 6); do ssh linux-0$i.ih-eu-mda1.nimbus.mainnet "zgrep -h Block.resolved ${NODE_LOGS}/service.20240803_{13,20}00.log.gz | grep -v ,.deposits_len.:0, | jq .stateVerifyDur.value"; done | datamash count 1 mean 1 q1 1 median 1 q3 1
96 272465554.0625 268435887.5 277284808 289672389
```
a more than 25% improvement.
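For reference, the headline figure follows directly from the two mean `stateVerifyDur` values in the datamash output above (a quick sketch; the values are copied verbatim from those runs):

```python
# Mean stateVerifyDur (nanoseconds) for deposit-carrying blocks,
# taken from the datamash output above.
stable_mean = 381539194.44792    # stable (without these optimizations)
unstable_mean = 272465554.0625   # unstable (with this PR)

improvement = (stable_mean - unstable_mean) / stable_mean
print(f"{improvement:.1%}")  # prints 28.6%
```

So the quoted "25+%" is conservative: by the means it is closer to 29%.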
Prior to Electra, the Bloom filter along with the reversed search order sufficed to improve deposit processing speeds, and nowhere else in the beacon chain spec was it required to look up a validator index from a pubkey. The Bloom filter exploited the fact that finding the actual validator index wasn't even necessary, only determining whether it existed.
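A membership-only structure of that kind can be sketched as follows (illustrative Python, not the actual Nim implementation; the class name, probe count, and hashing scheme are all assumptions made for the example). Note it can answer "might this pubkey exist?" with no false negatives, but by construction it can never return the validator's index:

```python
import hashlib

class PubkeyBloom:
    """Illustrative Bloom filter over validator pubkeys (membership only)."""

    def __init__(self, size_bits: int = 8 * 1024 * 1024 * 8):  # 8MB of bits
        self.size = size_bits
        self.bits = bytearray(size_bits // 8)

    def _positions(self, pubkey: bytes):
        # Derive two probe positions from a single hash of the pubkey.
        h = hashlib.sha256(pubkey).digest()
        yield int.from_bytes(h[:8], "little") % self.size
        yield int.from_bytes(h[8:16], "little") % self.size

    def add(self, pubkey: bytes) -> None:
        for p in self._positions(pubkey):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, pubkey: bytes) -> bool:
        # False means definitely absent; True means probably present.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(pubkey))
```

The "definitely absent" answer is the useful one for deposits: a pubkey not yet in the validator set can skip the expensive lookup entirely.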
However, in Electra, execution layer withdrawal requests and consolidation requests both require ascertaining the specific validator index, not merely its presence or absence, and the latter does so twice per request.
Because there can be up to 16 withdrawal and 1 consolidation request(s) per block, this might require up to 18 full validator set scans per block for those functions alone.
One obvious approach is to use a `Table` or similar construction, but even with some clever encoding, that typically requires hundreds (plural) of additional megabytes for this functionality. It's not required to be O(1), just fast enough to run a few dozen times per block, because as of this PR it's now responsible for deposit processing too, which can occur 16 times per block, in addition to the deposit requests.

The Bloom filter had been configured to use a single 8MB array, and since Nimbus represents `ValidatorIndex` as `distinct int32`, for 2 million validators total (already at ~1.6M, not just active) the bucket sort is equivalent in memory usage, while taking the same approximately 12ms to initialize and still doing a full search in about 0.1ms. That's fast enough to construct it per block, which avoids long-lived cache weirdness and allows a couple of hundred searches without serious impact on block processing time.

The data structure gets a little convoluted, with a separate one-time-constructed array and a separate free-form `seq`. This design was arrived at after benchmarking: a version which constructed (initially also perfectly allocated and uninitialized) `seq`s lost in initialization time by 3-4ms, or about 33%, because the nested `array`/`seq` indexing combined with `boundChecks:on` made for a rather hot inner loop. Because it's a short-lived acceleration structure, it can safely assume that these `extraItems` won't be many, only as many validators as a single block can introduce, and it never appears in a tight loop, so simple is fine.
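The general shape of that design can be sketched as follows (hypothetical Python, not the Nim code: the two-byte bucketing, bucket count, and names like `extra_items` and `find` are illustrative assumptions). A counting pass sizes each bucket exactly, a scatter pass fills one flat, once-constructed array, and validators introduced after construction go into the small separate list:

```python
NUM_BUCKETS = 1 << 16  # bucket by the first two pubkey bytes (illustrative)

class PubkeyToIndex:
    """Short-lived pubkey -> validator index lookup via a bucket sort."""

    def __init__(self, pubkeys: list):
        # Counting pass: size each bucket exactly.
        counts = [0] * NUM_BUCKETS
        for pk in pubkeys:
            counts[self._bucket(pk)] += 1
        # Prefix sums give each bucket's start offset in one flat array.
        starts = [0] * (NUM_BUCKETS + 1)
        for b in range(NUM_BUCKETS):
            starts[b + 1] = starts[b] + counts[b]
        self.starts = starts
        # Scatter pass: place (pubkey, index) pairs into their buckets.
        self.items = [None] * len(pubkeys)
        fill = starts[:NUM_BUCKETS]
        for i, pk in enumerate(pubkeys):
            b = self._bucket(pk)
            self.items[fill[b]] = (pk, i)
            fill[b] += 1
        # Validators introduced after construction; a single block can only
        # add a handful, so a plain linear scan over these is fine.
        self.extra_items = []

    @staticmethod
    def _bucket(pk: bytes) -> int:
        return int.from_bytes(pk[:2], "big")

    def add(self, pk: bytes, index: int) -> None:
        self.extra_items.append((pk, index))

    def find(self, pk: bytes):
        # Scan only this pubkey's bucket, then the few post-construction items.
        b = self._bucket(pk)
        for item_pk, idx in self.items[self.starts[b]:self.starts[b + 1]]:
            if item_pk == pk:
                return idx
        for item_pk, idx in self.extra_items:
            if item_pk == pk:
                return idx
        return None
```

The single flat array is the point: one contiguous allocation, filled once, so the hot construction loop stays simple, while additions during block processing land in the unordered side list instead of perturbing the sorted structure.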