tersec closed this pull request 1 month ago

CI: 9 files ±0, 1 334 suites ±0, 28m 53s (-29s). 5 000 tests ±0: 4 652 passed, 348 skipped, 0 failed. 20 880 runs ±0: 20 476 passed, 404 skipped, 0 failed.
Results for commit 378d8b4d. Comparison against base commit 6a576978.
Around 25% faster on ordinary deposit processing, too. Comparing blocks on 2024-08-03 on linux-0x.ih-eu-mda1.nimbus.mainnet hosts, for blocks without deposits:

```sh
$ for i in $(seq 1 6); do ssh linux-0$i.ih-eu-mda1.nimbus.mainnet "zgrep -h Block.resolved ${NODE_LOGS}/service.20240803_{13,20}00.log.gz | grep ,.deposits_len.:0, | jq .stateVerifyDur.value"; done | datamash count 1 mean 1 q1 1 median 1 q3 1
3490 56564476.147278 54566626.5 55893542.5 57634373.75
```

for stable
and
```sh
$ for i in $(seq 1 6); do ssh linux-0$i.ih-eu-mda1.nimbus.mainnet "zgrep -h Block.resolved ${NODE_LOGS}/service.20240803_{13,20}00.log.gz | grep ,.deposits_len.:0, | jq .stateVerifyDur.value"; done | datamash count 1 mean 1 q1 1 median 1 q3 1
3488 57014482.569954 54880211.5 56296166.5 58097381.5
```
for unstable
Scanning just those two hours is quicker, that's also when the blocks with deposits fall, and the sample size suffices regardless.
These are effectively the same: no real change. By contrast, with deposits, stable (without these optimizations) has
```sh
$ for i in $(seq 1 6); do ssh linux-0$i.ih-eu-mda1.nimbus.mainnet "zgrep -h Block.resolved ${NODE_LOGS}/service.20240803_{13,20}00.log.gz | grep -v ,.deposits_len.:0, | jq .stateVerifyDur.value"; done | datamash count 1 mean 1 q1 1 median 1 q3 1
96 381539194.44792 342287028.25 371257674.5 415338661.75
```
whereas unstable brings that down to
```sh
$ for i in $(seq 1 6); do ssh linux-0$i.ih-eu-mda1.nimbus.mainnet "zgrep -h Block.resolved ${NODE_LOGS}/service.20240803_{13,20}00.log.gz | grep -v ,.deposits_len.:0, | jq .stateVerifyDur.value"; done | datamash count 1 mean 1 q1 1 median 1 q3 1
96 272465554.0625 268435887.5 277284808 289672389
```
a more than 25% improvement.
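For reference, the headline figure follows directly from the two mean `stateVerifyDur` values in the datamash output above (a quick sketch; the values are copied verbatim from those runs):

```python
# Mean stateVerifyDur (nanoseconds) for deposit-carrying blocks,
# taken from the datamash output above.
stable_mean = 381539194.44792    # stable (without these optimizations)
unstable_mean = 272465554.0625   # unstable (with this PR)

improvement = (stable_mean - unstable_mean) / stable_mean
print(f"{improvement:.1%}")  # prints 28.6%
```

So the quoted "25+%" is conservative: by the means it is closer to 29%.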
Prior to Electra, the Bloom filter along with the reversed search order sufficed to improve deposit processing speeds, and nowhere else in the beacon chain spec was it required to look up a validator index from a pubkey. The Bloom filter exploited the fact that finding the actual validator index wasn't even necessary, only determining whether it existed.
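A membership-only structure of that kind can be sketched as follows (illustrative Python, not the actual Nim implementation; the class name, probe count, and hashing scheme are all assumptions made for the example). Note it can answer "might this pubkey exist?" with no false negatives, but by construction it can never return the validator's index:

```python
import hashlib

class PubkeyBloom:
    """Illustrative Bloom filter over validator pubkeys (membership only)."""

    def __init__(self, size_bits: int = 8 * 1024 * 1024 * 8):  # 8MB of bits
        self.size = size_bits
        self.bits = bytearray(size_bits // 8)

    def _positions(self, pubkey: bytes):
        # Derive two probe positions from a single hash of the pubkey.
        h = hashlib.sha256(pubkey).digest()
        yield int.from_bytes(h[:8], "little") % self.size
        yield int.from_bytes(h[8:16], "little") % self.size

    def add(self, pubkey: bytes) -> None:
        for p in self._positions(pubkey):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, pubkey: bytes) -> bool:
        # False means definitely absent; True means probably present.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(pubkey))
```

The "definitely absent" answer is the useful one for deposits: a pubkey not yet in the validator set can skip the expensive lookup entirely.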
However, in Electra, execution layer withdrawal requests and consolidation requests both require ascertaining the specific validator index, not merely its presence or absence, and the latter does so twice per request.
Because there can be up to 16 withdrawal and 1 consolidation request(s) per block, this might require up to 18 full validator set scans per block for those functions alone.
One obvious approach is to use a `Table` or similar construction, but even with some clever encoding, that typically requires hundreds (plural) of additional megabytes for this functionality. It's not required to be O(1), just fast enough to run a few dozen times per block, because as of this PR it's now responsible for deposit processing too, which can occur 16 times per block, in addition to the deposit requests.

The Bloom filter had been configured to use a single 8MB array, and since Nimbus represents `ValidatorIndex` as `distinct int32`, for 2 million validators total (already at ~1.6M, not just active) the bucket sort is equivalent in memory usage, while taking the same approximately 12ms to initialize and still doing a full search in about 0.1ms. That's fast enough to construct it per block, which avoids long-lived cache weirdness and allows a couple of hundred searches without serious impact on block processing time.

The data structure gets a little convoluted, with a separate one-time-constructed array and a separate free-form `seq`. This design was arrived at after benchmarking: a version which constructed (initially also perfectly allocated and uninitialized) `seq`s lost in initialization time by 3-4ms, or about 33%, because the nested `array`/`seq` indexing combined with `boundChecks:on` made for a rather hot inner loop. Because it's a short-lived acceleration structure, it can safely assume that these `extraItems` won't be many, only as many validators as a single block can introduce, and it never appears in a tight loop, so simple is fine.
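The general shape of that design can be sketched as follows (hypothetical Python, not the Nim code: the two-byte bucketing, bucket count, and names like `extra_items` and `find` are illustrative assumptions). A counting pass sizes each bucket exactly, a scatter pass fills one flat, once-constructed array, and validators introduced after construction go into the small separate list:

```python
NUM_BUCKETS = 1 << 16  # bucket by the first two pubkey bytes (illustrative)

class PubkeyToIndex:
    """Short-lived pubkey -> validator index lookup via a bucket sort."""

    def __init__(self, pubkeys: list):
        # Counting pass: size each bucket exactly.
        counts = [0] * NUM_BUCKETS
        for pk in pubkeys:
            counts[self._bucket(pk)] += 1
        # Prefix sums give each bucket's start offset in one flat array.
        starts = [0] * (NUM_BUCKETS + 1)
        for b in range(NUM_BUCKETS):
            starts[b + 1] = starts[b] + counts[b]
        self.starts = starts
        # Scatter pass: place (pubkey, index) pairs into their buckets.
        self.items = [None] * len(pubkeys)
        fill = starts[:NUM_BUCKETS]
        for i, pk in enumerate(pubkeys):
            b = self._bucket(pk)
            self.items[fill[b]] = (pk, i)
            fill[b] += 1
        # Validators introduced after construction; a single block can only
        # add a handful, so a plain linear scan over these is fine.
        self.extra_items = []

    @staticmethod
    def _bucket(pk: bytes) -> int:
        return int.from_bytes(pk[:2], "big")

    def add(self, pk: bytes, index: int) -> None:
        self.extra_items.append((pk, index))

    def find(self, pk: bytes):
        # Scan only this pubkey's bucket, then the few post-construction items.
        b = self._bucket(pk)
        for item_pk, idx in self.items[self.starts[b]:self.starts[b + 1]]:
            if item_pk == pk:
                return idx
        for item_pk, idx in self.extra_items:
            if item_pk == pk:
                return idx
        return None
```

The single flat array is the point: one contiguous allocation, filled once, so the hot construction loop stays simple, while additions during block processing land in the unordered side list instead of perturbing the sorted structure.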