hashgraph / hedera-services

Crypto, token, consensus, file, and smart contract services for the Hedera public ledger
Apache License 2.0

Investigate `ValidateLeafIndexHalfDiskHashMap` failure #13929

Closed · imalygin closed this issue 3 weeks ago

imalygin commented 3 weeks ago

Description

Node08 consistently fails the ValidateLeafIndexHalfDiskHashMap.validateIndex validation. It occurs every round. Here is an example: https://github.com/hashgraph/hedera-state-validator/actions/runs/9603442996/job/26486697813

With the following error message:

Unexpected key info:  
[UnexpectedKeyInfo{path=1, expectedKey=OnDiskKey{key=ScheduleID[shardNum=0, realmNum=0, scheduleNum=6102154]}, actualKey=OnDiskKey{key=ScheduleID[shardNum=0, realmNum=0, scheduleNum=6195571]}}
]
There are 1 records with unexpected keys, please check the logs for more info
ValidateLeafIndexHalfDiskHashMap.validateIndex,[5] ScheduleService.SCHEDULES_BY_ID - FAILED, time taken - 0 sec 

This validation fails for all SCHEDULES_BY_* maps.

This failure means that one of the buckets in the HalfDiskHashMap contains an entry such that, if you take the path from that entry and use it to look up a key via pathToDiskLocationLeafNodes, you get a record with a different key.
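To make the invariant concrete, here is a minimal sketch of the check described above, using plain maps in place of the real MerkleDb structures. All names (`validateIndex`, the map parameters) are illustrative, not the actual validator API:

```java
import java.util.HashMap;
import java.util.Map;

public class HdhmCheckSketch {

    // bucketEntries: key -> path, as stored in a HalfDiskHashMap bucket.
    // pathToLeaf: path -> key actually stored at that path on disk
    // (standing in for pathToDiskLocationLeafNodes plus the data files).
    static boolean validateIndex(Map<String, Long> bucketEntries,
                                 Map<Long, String> pathToLeaf) {
        for (Map.Entry<String, Long> e : bucketEntries.entrySet()) {
            String expectedKey = e.getKey();
            long path = e.getValue();
            String actualKey = pathToLeaf.get(path);
            if (!expectedKey.equals(actualKey)) {
                // Mirrors the "Unexpected key info" report from the log above.
                System.out.println("UnexpectedKeyInfo{path=" + path
                        + ", expectedKey=" + expectedKey
                        + ", actualKey=" + actualKey + "}");
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Map<String, Long> buckets = new HashMap<>();
        buckets.put("ScheduleID[scheduleNum=6102154]", 1L);

        Map<Long, String> leaves = new HashMap<>();
        // A stale leaf record left on disk under path 1.
        leaves.put(1L, "ScheduleID[scheduleNum=6195571]");

        System.out.println(validateIndex(buckets, leaves)); // prints false
    }
}
```

In the real failure, the bucket entry is correct; it is the leaf record on disk that is stale, as the root-cause analysis below the reproduction steps explains.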

Steps to reproduce

  1. Download the validator:

    gsutil cp gs://hedera-ci-ephemeral-artifacts/hedera/hedera-state-validator/validator-13929-0-50.jar .

    This validator is built from the 13929-0-50 branch of the hedera-state-validator and hedera-services repos.

  2. Download the state for Node08:

    gsutil -m cp -r gs://mainnet-1m/hedboblck02.cs.boeing.com/174611706/ .
  3. Run the following command:

    java -jar validator-13929-0-50.jar 174611706 hdhm

Additional context

Interestingly enough, it happens only on a single node.

Hedera network

mainnet

Version

v0.50

Operating system

None

artemananiev commented 3 weeks ago

I cannot reproduce it locally. I tried both the 0.50 and 0.51 versions and a couple of different snapshots that previously failed validation.

UPD: I found a way to reproduce it now.

artemananiev commented 3 weeks ago

This appeared to be a bug in MerkleDbDataSource. When all elements are removed from a virtual map, the final flush to disk has both the first and last leaf paths set to -1, and all remaining leaf records are expected to be deleted from the data source. However, MerkleDb just checks whether the first/last leaf path is -1 and does nothing in that case. As a result, the leaves are preserved in the data files, and this is exactly what the validation tool detects.

The fix is pretty straightforward: drop the check for the first/last leaf paths and delete the leaf records regardless. With that check removed, MerkleDb needs to make sure no new data file is created if both the deleted and dirty leaf streams are empty.
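The buggy and fixed flush behaviors can be sketched like this. The method names and parameters are simplified stand-ins, not the real MerkleDbDataSource code:

```java
import java.util.List;

public class FlushSketch {

    // Buggy behavior: when the map has become empty (first/last leaf path
    // is -1), the flush returns early, so pending leaf deletions are
    // silently skipped and stale records stay in the data files.
    static boolean buggyFlushDeletesLeaves(long firstLeafPath,
                                           long lastLeafPath,
                                           List<String> deletedLeaves) {
        if (firstLeafPath == -1 && lastLeafPath == -1) {
            return false; // early return: deletions never happen
        }
        return !deletedLeaves.isEmpty();
    }

    // Fixed behavior: always process deletions; only skip writing when
    // both the deleted and dirty leaf streams are empty, so no empty
    // data file is created.
    static boolean fixedFlushDeletesLeaves(List<String> deletedLeaves,
                                           List<String> dirtyLeaves) {
        if (deletedLeaves.isEmpty() && dirtyLeaves.isEmpty()) {
            return false; // nothing to write at all
        }
        return !deletedLeaves.isEmpty();
    }
}
```

With the fixed logic, emptying the map still drives the leaf deletions to disk, which is what keeps the HalfDiskHashMap buckets and pathToDiskLocationLeafNodes consistent.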

artemananiev commented 3 weeks ago

Such a change will not make the validation tool pass on the old state snapshots, but it will prevent similar failures in the future.