walnut-the-cat closed this issue 6 months ago
The backtrace from the linked Zulip thread suggests that the panic happened inside <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes:
2024-04-16T20:21:23.545144Z DEBUG chunk_tracing{chunk_hash=HnFSQEoLMEnMXK2pxnnnbv7GkwFobanyrd7JJbNS2Rrj}:new_chunk{shard_id=3}:apply_chunk{shard_id=3}:process_state_update:apply{protocol_version=84 num_transactions=19}:process_receipt{receipt_id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP predecessor=operator.meta-pool.near receiver=lockup-meta-pool.near id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP}:run{code.hash=EXekfV3kpFHHsTi4JUDh2MVLCKS3hpKdPbXMuRirxrvY vm_kind=NearVm}: vm: close time.busy=49.3µs time.idle=3.42µs
thread '<unnamed>' panicked at core/store/src/trie/trie_storage.rs:317:16:
!!!CRASH!!!: MissingTrieValue(TrieMemoryPartialStorage, 5FWvfWAJxH1mbCHuzLGwBfL9EYjH8YWVin6Pmp3H8gdM)
stack backtrace:
0: rust_begin_unwind
1: core::panicking::panic_fmt
2: core::result::unwrap_failed
3: <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes
4: near_store::trie::Trie::internal_retrieve_trie_node
5: near_store::trie::Trie::retrieve_raw_node
6: near_store::trie::Trie::lookup_from_state_column
7: near_store::trie::Trie::get_optimized_ref
8: near_store::trie::Trie::get
9: near_store::trie::update::TrieUpdate::get
10: near_store::get_code
11: node_runtime::actions::execute_function_call
12: node_runtime::Runtime::apply_action
13: node_runtime::Runtime::apply_action_receipt
14: node_runtime::Runtime::apply::{{closure}}
15: node_runtime::Runtime::apply
16: <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk
17: near_chain::update_shard::apply_new_chunk
18: core::ops::function::FnOnce::call_once{{vtable.shim}}
19: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
20: rayon_core::registry::WorkerThread::wait_until_cold
But looking at master, there are no unwraps in this function 0_o:
https://github.com/near/nearcore/blob/0202ed61157de207ecb409fe287551518491f74d/core/store/src/trie/trie_storage.rs#L311-L322
That's pretty confusing 0_O.
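For reference, a rough paraphrase of the linked master code (field and variant names are assumed from the panic message above, not copied from the source): the lookup returns a Result, so a missing hash comes back as a recoverable StorageError::MissingTrieValue rather than a panic.

fn retrieve_raw_bytes(&self, hash: &CryptoHash) -> Result<Arc<[u8]>, StorageError> {
    // Look the value up in the in-memory partial storage; report a missing
    // hash as an error instead of unwrapping.
    match self.recorded_storage.get(hash) {
        Some(value) => Ok(value.clone()),
        None => Err(StorageError::MissingTrieValue(
            MissingTrieValueContext::TrieMemoryPartialStorage,
            *hash,
        )),
    }
}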
@staffik you mentioned that the code was at commit b62b6a, but I don't know where to find it. Is it on some private branch?
This was a commit from Alex's PR: https://github.com/near/nearcore/pull/11071/commits/b62b6a3cf1e38c947ea523dc855d58444bf24109. Maybe this unwrap was removed recently in master.
This was a commit from Alex's PR: https://github.com/near/nearcore/commit/b62b6a3cf1e38c947ea523dc855d58444bf24109.
Ah thanks for the link, now I see it :+1:
Maybe this unwrap was removed recently in master.
I took a look and it's the same code as on master.
Mysterious 0_O
I see that this line (called through validate_chunk_state_witness --> apply_new_chunk --> apply_chunk) may also get to MissingTrieValue: https://github.com/near/nearcore/blob/a1a01b45c53c1619c5c9ddfc0d5c9f1cf7a39d43/chain/chain/src/runtime/mod.rs#L917
Oh, we used tracing. It was: https://github.com/near/nearcore/pull/10843/files#diff-e073548a40d97af14f75cf143fab41a1cffe61d159e0b9a6297daeab0b2a5d45R317
Ahh ok, so the panic was caused by custom code that was added for debug purposes. The code on master doesn't have expect("!!!CRASH!!!"), so there's nothing to fix there.
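For readers following along, a hypothetical reconstruction of that debug-only code (the real diff is in the PR linked above; this is not on master): unwrapping the lookup Result with expect() is exactly what produces the "!!!CRASH!!!: MissingTrieValue(..)" panic from the backtrace.

// Debug-only instrumentation, reconstructed for illustration:
let value = self
    .recorded_storage
    .get(hash)
    .cloned()
    .ok_or_else(|| {
        StorageError::MissingTrieValue(
            MissingTrieValueContext::TrieMemoryPartialStorage,
            *hash,
        )
    })
    .expect("!!!CRASH!!!"); // panics instead of propagating the error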
I see that this line (called through validate_chunk_state_witness --> apply_new_chunk --> apply_chunk) may also get to MissingTrieValue:
Err(e) => match e {
    Error::StorageError(err) => match &err {
        StorageError::FlatStorageBlockNotSupported(_)
        | StorageError::MissingTrieValue(..) => Err(err.into()),
        _ => panic!("{err}"),
    },
    _ => Err(e),
},
This won't panic on MissingTrieValue; it'll return an error. It could panic if it hits another StorageError that isn't accounted for, but I don't know if that's even possible 0_o
It seems that most of the StorageError variants are fatal errors (corrupted database etc.):
/// Errors which may occur during working with trie storages, storing
/// trie values (trie nodes and state values) by their hashes.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum StorageError {
    /// Key-value db internal failure
    StorageInternalError,
    /// Requested trie value by its hash which is missing in storage.
    MissingTrieValue(MissingTrieValueContext, CryptoHash),
    /// Found trie node which shouldn't be part of state. Raised during
    /// validation of state sync parts where incorrect node was passed.
    /// TODO (#8997): consider including hash of trie node.
    UnexpectedTrieValue,
    /// Either invalid state or key-value db is corrupted.
    /// For PartialStorage it cannot be corrupted.
    /// Error message is unreliable and for debugging purposes only. It's also probably ok to
    /// panic in every place that produces this error.
    /// We can check if db is corrupted by verifying everything in the state trie.
    StorageInconsistentState(String),
    /// Flat storage error, meaning that it doesn't support some block anymore.
    /// We guarantee that such block cannot become final, thus block processing
    /// must resume normally.
    FlatStorageBlockNotSupported(String),
    /// In-memory trie could not be loaded for some reason.
    MemTrieLoadingError(String),
}
Maybe we shouldn't panic on UnexpectedTrieValue, as that looks like something that could be triggered by an invalid witness. But OTOH we have tests which check for this error and they don't trigger the panic, so it's probably a different code path. /cc @Longarithm
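To make that concrete, a purely illustrative sketch (not an actual nearcore change) of the match shown above with UnexpectedTrieValue moved into the recoverable arms:

Err(e) => match e {
    Error::StorageError(err) => match &err {
        StorageError::FlatStorageBlockNotSupported(_)
        | StorageError::MissingTrieValue(..)
        | StorageError::UnexpectedTrieValue => Err(err.into()),
        // Remaining variants (internal failures, inconsistent state) still panic.
        _ => panic!("{err}"),
    },
    _ => Err(e),
},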
Idk, it doesn't feel very productive to read the code in hopes of finding a possible panic. AFAIU the panic that spawned this issue can't happen on master, so there's nothing concrete to fix.
I remember that we wanted to fuzz the validation code; maybe that'd be a quicker way to find possible crashes in validation? And good validation tests would ensure that the validation doesn't crash in the future, when the code changes.
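As a rough illustration, a libfuzzer-style target could look like the sketch below; the entry point validate_state_witness_bytes is hypothetical, not nearcore's actual API.

#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    // Arbitrary bytes must never crash the validator: decoding failures and
    // invalid witnesses should come back as Err, not as panics.
    let _ = validation::validate_state_witness_bytes(data);
});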
Made an issue about fuzzing: https://github.com/near/nearcore/issues/11132
Issue is not valid anymore. Closing it.
Relevant discussion: link
As of now, when a chunk validator receives an invalid state witness (e.g. missing contract code), it will panic and crash. We should change the behavior so it simply logs the issue but moves on without panicking.
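A minimal sketch of that behavior (illustrative only, building on the error handling shown earlier in this thread): log the storage error and return it as an invalid-witness error instead of panicking.

Err(e) => match e {
    Error::StorageError(err) => {
        // Log and treat the witness as invalid rather than crashing the node.
        tracing::warn!(target: "chunk_validation", ?err, "invalid state witness");
        Err(err.into())
    }
    _ => Err(e),
},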