near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Exclude contract code out of state witness & distribute separately #11099

Open walnut-the-cat opened 5 months ago

walnut-the-cat commented 5 months ago

Relevant discussion

Link

Issue

During the stateless validation forknet test, we observed a node crash with the following error:

2024-04-16T20:21:23.545144Z DEBUG chunk_tracing{chunk_hash=HnFSQEoLMEnMXK2pxnnnbv7GkwFobanyrd7JJbNS2Rrj}:new_chunk{shard_id=3}:apply_chunk{shard_id=3}:process_state_update:apply{protocol_version=84 num_transactions=19}:process_receipt{receipt_id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP predecessor=operator.meta-pool.near receiver=lockup-meta-pool.near id=GHhLncT5GM2ksuwVzUqPMkzCp132V7xToQZPfUbKeRgP}:run{code.hash=EXekfV3kpFHHsTi4JUDh2MVLCKS3hpKdPbXMuRirxrvY vm_kind=NearVm}: vm: close time.busy=49.3µs time.idle=3.42µs
thread '<unnamed>' panicked at core/store/src/trie/trie_storage.rs:317:16:
!!!CRASH!!!: MissingTrieValue(TrieMemoryPartialStorage, 5FWvfWAJxH1mbCHuzLGwBfL9EYjH8YWVin6Pmp3H8gdM)
stack backtrace:
   0: rust_begin_unwind
   1: core::panicking::panic_fmt
   2: core::result::unwrap_failed
   3: <near_store::trie::trie_storage::TrieMemoryPartialStorage as near_store::trie::trie_storage::TrieStorage>::retrieve_raw_bytes
   4: near_store::trie::Trie::internal_retrieve_trie_node
   5: near_store::trie::Trie::retrieve_raw_node
   6: near_store::trie::Trie::lookup_from_state_column
   7: near_store::trie::Trie::get_optimized_ref
   8: near_store::trie::Trie::get
   9: near_store::trie::update::TrieUpdate::get
  10: near_store::get_code
  11: node_runtime::actions::execute_function_call
  12: node_runtime::Runtime::apply_action
  13: node_runtime::Runtime::apply_action_receipt
  14: node_runtime::Runtime::apply::{{closure}}
  15: node_runtime::Runtime::apply
  16: <near_chain::runtime::NightshadeRuntime as near_chain::types::RuntimeAdapter>::apply_chunk
  17: near_chain::update_shard::apply_new_chunk
  18: core::ops::function::FnOnce::call_once{{vtable.shim}}
  19: <rayon_core::job::HeapJob<BODY> as rayon_core::job::Job>::execute
  20: rayon_core::registry::WorkerThread::wait_until_cold

@Longarithm mentioned that

  10: near_store::get_code

is due to contract code missing from the state witness.

From the debug log, @staffik confirmed that this was likely the case and that the crash was happening with different contracts, including lockup-meta-pool.near and pack.promotional.basketball.playible.near.
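To illustrate the failure mode in the backtrace, here is a minimal sketch of a storage that only knows the values recorded in a state witness. All names (`PartialStorage`, `StorageError`, the `Hash` alias, this simplified `get_code`) are hypothetical stand-ins modeled loosely on the identifiers in the log, not the actual nearcore types. It assumes code is looked up by hash and that a lookup for a hash absent from the witness returns an error, which the real code path then turns into a panic.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a 32-byte value hash.
type Hash = [u8; 32];

/// Simplified error type; only the variant seen in the crash log is modeled.
#[derive(Debug, PartialEq)]
enum StorageError {
    MissingTrieValue(Hash),
}

/// Simplified model of a storage backed only by the values
/// that were recorded into a state witness.
struct PartialStorage {
    recorded: HashMap<Hash, Vec<u8>>,
}

impl PartialStorage {
    /// Returns the recorded bytes, or an error if the witness did not include them.
    fn retrieve_raw_bytes(&self, hash: &Hash) -> Result<&[u8], StorageError> {
        self.recorded
            .get(hash)
            .map(|bytes| bytes.as_slice())
            .ok_or(StorageError::MissingTrieValue(*hash))
    }
}

/// Simplified contract lookup: the code must be present in the witness,
/// otherwise the caller sees `MissingTrieValue`.
fn get_code<'a>(storage: &'a PartialStorage, code_hash: &Hash) -> Result<&'a [u8], StorageError> {
    storage.retrieve_raw_bytes(code_hash)
}
```

In this sketch, calling `get_code` with a hash that was never recorded yields `Err(StorageError::MissingTrieValue(..))`; unwrapping that result would reproduce a panic of the kind shown above.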

@Longarithm's understanding of how this can cause a node crash is as follows:

Timeline

April 17

@Longarithm is preparing a quick patch to bypass the issue in Forknet for now, but we need a proper solution in place before the MainNet launch.

April 18

The team discussed the proper solution and decided to separate contract code from the state witness. When a chunk validator realizes that it does not have the contract code needed to validate an incoming state witness, it will reactively request the missing code from its peers. As a result, a chunk miss may happen, but the chunk validator should have the compiled contract code ready for future validation.
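The reactive flow described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `ContractCache`, `ValidationOutcome`, `try_validate`, and `on_code_response` are hypothetical names, the network layer is omitted, and "compilation" is modeled as simply storing the bytes.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a contract code hash.
type CodeHash = [u8; 32];

#[derive(Debug, PartialEq)]
enum ValidationOutcome {
    /// All required contracts were available locally.
    Validated,
    /// Validation was skipped (a chunk miss); these hashes should be requested from peers.
    MissingCode(Vec<CodeHash>),
}

/// Hypothetical per-validator cache of contract code received so far.
#[derive(Default)]
struct ContractCache {
    compiled: HashMap<CodeHash, Vec<u8>>,
    pending_requests: HashSet<CodeHash>,
}

impl ContractCache {
    /// Checks whether every contract needed by an incoming witness is cached.
    /// Missing hashes are recorded as pending so the caller can request them.
    fn try_validate(&mut self, needed: &[CodeHash]) -> ValidationOutcome {
        let mut missing: Vec<CodeHash> = needed
            .iter()
            .filter(|hash| !self.compiled.contains_key(*hash))
            .copied()
            .collect();
        missing.sort();
        missing.dedup();
        if missing.is_empty() {
            return ValidationOutcome::Validated;
        }
        for hash in &missing {
            self.pending_requests.insert(*hash);
        }
        ValidationOutcome::MissingCode(missing)
    }

    /// Handles a peer's response; a real node would verify the hash and compile the code here.
    fn on_code_response(&mut self, hash: CodeHash, code: Vec<u8>) {
        self.pending_requests.remove(&hash);
        self.compiled.insert(hash, code);
    }
}
```

In this sketch the first witness that touches an unknown contract yields `MissingCode`, which corresponds to the possible chunk miss; once the peer's response arrives through `on_code_response`, later calls to `try_validate` for the same contract return `Validated`.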

The project involves, but is not limited to, the following work:

walnut-the-cat commented 5 months ago

For now, @tayfunelmas will continue making progress on building the network message, but @Longarithm will pause and focus on #11124 until we have clear evidence that including contract code in the state witness does not work for the MVP launch. The relevant discussion can be found here: link

nagisa commented 2 months ago

Does this issue in principle imply that the contract code would no longer be a part of The State (i.e. no longer stored in the trie, with all that entails), or is this only a partial step towards such a future and the remaining work would need to be documented as a separate issue?