near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0
2.31k stars 615 forks source link

Crash in test validator_switch_key #11750

Open Trisfald opened 1 month ago

Trisfald commented 1 month ago

Describe the bug First, a recap of the test scenario:

The test fails because of a panic firing while processing chunk endorsements. For some reasons the node own endorsements aren't always 'valid' as previously thought. The validity check fails here.

I think the node could crash in production if something similar to the above happens. I'm afraid it might continue to crash until the next epoch kicks-in.

Now, the interesting part. The chunk endorsement is broken only for the first 2 epochs, starting from the 3 epoch everything is fine. So I guess it has something to do with the fact the validator key has been changed, and the change is not 'accounted for' immediately?

To Reproduce Run the test validator_switch_key, either on nayduck or locally. Example: python3 pytest/tests/sanity/validator_switch_key.py

Expected behavior The test passes. The node doesn't crash.

Screenshots None

Version (please complete the following information):

Additional context

Stacktrace ```bash 0: rust_begin_unwind at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:652:5 1: core::panicking::panic_fmt at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panicking.rs:72:14 2: core::result::unwrap_failed at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1654:5 3: core::result::Result::unwrap at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/result.rs:1077:23 4: near_client::stateless_validation::chunk_validator::send_chunk_endorsement_to_block_producers at ./chain/client/src/stateless_validation/chunk_validator/mod.rs:229:13 5: near_client::stateless_validation::state_witness_producer::::send_chunk_state_witness_to_chunk_validators at ./chain/client/src/stateless_validation/state_witness_producer.rs:66:13 6: near_client::client::Client::produce_chunks at ./chain/client/src/client.rs:1846:39 7: near_client::client::Client::on_block_accepted_with_optional_chunk_produce at ./chain/client/src/client.rs:1691:17 8: near_client::client::Client::postprocess_ready_blocks at ./chain/client/src/client.rs:1380:13 9: near_client::client_actor::ClientActorInner::try_process_unfinished_blocks at ./chain/client/src/client_actor.rs:1258:41 10: near_client::client_actor::ClientActorInner::check_triggers at ./chain/client/src/client_actor.rs:1192:9 11: ::wrap_handler at ./chain/client/src/client_actor.rs:291:9 12: >::handle at ./core/async/src/messaging.rs:53:9 13: as actix::handler::Handler>>::handle at ./core/async/src/actix_wrapper.rs:59:9 14: as actix::address::envelope::EnvelopeProxy>::handle at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/actix-0.13.0/src/address/envelope.rs:80:23 15: as actix::address::envelope::EnvelopeProxy>::handle at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/actix-0.13.0/src/address/envelope.rs:53:9 16: actix::mailbox::Mailbox::poll at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/actix-0.13.0/src/mailbox.rs:84:21 17: as core::future::future::Future>::poll at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/actix-0.13.0/src/contextimpl.rs:381:13 18: tokio::runtime::task::core::Core::poll::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/core.rs:311:17 19: tokio::loom::std::unsafe_cell::UnsafeCell::with_mut at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/loom/std/unsafe_cell.rs:14:9 20: tokio::runtime::task::core::Core::poll at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/core.rs:300:13 21: tokio::runtime::task::harness::poll_future::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/harness.rs:476:19 22: as core::ops::function::FnOnce<()>>::call_once at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/panic/unwind_safe.rs:272:9 23: std::panicking::try::do_call at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:559:40 24: std::panicking::try at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panicking.rs:523:19 25: std::panic::catch_unwind at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/panic.rs:149:14 26: tokio::runtime::task::harness::poll_future at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/harness.rs:464:18 27: tokio::runtime::task::harness::Harness::poll_inner at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/harness.rs:198:27 28: tokio::runtime::task::harness::Harness::poll at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/harness.rs:152:15 29: tokio::runtime::task::raw::poll at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/raw.rs:276:5 30: tokio::runtime::task::raw::RawTask::poll at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/raw.rs:200:18 31: tokio::runtime::task::LocalNotified::run at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/task/mod.rs:400:9 32: tokio::task::local::LocalSet::tick::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/task/local.rs:615:63 33: tokio::runtime::coop::with_budget at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/coop.rs:107:5 34: tokio::runtime::coop::budget at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/coop.rs:73:5 35: tokio::task::local::LocalSet::tick at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/task/local.rs:615:31 36: as core::future::future::Future>::poll::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/task/local.rs:927:16 37: tokio::task::local::LocalSet::with::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/task/local.rs:684:13 38: std::thread::local::LocalKey::try_with at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:286:12 39: std::thread::local::LocalKey::with at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:262:9 40: tokio::task::local::LocalSet::with at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/task/local.rs:667:9 41: as core::future::future::Future>::poll at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/task/local.rs:913:9 42: tokio::task::local::LocalSet::run_until::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/task/local.rs:573:19 43: as core::future::future::Future>::poll at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/core/src/future/future.rs:123:9 44: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}}::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:651:57 45: tokio::runtime::coop::with_budget at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/coop.rs:107:5 46: tokio::runtime::coop::budget at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/coop.rs:73:5 47: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}}::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:651:25 48: tokio::runtime::scheduler::current_thread::Context::enter at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:410:19 49: tokio::runtime::scheduler::current_thread::CoreGuard::block_on::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:650:36 50: tokio::runtime::scheduler::current_thread::CoreGuard::enter::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:729:68 51: tokio::runtime::context::scoped::Scoped::set at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/context/scoped.rs:40:9 52: tokio::runtime::context::set_scheduler::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/context.rs:176:26 53: std::thread::local::LocalKey::try_with at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:286:12 54: std::thread::local::LocalKey::with at /rustc/129f3b9964af4d4a709d1383930ade12dfe7c081/library/std/src/thread/local.rs:262:9 55: tokio::runtime::context::set_scheduler at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/context.rs:176:9 56: tokio::runtime::scheduler::current_thread::CoreGuard::enter at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:729:27 57: tokio::runtime::scheduler::current_thread::CoreGuard::block_on at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:638:19 58: tokio::runtime::scheduler::current_thread::CurrentThread::block_on::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:175:28 59: tokio::runtime::context::runtime::enter_runtime at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/context/runtime.rs:65:16 60: tokio::runtime::scheduler::current_thread::CurrentThread::block_on at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/scheduler/current_thread.rs:167:9 61: tokio::runtime::runtime::Runtime::block_on at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/runtime/runtime.rs:311:47 62: tokio::task::local::LocalSet::block_on at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/tokio-1.29.1/src/task/local.rs:534:9 63: actix_rt::runtime::Runtime::block_on at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/actix-rt-2.7.0/src/runtime.rs:80:9 64: actix_rt::arbiter::Arbiter::with_tokio_rt::{{closure}} at /home/aspurio/.cargo/registry/src/index.crates.io-6f17d22bba15001f/actix-rt-2.7.0/src/arbiter.rs:144:21 ```
staffik commented 1 month ago

@jancionear I believe you assigned me because I was working on hot-swap validator key (pytest/tests/sanity/validator_switch_key_quick.py). But this is entirely different logic.

@Trisfald are you still looking into this nayduck test? If not, I can take it.

Trisfald commented 1 month ago

Hey @staffik please feel free to have a look, I did try to dig into it a little bit, which brought me to #11751. The latter doesn't solve the root cause since it just avoid a crash. This's enough to make the test "pass" and to make a small network "work". But I'm afraid there might be more subtle issues hidden in there

jancionear commented 1 month ago

@jancionear I believe you assigned me because I was working on hot-swap validator key (pytest/tests/sanity/validator_switch_key_quick.py). But this is entirely different logic.

Ah sorry I confused it with test_validator_switch_key_quick

shreyan-gupta commented 1 month ago

If the failure happens within verify_chunk_endorsement, I suspect it might be because epoch_info isn't properly initialized for the current and next epoch after the key switch. This also explains why we are seeing the error only in the first two epochs and not after.

Trisfald commented 1 month ago

11751 fixes the "node crashing", but I think the root cause is not yet fixed

After #11751 the validator will fail to validate its own endorsements for 1 epoch, but it won't crash

A further improvement would be to initialize epoch_info (or the relevant struct) properly after node restart, so that own endorsements are always valid.