matter-labs / zksync-era

zkSync era
Apache License 2.0
3.08k stars 2.07k forks source link

external-node Unknown rpc request causes the node to crash #75

Open rayn316 opened 1 year ago

rayn316 commented 1 year ago

I don't know what request caused it, I only found the log before the crash.

log info

2023-08-09T09:22:37.567529Z  INFO zksync_core::api_server::execution_sandbox::vm_metrics: Tx 0x0000000000000000000000000000000000000000-0 resulted in 1667 storage_invocations, 357 new_storage_invocations, 1663 get_value_storage_invocations, 4 set_value_storage_invocations, vm execution tool 49.016566ms, storage interaction took 20.88678ms (missed: 20.679755ms get: 20.731448ms set: 155.332µs)
2023-08-09T09:22:42.661222Z  INFO zksync_core::sync_layer::fetcher: New miniblock: 10448750 / 10448750
2023-08-09T09:22:42.741000Z  INFO zksync_core::sync_layer::external_io: Sealing miniblock
2023-08-09T09:22:42.741743Z  INFO zksync_core::state_keeper::io::seal_logic: Sealing miniblock 10448750 (L1 batch 123563) with 1 (1 L2 + 0 L1) txs, 3 events, 75 reads, 11 writes
2023-08-09T09:22:42.787278Z  INFO zksync_core::sync_layer::external_io: Miniblock 10448751 is sealed
thread 'tokio-runtime-worker' panicked at 'Unexpected error in the fetcher: RPC call failed: ErrorObject { code: InternalError, message: "Internal error", data: None }', /usr/src/zksync/core/bin/zksync_core/src/sync_layer/fetcher.rs:86:21
stack backtrace:
   0:     0x56148e4e832a - std::backtrace_rs::backtrace::libunwind::trace::h79937bc171ada62c
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x56148e4e832a - std::backtrace_rs::backtrace::trace_unsynchronized::h2292bca8571cb919
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x56148e4e832a - std::sys_common::backtrace::_print_fmt::h9c461f248e4ae90d
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:65:5
   3:     0x56148e4e832a - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::he9fe6bf1a39182e1
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x56148e50d40e - core::fmt::write::h032658c119c720d7
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/fmt/mod.rs:1208:17
   5:     0x56148e4e2965 - std::io::Write::write_fmt::h299fc90dfae41c0d
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/io/mod.rs:1682:15
   6:     0x56148e4e80f5 - std::sys_common::backtrace::_print::heb70d25df9937e3f
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:47:5
   7:     0x56148e4e80f5 - std::sys_common::backtrace::print::had745c0a76b8b521
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:34:9
   8:     0x56148e4e9c2f - std::panicking::default_hook::{{closure}}::h1ea782cdfa2fd097
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:267:22
   9:     0x56148e4e996b - std::panicking::default_hook::h1cc3af63455a163c
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:286:9
  10:     0x56148e4ea33c - std::panicking::rust_panic_with_hook::h5cafdc4b3bfd5528
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:688:13
  11:     0x56148e4ea0d9 - std::panicking::begin_panic_handler::{{closure}}::hf31c60f40775892c
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:579:13
  12:     0x56148e4e87dc - std::sys_common::backtrace::__rust_end_short_backtrace::h28a5c7be595826cd
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys_common/backtrace.rs:137:18
  13:     0x56148e4e9de2 - rust_begin_unwind
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/panicking.rs:575:5
  14:     0x56148cc5d473 - core::panicking::panic_fmt::h8fa27a0b37dd98b7
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/core/src/panicking.rs:64:14
  15:     0x56148cd92a52 - zksync_core::sync_layer::fetcher::MainNodeFetcher::run::{{closure}}::h5fed416b84586fcf
  16:     0x56148cdd8555 - tokio::loom::std::unsafe_cell::UnsafeCell<T>::with_mut::ha65c9cd24ae0eb7d
  17:     0x56148cf62127 - tokio::runtime::task::core::Core<T,S>::poll::h9bda204f98de056f
  18:     0x56148cde0d9e - tokio::runtime::task::harness::Harness<T,S>::poll::h5f7c048a5bb9247a
  19:     0x56148e4a256e - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::h6fb307488dc375ee
  20:     0x56148e4a16c3 - tokio::runtime::scheduler::multi_thread::worker::Context::run::h175658cfae89590d
  21:     0x56148e48ddf9 - tokio::macros::scoped_tls::ScopedKey<T>::set::h2b771be14cbef94d
  22:     0x56148e4a13c9 - tokio::runtime::scheduler::multi_thread::worker::run::ha39c9ec4dce0c89d
  23:     0x56148e4ab1e8 - tokio::runtime::task::core::Core<T,S>::poll::h6feb9b8dd5ca027e
  24:     0x56148e481bdf - tokio::runtime::task::harness::Harness<T,S>::poll::hb45689cdecd9b901
  25:     0x56148e48f888 - tokio::runtime::blocking::pool::Inner::run::hc09fcd48a7633fbf
  26:     0x56148e490d7a - std::sys_common::backtrace::__rust_begin_short_backtrace::hf8e4ebb56e2acb86
  27:     0x56148e4a3a8b - core::ops::function::FnOnce::call_once{{vtable.shim}}::h4834ddb4b804c368
  28:     0x56148e4ed713 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hb77d8d72ebcf79c4
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/boxed.rs:2000:9
  29:     0x56148e4ed713 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hc08c3353e1568487
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/alloc/src/boxed.rs:2000:9
  30:     0x56148e4ed713 - std::sys::unix::thread::Thread::new::thread_start::h7168e596cd5e5ce6
                               at /rustc/d5a82bbd26e1ad8b7401f6a718a9c57c96905483/library/std/src/sys/unix/thread.rs:108:17
  31:     0x7f51db583fa3 - start_thread
  32:     0x7f51db32b06f - clone
  33:                0x0 - <unknown>
2023-08-09T09:22:46.579636Z  INFO zksync_core::metadata_calculator::updater: Stop signal received, metadata_calculator is shutting down
2023-08-09T09:22:46.580190Z  INFO zksync_storage::db: Waiting for all the RocksDB instances to be dropped, 2 remaining
2023-08-09T09:22:46.582876Z  INFO zksync_core::api_server::web3: Stop signal received, WS JSON RPC API is shutting down
2023-08-09T09:22:46.633009Z  INFO zksync_storage::db: Waiting for all the RocksDB instances to be dropped, 1 remaining
2023-08-09T09:22:46.689619Z  INFO zksync_core::api_server::web3::pubsub_notifier: Stop signal received, pubsub_tx_notifier is shutting down
2023-08-09T09:22:46.689687Z  INFO zksync_core::api_server::web3::pubsub_notifier: Stop signal received, pubsub_block_notifier is shutting down
2023-08-09T09:22:46.695252Z  INFO zksync_core::api_server::web3::pubsub_notifier: Stop signal received, pubsub_logs_notifier is shutting down
2023-08-09T09:22:46.829614Z  INFO zksync_core::state_keeper::batch_executor: State keeper exited with an unfinished batch
2023-08-09T09:22:46.829779Z  INFO zksync_core::state_keeper::keeper: Stop signal received, state keeper is shutting down
2023-08-09T09:22:46.843517Z  INFO zksync_storage::db: All the RocksDB instances are dropped
2023-08-09T09:22:47.138197Z  INFO zksync_core::sync_layer::batch_status_updater: Stop signal received, exiting the batch status updater routine
2023-08-09T09:22:50.328840Z  INFO zksync_core::consistency_checker: Stop signal received, consistency_checker is shutting down
2023-08-09T09:22:50.513969Z  INFO zksync_core::l1_gas_price::main_node_fetcher: Stop signal received, MainNodeGasPriceFetcher is shutting down
james0p0 commented 1 year ago

Hello @sundafa May I know how could you start the external node, I followed the instruction on GitHub, but it crash right away. Thank you in advance.

rayn316 commented 1 year ago

@james0p0 i don't know how to reply you. Running this node requires a database snapshot file to run, maybe you should contact the official

githubdoramon commented 1 year ago

Hello @sundafa May I know how could you start the external node, I followed the instruction on GitHub, but it crash right away. Thank you in advance.

@james0p0 the node can't be run permissionlessly yet, but will be soon.

popzxc commented 11 months ago

To the original issue: RPC errors indeed shouldn't crash the node. We'll investigate. Thanks for the report.

RedaOps commented 11 months ago

To the original issue: RPC errors indeed shouldn't crash the node. We'll investigate. Thanks for the report.

Seems like this is an issue with the EN when syncing from the main node. Maybe an unhandled error type when calling a certain rpc method on the main node while fetching?

The fetcher can handle 2 types of RpcError errors:

https://github.com/matter-labs/zksync-era/blob/38fb8787c7255dba54a345b3edbee47f335f0862/core/lib/zksync_core/src/sync_layer/fetcher.rs#L110-L120

But could a different type of error get there? Something like ErrorCode::InternalError or Web3Error::InternalError if the jsonrpc request fails on the server, without it being a network error?

If the following line unwraps to an Err, and the Err is not Some(err @ RpcError::Transport(_) | err @ RpcError::RequestTimeout), the Unexpected error in the fetcher error that is shown in the original isssue's logs could be reached? We can follow the error propagation chain:

https://github.com/matter-labs/zksync-era/blob/38fb8787c7255dba54a345b3edbee47f335f0862/core/lib/zksync_core/src/sync_layer/fetcher.rs#L136

https://github.com/matter-labs/zksync-era/blob/38fb8787c7255dba54a345b3edbee47f335f0862/core/lib/zksync_core/src/sync_layer/client.rs#L138-L141

https://github.com/matter-labs/zksync-era/blob/38fb8787c7255dba54a345b3edbee47f335f0862/core/lib/zksync_core/src/api_server/web3/backend_jsonrpsee/namespaces/eth.rs#L23-L25

https://github.com/matter-labs/zksync-era/blob/38fb8787c7255dba54a345b3edbee47f335f0862/core/lib/zksync_core/src/api_server/web3/backend_jsonrpsee/mod.rs#L15-L42

Also, this type of error handling also happens in the reorg detector, so maybe a similar kind of panic can also happen there?

https://github.com/matter-labs/zksync-era/blob/38fb8787c7255dba54a345b3edbee47f335f0862/core/lib/zksync_core/src/reorg_detector/mod.rs#L127-L141

Please keep in mind I am new to this codebase and don't have the snapshot so cannot run the external node myself to test everything, but hope this info at least makes debugging a little easier.

If I missed anything or got something wrong please point it out. Thanks!

RomanBrodetski commented 11 months ago

Hello @RedaOps thank you very much for a deep dive there! We'll analyse this and get back to you and to this issue.

RedaOps commented 11 months ago

@RomanBrodetski If it's confirmed that the bug is caused by my comment above, how could I get the snapshot to run my local external node and test stuff? We can decide on a fix and I can open a PR that fixes this. Also, there is a comment there that a fibonacci backoff could be implemented for network errors: I could implement that as well if the code will not become deprecated in favor of the consensus node.

SD170 commented 8 months ago

Hello @sundafa May I know how could you start the external node, I followed the instruction on GitHub, but it crash right away. Thank you in advance.

@james0p0 the node can't be run permissionlessly yet, but will be soon.

So, can we run the external node permissionlessly now??