iotaledger / bee

A framework for IOTA nodes, clients and applications in Rust
Apache License 2.0
273 stars 83 forks source link

Crash after x hours running during testing autopeering #1002

Closed TeeVeeEss closed 2 years ago

TeeVeeEss commented 2 years ago

Bug description

Testing the autopeering. After 5 hours running the node crashes

Rust version

Which version of Rust are you running?

Hardware specification

What hardware are you using?

Steps To reproduce the bug

  1. Enable autopeering
  2. When needed add 1 static peer to get synced
  3. After x hours the node crashes. Seen it twice now, once after 12 hours, other after 5 hours.

Expected behaviour

Node should be finding enough peers to keep synced

Actual behaviour

Crash after x hours.

Errors

tokio-runtime-worker' panicked at 'error sending response signal: [10, 32, 133, 47, 198, 27, 9, 63, 90, 178, 253, 236, 228, 11, 47, 209, 146, 16, 186, 3, 121, 180, 161, 205, 112, 192, 118, 49, 238, 23, 181, 98, 5, 215, 18, 52, 10, 19, 10, 7, 112, 101, 101, 114, 105, 110, 103, 18, 8, 10, 3, 117, 100, 112, 16, 162, 114, 10, 29, 10, 17, 99, 104, 114, 121, 115, 97, 108, 105, 115, 45, 109, 97, 105, 110, 110, 101, 116, 18, 8, 10, 3, 116, 99, 112, 16, 240, 121, 26, 12, 57, 52, 46, 50, 49, 50, 46, 51, 49, 46, 50, 52]', C:\Users\user\.cargo\git\checkouts\bee-781893e298f3299f\10914be\bee-network\bee-autopeering\src\discovery\manager.rs:686:14 Last ~700 lines from debug-logging and actual crash in attached logfile. config.toml also attached as .txt bee_crash_autopeering_testing.log config_anon.toml.txt

thibault-martinez commented 2 years ago

Hi @TeeVeeEss did this happen again ?

TeeVeeEss commented 2 years ago

Tried to update to latest dev but having a build error:

error[E0282]: type annotations needed
   --> C:\Users\user\.cargo\registry\src\github.com-1ecc6299db9ec823\libp2p-noise-0.32.0\src\protocol\x25519.rs:221:45
    |
221 |         curve25519_sk.copy_from_slice(&hash.as_ref()[..32]);
    |                                        -----^^^^^^--
    |                                        |    |
    |                                        |    cannot infer type for type parameter `T` declared on the trait `AsRef`
    |                                        this method call resolves to `&T`
    |
    = note: type must be known at this point

   Compiling libp2p-mplex v0.29.0
   Compiling libp2p-yamux v0.33.0
For more information about this error, try `rustc --explain E0282`.
error: could not compile `libp2p-noise` due to previous error
warning: build failed, waiting for other jobs to finish...
error: build failed
PS D:\github\bee> git show
commit e836fde6de7b9df4bc0a6e447f67b9cab16c76ce (HEAD -> dev, tag: v0.3.0-rc4, origin/dev)
Author: Thibault Martinez <thibault@iota.org>
Date:   Fri Feb 11 18:19:35 2022 +0100
Date:   Fri Feb 11 18:19:35 2022 +0100
Date:   Fri Feb 11 18:19:35 2022 +0100
Date:   Fri Feb 11 18:19:35 2022 +0100
Date:   Fri Feb 11 18:19:35 2022 +0100
Date:   Fri Feb 11 18:19:35 2022 +0100

    bee-node-0.3.0-rc4 (#1131)

diff --git a/Cargo.lock b/Cargo.lock
index 58ad41ad..691578ec 100644
--- a/Cargo.lock
+++ b/Cargo.lock
@@ -315,7 +315,7 @@ dependencies = [

 [[package]]
 name = "bee-node"
-version = "0.3.0-rc3"
+version = "0.3.0-rc4"
 dependencies = [
  "anymap",
  "async-trait",
diff --git a/bee-node/Cargo.toml b/bee-node/Cargo.toml
index daf2deb4..075d06b2 100644
--- a/bee-node/Cargo.toml
+++ b/bee-node/Cargo.toml
@@ -1,6 +1,6 @@
 [package]
 name = "bee-node"
-version = "0.3.0-rc3"
+version = "0.3.0-rc4"
 authors = [ "IOTA Stiftung" ]
 edition = "2021"
 description = "Default node implementations."
thibault-martinez commented 2 years ago

Try to update your rust version ?

TeeVeeEss commented 2 years ago
PS D:\github\bee\bee-node> rustup update
info: syncing channel updates for 'stable-x86_64-pc-windows-msvc'
info: checking for self-updates

  stable-x86_64-pc-windows-msvc unchanged - rustc 1.58.1 (db9d1b20b 2022-01-20)

info: cleaning up downloads & tmp directories`

Looks fine I guess?

TeeVeeEss commented 2 years ago

Did some debugging. Seems that running cargo update after the git pull is causing the build error. Node is now running version 0.3.0-rc4-e836fde, but still having troubles (getting autopeers, but no traffic with that peers) with my network setup on this W11-host.

TeeVeeEss commented 2 years ago

Another crash with what looks like the same error earlier reported in this issue, see attached file.

2022-02-16 05:05:12 (UTC) bee_autopeering::discovery::manager        DEBUG Verification response timeout for 12D3KooWDeQbrdgC: deadline has elapsed
thread '2022-02-16 05:05:12 (UTC) bee_autopeering::discovery::query          DEBUG Failed to reverify 12D3KooWDeQbrdgC. Removing peer.
tokio-runtime-worker' panicked at 'error sending response signal: [10, 32, 151, 147, 202, 62, 173, 161, 216, 28, 228, 209, 134, 120, 178, 207, 72, 221, 215, 74, 224, 135, 213, 145, 161, 199, 206, 134, 150, 187, 204, 183, 46, 128, 18, 52, 10, 29, 10, 17, 99, 104, 114, 121, 115, 97, 108, 105, 115, 45, 109, 97, 105, 110, 110, 101, 116, 18, 8, 10, 3, 116, 99, 112, 16, 244, 121, 10, 19, 10, 7, 112, 101, 101, 114, 105, 110, 103, 18, 8, 10, 3, 117, 100, 112, 16, 164, 114, 26, 12, 57, 52, 46, 50, 49, 50, 46, 51, 49, 46, 50, 52]', bee-network\bee-autopeering\src\discovery\manager.rs:686:14

versions:

PS D:\github\bee\bee-node> ..\target\release\bee.exe -v
0.3.0-rc4-e836fde
PS D:\github\bee\bee-node> cargo -V
cargo 1.58.0 (f01b232bc 2022-01-19)
PS D:\github\bee\bee-node> rustc -V
rustc 1.58.1 (db9d1b20b 2022-01-20)

bee_crash_2_autopeering_testing.log

TeeVeeEss commented 2 years ago

Solved my network issues (portforwarding with 2 nics is tricky ;) ). The bee is now getting traffic from some autopeers, but also still a lot of autopeers without traffic. Will let it run now for some time to see if it is stable.

TeeVeeEss commented 2 years ago

Did some more debugging. Looks like bee running native on W11 is having trouble to keep or to build the connection to autopeers, lot's of messages like this:

022-02-19 16:42:23 (UTC) bee_gossip::service::host                  INFO  Connected peers: known 1/1 unknown 2/4 discovered 0/8.
2022-02-19 16:42:23 (UTC) bee_gossip::service::host                  INFO  Trying to reconnect to: DzyszB (DzyszB).
2022-02-19 16:42:23 (UTC) bee_gossip::service::host                  INFO  Trying to reconnect to: StYovd (StYovd).
2022-02-19 16:42:23 (UTC) bee_gossip::service::host                  INFO  Trying to reconnect to: mUi9Dn (mUi9Dn).
2022-02-19 16:42:23 (UTC) bee_gossip::service::host                  INFO  Trying to reconnect to: HEyAEj (HEyAEj).
2022-02-19 16:42:23 (UTC) bee_gossip::service::host                  INFO  Trying to reconnect to: vrkBRi (vrkBRi).
2022-02-19 16:42:23 (UTC) bee_gossip::network::host                  INFO  Dialing peer: DzyszB (DzyszB).
2022-02-19 16:42:23 (UTC) bee_gossip::swarm::protocols::iota_gossip::protocol DEBUG gossip protocol: new handler (1782).
2022-02-19 16:42:23 (UTC) bee_gossip::network::host                  INFO  Dialing peer: StYovd (StYovd).
2022-02-19 16:42:23 (UTC) bee_gossip::swarm::protocols::iota_gossip::protocol DEBUG gossip protocol: new handler (1783).
2022-02-19 16:42:23 (UTC) bee_gossip::network::host                  INFO  Dialing peer: mUi9Dn (mUi9Dn).
2022-02-19 16:42:23 (UTC) libp2p_dns                                 DEBUG Dialing /ip4/78.47.156.54/tcp/15600
2022-02-19 16:42:23 (UTC) bee_gossip::swarm::protocols::iota_gossip::protocol DEBUG gossip protocol: new handler (1784).
2022-02-19 16:42:23 (UTC) libp2p_tcp                                 DEBUG dialing 78.47.156.54:15600
2022-02-19 16:42:23 (UTC) bee_gossip::network::host                  INFO  Dialing peer: HEyAEj (HEyAEj).
2022-02-19 16:42:23 (UTC) libp2p_dns                                 DEBUG Dialing /ip4/135.181.149.155/tcp/15600
2022-02-19 16:42:23 (UTC) bee_gossip::swarm::protocols::iota_gossip::protocol DEBUG gossip protocol: new handler (1785).
2022-02-19 16:42:23 (UTC) libp2p_tcp                                 DEBUG dialing 135.181.149.155:15600
2022-02-19 16:42:23 (UTC) bee_gossip::network::host                  INFO  Dialing peer: vrkBRi (vrkBRi).
2022-02-19 16:42:23 (UTC) libp2p_dns                                 DEBUG Dialing /ip4/202.61.193.177/tcp/15600
2022-02-19 16:42:23 (UTC) bee_gossip::swarm::protocols::iota_gossip::protocol DEBUG gossip protocol: new handler (1786).
2022-02-19 16:42:23 (UTC) libp2p_tcp                                 DEBUG dialing 202.61.193.177:15600
2022-02-19 16:42:23 (UTC) libp2p_dns                                 DEBUG Dialing /ip4/144.91.116.35/tcp/15600
2022-02-19 16:42:23 (UTC) libp2p_dns                                 DEBUG Dialing /ip4/5.161.49.154/tcp/15600
2022-02-19 16:42:23 (UTC) libp2p_tcp                                 DEBUG dialing 144.91.116.35:15600
2022-02-19 16:42:23 (UTC) libp2p_tcp                                 DEBUG dialing 5.161.49.154:15600
2022-02-19 16:42:28 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWPjSk4PY9: Peer is too far away.
2022-02-19 16:42:29 (UTC) bee_autopeering::discovery::query          DEBUG Reverified 12D3KooWSZSnWF4o. Peer offers 2 service/s: peering/udp/14626;chrysalis-mainnet/tcp/15600
2022-02-19 16:42:29 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWPjSk4PY9: Peer is too far away.
2022-02-19 16:42:30 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWPjSk4PY9: Peer is too far away.
2022-02-19 16:42:32 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWFqC4TM3L: Peer is too far away.
2022-02-19 16:42:33 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWFqC4TM3L: Peer is too far away.
2022-02-19 16:42:33 (UTC) bee_protocol::workers::metrics             INFO  NodeMetrics { invalid_packets: 0, milestone_requests_received: 7, messages_received: 399021, message_requests_received: 0, heartbeats_received: 6402, milestone_requests_sent: 469, messages_sent: 272059, message_requests_sent: 58547, heartbeats_sent: 7118, invalid_messages: 0, new_messages: 156046, known_messages: 242975, messages_average_latency: 122, referenced_messages: 155834, excluded_no_transaction_messages: 155265, excluded_conflicting_messages: 91, included_messages: 478, created_outputs: 691, consumed_outputs: 691, receipts: 3, transaction_payloads: 570, milestone_payloads: 1216, indexation_payloads: 154660, snapshots: 0, prunings: 0 }
2022-02-19 16:42:33 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWE9zFA7wE: Peer is too far away.
2022-02-19 16:42:33 (UTC) libp2p_swarm                               DEBUG Connection attempt to address "/ip4/78.47.156.54/tcp/15600" of unknown peer failed with Transport(Other(Custom { kind: Other, error: Timeout }))
2022-02-19 16:42:33 (UTC) libp2p_swarm                               DEBUG Connection attempt to address "/ip4/135.181.149.155/tcp/15600" of unknown peer failed with Transport(Other(Custom { kind: Other, error: Timeout }))
2022-02-19 16:42:33 (UTC) libp2p_swarm                               DEBUG Connection attempt to address "/ip4/5.161.49.154/tcp/15600" of unknown peer failed with Transport(Other(Custom { kind: Other, error: Timeout }))
2022-02-19 16:42:33 (UTC) libp2p_swarm                               DEBUG Connection attempt to address "/ip4/202.61.193.177/tcp/15600" of unknown peer failed with Transport(Other(Custom { kind: Other, error: Timeout }))
2022-02-19 16:42:33 (UTC) libp2p_swarm                               DEBUG Connection attempt to address "/ip4/144.91.116.35/tcp/15600" of unknown peer failed with Transport(Other(Custom { kind: Other, error: Timeout }))
2022-02-19 16:42:34 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWFqC4TM3L: Peer is too far away.
2022-02-19 16:42:34 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWE9zFA7wE: Peer is too far away.
2022-02-19 16:42:35 (UTC) bee_autopeering::peering::manager          DEBUG Denying peering request from 12D3KooWE9zFA7wE: Peer is too far away.
2022-02-19 16:42:36 (UTC) bee_autopeering::discovery::query          INFO  Querying 2 peer/s...
2022-02-19 16:42:36 (UTC) bee_autopeering::discovery::query          DEBUG Query successful. Received 6 peers.
2022-02-19 16:42:36 (UTC) bee_autopeering::discovery::manager        DEBUG Added (unverified): 12D3KooWGZ1PnsCc.
2022-02-19 16:42:36 (UTC) bee_autopeering::discovery::query          DEBUG Query successful. Received 6 peers.
2022-02-19 16:42:40 (UTC) bee_autopeering::discovery::manager        DEBUG Verification response timeout for 12D3KooWFAo8yTjU: deadline has elapsed
2022-02-19 16:42:40 (UTC) bee_autopeering::discovery::query          DEBUG Failed to reverify 12D3KooWFAo8yTjU. Removing peer.
2022-02-19 16:42:41 (UTC) bee_autopeering::discovery::manager        DEBUG Added unknown and unverified 12D3KooWKwYHn3Jw.

Running hornet on the same host with the same ports (either in WSL or in a VM) has no problems with autopeering, so the network setup should be good. Sometimes a connected autopeer stays connected and generates traffic but most are quick disconnected, see attached video

https://user-images.githubusercontent.com/17521100/154810680-b717cdaa-ead1-4eb5-9e2d-e04ff3bfac5b.mp4

Usually ending with a lot of peers in an unconnnected state with an unknown relation:

curl http://192.168.178.165:14268/api/v1/peers|jq '.data[] | {connected: .connected, relation: .relation}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  7042  100  7042    0     0   982k      0 --:--:-- --:--:-- --:--:--  982k
{
  "connected": false,
  "relation": "unknown"
}
{
  "connected": false,
  "relation": "autopeered"
}
{
  "connected": true,
  "relation": "known"
}
{
  "connected": false,
  "relation": "unknown"
}
{
  "connected": false,
  "relation": "unknown"
}
{
  "connected": false,
  "relation": "autopeered"
}
{
  "connected": false,
  "relation": "unknown"
}
{
  "connected": false,
  "relation": "autopeered"
}
{
  "connected": false,
  "relation": "autopeered"
}
{
  "connected": false,
  "relation": "autopeered"
}
{
  "connected": false,
  "relation": "unknown"
}
{
  "connected": false,
  "relation": "unknown"
}

Hope you can do something with this info.

thibault-martinez commented 2 years ago

Yeah I noticed the same regarding the disconnected peers, I think @Alex6323 is on it. Will update you ASAP.

TeeVeeEss commented 2 years ago

A few updates later, running v0.3.0-rc6 now, but stiill almost always ending in crash after some hours with this stacktrace:

2022-03-24 13:38:21 (UTC) bee_autopeering::peering::manager          DEBUG Peering rejected by Bw3s9arKc1MTrJv2.
2022-03-24 13:38:22 (UTC) bee_autopeering::peering::manager          DEBUG Peering response timeout: deadline has elapsed
2022-03-24 13:38:22 (UTC) bee_autopeering::peering::manager          DEBUG Peering rejected by EQqFppryqZTjefvF.
thread 'tokio-runtime-worker' panicked at 'error sending response signal: [10, 32, 21, 234, 141, 114, 236, 165, 129, 81, 123, 218, 5, 18, 185, 50, 9, 97, 74, 179, 241, 198, 233, 197, 189, 200, 110, 205, 18, 241, 235, 135, 104, 227]', D:\github\bee\bee-network\bee-autopeering\src\peering\manager.rs:411:41
stack backtrace:
   0:     0x7ff709dadcff - std::backtrace_rs::backtrace::dbghelp::trace
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\..\..\backtrace\src\backtrace\dbghelp.rs:98
   1:     0x7ff709dadcff - std::backtrace_rs::backtrace::trace_unsynchronized
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\..\..\backtrace\src\backtrace\mod.rs:66
   2:     0x7ff709dadcff - std::sys_common::backtrace::_print_fmt
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\sys_common\backtrace.rs:67
   3:     0x7ff709dadcff - std::sys_common::backtrace::_print::impl$0::fmt
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\sys_common\backtrace.rs:46
   4:     0x7ff709dc90da - core::fmt::write
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\core\src\fmt\mod.rs:1168
   5:     0x7ff709da5208 - std::io::Write::write_fmt<std::sys::windows::stdio::Stderr>
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\io\mod.rs:1660
   6:     0x7ff709db031b - std::sys_common::backtrace::_print
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\sys_common\backtrace.rs:49
   7:     0x7ff709db031b - std::sys_common::backtrace::print
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\sys_common\backtrace.rs:36
   8:     0x7ff709db031b - std::panicking::default_hook::closure$1
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\panicking.rs:211
   9:     0x7ff709dafe14 - std::panicking::default_hook
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\panicking.rs:228
  10:     0x7ff709db081c - std::panicking::rust_panic_with_hook
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\panicking.rs:606
  11:     0x7ff709db06db - std::panicking::begin_panic_handler::closure$0
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\panicking.rs:502
  12:     0x7ff709dae647 - std::sys_common::backtrace::__rust_end_short_backtrace<std::panicking::begin_panic_handler::closure$0,never$>
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\sys_common\backtrace.rs:139
  13:     0x7ff709db03d9 - std::panicking::begin_panic_handler
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\panicking.rs:498
  14:     0x7ff70a0ccaf0 - core::panicking::panic_fmt
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\core\src\panicking.rs:116
  15:     0x7ff70a0ccc03 - core::result::unwrap_failed
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\core\src\result.rs:1690
  16:     0x7ff70971e21b - bee_autopeering::peering::manager::handle_peering_response::h65c3f49fc070023a
  17:     0x7ff709898ea9 - <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll::hbe74cc8b0959d24f
  18:     0x7ff709652a18 - tokio::runtime::task::core::CoreStage<T>::poll::hf6321ef978569fcf
  19:     0x7ff7095fbefe - tokio::runtime::task::harness::Harness<T,S>::complete::hfde9048222afa22a
  20:     0x7ff7095e23b4 - tokio::runtime::task::harness::Harness<T,S>::poll::hf41bd4353f9b56c3
  21:     0x7ff709cd34ac - std::thread::local::LocalKey<T>::with::h0be12a2373c7ffe0
  22:     0x7ff709ce5992 - tokio::runtime::thread_pool::worker::Context::run::hcb008923b1c05108
  23:     0x7ff709ce53f5 - tokio::runtime::thread_pool::worker::Context::run::hcb008923b1c05108
  24:     0x7ff709ce4a44 - tokio::runtime::thread_pool::worker::run::hfb1b7ba9d8cf47a8
  25:     0x7ff709ce3b81 - tokio::runtime::task::core::CoreStage<T>::poll::h4af46a4368bfb80e
  26:     0x7ff709ce0efa - tokio::runtime::task::harness::Harness<T,S>::poll::h057e348125901bcb
  27:     0x7ff709cd27ee - tokio::runtime::blocking::pool::Inner::run::h956e25c6e52065f6
  28:     0x7ff709cdca9a - std::sys_common::backtrace::__rust_begin_short_backtrace::hdf91de245b37dc13
  29:     0x7ff709cd8f23 - <tokio::util::slab::Ref<T> as core::ops::drop::Drop>::drop::h7450df57532a978d
  30:     0x7ff709db542c - alloc::boxed::impl$44::call_once
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\library\alloc\src\boxed.rs:1854
  31:     0x7ff709db542c - alloc::boxed::impl$44::call_once
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\library\alloc\src\boxed.rs:1854
  32:     0x7ff709db542c - std::sys::windows::thread::impl$0::new::thread_start
                               at /rustc/9d1b2106e23b1abd32fce1f17267604a5102f57a\/library\std\src\sys\windows\thread.rs:58
  33:     0x7ffcac4f54e0 - BaseThreadInitThunk
  34:     0x7ffcad72485b - RtlUserThreadStart
Alex6323 commented 2 years ago

Can you please add this section to your "config.toml" or the equivalent to your "config.json":

[[logger.outputs]]
name              = "autopeering.log"
level_filter      = "trace"
target_filters    = ["bee_autopeering"]
target_exclusions = []

and send me the log after it crashed through Discord or some other channel? That would be very helpful as the above stack-trace does not show everything I need to track down the issue.

Alex6323 commented 2 years ago

@TeeVeeEss Can you build the node with PR #1276 and report on the crash issue please?

TeeVeeEss commented 2 years ago

Sorry was very busy with day-time job combined with severe cold (non-covid happily). Will add the logging settings and create a new issue when needed.