deevope opened this issue 2 years ago
This image shows file descriptors being left open in /proc/[pid]/fd, which, when counted, land exactly at the user FD limit.
To reproduce, I'd try running a node on a closed network and keep an eye on the number of socket descriptors in /proc/[pid]/fd, then connect/disconnect peers while checking the count of file descriptors in that directory. Whatever fix is needed can then be tested the same way.
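The fd count described above can also be checked programmatically. A minimal Linux-only Rust sketch (the `count_open_fds` helper is mine, not grin's) that reads `/proc/self/fd`:

```rust
use std::fs;

/// Count open file descriptors of the current process via /proc (Linux only).
fn count_open_fds() -> usize {
    fs::read_dir("/proc/self/fd")
        .map(|entries| entries.count())
        .unwrap_or(0)
}

fn main() {
    let before = count_open_fds();
    // Opening a file raises the count by one; a leaking node would show
    // this number climbing toward the ulimit as peers connect/disconnect.
    let _f = fs::File::open("/proc/self/status").expect("open");
    let after = count_open_fds();
    println!("before={} after={}", before, after);
    assert!(after > before);
}
```

Running this in a loop against a live node's `/proc/[pid]/fd` (with appropriate permissions) would show whether descriptors are returned after peers disconnect.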
I would focus on trying to reproduce this on the pibd_impl branch, since it's due to be merged (and won't be merged until this is addressed).
I got around to trying this on my own server and decided to follow the note above. However, I apparently do not fully understand how peer connection works. In the config I have the following settings:
[server.p2p_config]
host = "0.0.0.0"
port = 3414
But when I run:
./grin client unban -p 0.0.0.0:3414
or
./grin client unban -p 127.0.0.1:3414
no new peers appear on my grin server. I'm running the grin server on usernet.
I have reproduced this after running a fresh node for a couple of weeks; we can see the file handles being left open:
The last message in the logs before this starts happening has to do with header sync being enabled. I've seen this on mainnet and on a testnet node that's starting to leak threads.
In any case, debugging the process reveals plenty of stuck threads that correspond to the open file handles:
With backtrace as follows:
Threads seem to be stuck trying to obtain a read lock on the header PMMR:

- Chain::process_block_header, thread peer_read: stuck waiting for an exclusive lock on either header_pmmr or txhashset.
- Thread peer_write: appears to be parked and waiting. I think this is the only instance of a write lock on the header PMMR, so there is a good possibility this is what's blocking all the other threads.
- chain::Chain::get_header_by_height, thread compactor: also appears to be awaiting a read lock on header_pmmr.
- Chain::get_locator_hashes, thread sync: also appears to be waiting for a write lock on header_pmmr.
- NetToChainAdapter, thread peer_read: waiting for a read lock on header_pmmr.
- Also one instance of an exclusive lock waiting for all readers to complete: grin_chain::chain::Chain::process_block_header.
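The pile-up described above is the classic reader/writer lock pattern: one held (or pending) write lock stalls every new reader, and those readers in turn hold connections and file handles open. A toy sketch using std::sync::RwLock (grin guards header_pmmr with its own lock wrapper, so this is illustrative only, and `header_pmmr` here is just a stand-in value):

```rust
use std::sync::RwLock;

fn main() {
    // Stand-in for the lock guarding the header PMMR.
    let header_pmmr = RwLock::new(0u64);

    // Many readers can share the lock at once...
    let r1 = header_pmmr.read().unwrap();
    let r2 = header_pmmr.read().unwrap();

    // ...but while any reader is active, a writer cannot get in.
    assert!(header_pmmr.try_write().is_err());
    drop(r1);
    drop(r2);

    // Conversely, once a writer holds the lock, every new reader blocks,
    // which matches the pile-up of peer_read/compactor/sync threads above.
    let w = header_pmmr.write().unwrap();
    assert!(header_pmmr.try_read().is_err());
    drop(w);

    // With all guards dropped, the lock is free again.
    assert!(header_pmmr.try_write().is_ok());
}
```

If the write-locking thread never completes (or deadlocks against another lock like txhashset), every subsequent reader parks forever, and the process only fails once the fd limit is hit.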
In normal operation, when a node falls 5 blocks behind, header sync kicks in, followed by a message that synchronization was successful.
20220824 12:49:20.586 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 1965578938893570, peer_difficulty 1965580349437861, threshold 872616680 (last 5 blocks), enabling sync
20220824 12:49:20.590 INFO grin_servers::common::adapters - Received 8 block headers from 172.26.14.223:3414
20220824 12:49:30.595 INFO grin_servers::common::adapters - Received 10 block headers from 172.26.12.78:3414
20220824 12:49:31.898 INFO grin_servers::grin::sync::syncer - synchronized at 1965580705767519 @ 1889250 [000379751ed9]
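Plugging the numbers from that log line into the trigger condition it implies: sync is enabled because the peer's total difficulty exceeds ours by more than the work of the last 5 blocks (the "threshold"). A hedged sketch of that check (`sync_needed` is an illustrative name, not grin's actual function):

```rust
// Sync is assumed to kick in when the difficulty gap to the best peer
// exceeds the threshold (work of the last 5 blocks), per the log line.
fn sync_needed(total_difficulty: u64, peer_difficulty: u64, threshold: u64) -> bool {
    peer_difficulty.saturating_sub(total_difficulty) > threshold
}

fn main() {
    // Values taken from the log entry above:
    let local = 1_965_578_938_893_570u64;
    let peer = 1_965_580_349_437_861u64;
    let threshold = 872_616_680u64; // last 5 blocks

    // Gap is 1_410_544_291, above the threshold, so sync is enabled,
    // matching the "enabling sync" message in the log.
    println!("difficulty gap: {}", peer - local);
    assert!(sync_needed(local, peer, threshold));
}
```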
On nodes where the issue occurs, we see sync being enabled, followed by a 'no outbound peers, considering inbound' message, and then nothing until we start seeing 'too many open files' errors. This likely indicates that header_pmmr is write-locked in the sync thread, which is then unable to complete header syncing.
same issue, more log data:
20230215 16:09:58.375 ERROR grin_util::logger -
thread 'peer_connect' panicked at 'clone conn for reader failed: Os { code: 24, kind: Uncategorized, message: "Too many open files" }': p2p/src/conn.rs:224
0: grin_util::logger::send_panic_to_log::{{closure}}
1: std::panicking::rust_panic_with_hook
at /rustc/abc...abc/library/std/src/panicking.rs:610
2: std::panicking::begin_panic_handler::{{closure}}
at /rustc/abc...abc/library/std/src/panicking.rs:502
3: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/abc...abc/library/std/src/sys_common/backtrace.rs:139
4: rust_begin_unwind
at /rustc/abc...abc/library/std/src/panicking.rs:498
5: core::panicking::panic_fmt
at /rustc/abc...abc/library/core/src/panicking.rs:116
6: core::result::unwrap_failed
at /rustc/abc...abc/library/core/src/result.rs:1690
7: grin_p2p::conn::listen
8: grin_p2p::peer::Peer::new
9: grin_p2p::peer::Peer::connect
10: grin_p2p::serv::Server::connect
11: std::sys_common::backtrace::__rust_begin_short_backtrace
12: core::ops::function::FnOnce::call_once{{vtable.shim}}
13: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/abc...abc/library/alloc/src/boxed.rs:1854
<alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
at /rustc/abc...abc/library/alloc/src/boxed.rs:1854
std::sys::unix::thread::Thread::new::thread_start
at /rustc/abc...abc/library/std/src/sys/unix/thread.rs:108
14: start_thread
15: clone
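For reference, OS error 24 in the panic above is EMFILE, the per-process file-descriptor limit being hit when the connection is cloned for its reader thread. A small sketch of how that error surfaces through std::io (illustrative only; the real failure site is the `try_clone` on the connection in p2p/src/conn.rs):

```rust
use std::io;

fn main() {
    // EMFILE (24) on Linux: "Too many open files". This is what the
    // clone-for-reader call returns once the fd limit is exhausted.
    let err = io::Error::from_raw_os_error(24);
    assert_eq!(err.raw_os_error(), Some(24));
    println!("{}", err); // e.g. "Too many open files (os error 24)"
}
```

Propagating this error instead of unwrapping would let the peer connection fail gracefully, though the underlying fix is to stop leaking the descriptors in the first place.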
@droid192 I believe the fix for it was the last commit in https://github.com/mimblewimble/grin/pull/3695 which is https://github.com/mimblewimble/grin/pull/3695/commits/3524b7021158da068e0f247a90c30c5f2eeb8eb6. Are you using the latest version https://github.com/mimblewimble/grin/releases/tag/v5.2.0-alpha.2 ?
updated 5.1 to 5.2.0-a2. So far so good.