mimblewimble / grin

Minimal implementation of the Mimblewimble protocol.
https://grin.mw/

file descriptor leak #3729

Open deevope opened 2 years ago

deevope commented 2 years ago
20220710 06:47:18.596 ERROR grin_util::logger - 
thread 'peer_connect' panicked at 'clone conn for reader failed: Os { code: 24, kind: Uncategorized, message: "Too many open files" }': p2p/src/conn.rs:224
   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: <unknown>
yeastplume commented 2 years ago

[screenshot: /proc/[pid]/fd listing]

This image shows file descriptors being left open in /proc/[pid]/fd, which, when counted, sit right at the user's FD limit.

To reproduce, I'd try running a node on a closed network and keep an eye on the number of socket descriptors in /proc/[pid]/fd, connecting and disconnecting peers while checking the count of file descriptors in that directory. Whatever fix is needed can then be tested the same way.
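A minimal sketch of such a watcher (not from this thread; it assumes Linux procfs and a made-up fd_watch helper name):

// fd_watch: print the open-descriptor count for a pid every 10 seconds,
// so a leak shows up as a steadily rising number while peers connect/disconnect.
use std::{env, fs, thread, time::Duration};

fn fd_count(pid: u32) -> std::io::Result<usize> {
    // Each entry in /proc/<pid>/fd is one open file descriptor.
    Ok(fs::read_dir(format!("/proc/{}/fd", pid))?.count())
}

fn main() -> std::io::Result<()> {
    // Usage: fd_watch <pid-of-grin-node>
    let pid: u32 = env::args().nth(1).expect("pid argument").parse().expect("numeric pid");
    loop {
        println!("open fds: {}", fd_count(pid)?);
        thread::sleep(Duration::from_secs(10));
    }
}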

I'd focus on trying to reproduce this in the pibd_impl branch, since it's due to be merged (and won't be merged until this is addressed).

s7500 commented 1 year ago

I got around to trying this on my own server and decided to follow the note above. However, I apparently don't fully understand how peer connections work. In the config I have the following settings:

[server.p2p_config]
host = "0.0.0.0"
port = 3414

But when I run ./grin client unban -p 0.0.0.0:3414 or ./grin client unban -p 127.0.0.1:3414, I get no new peers on my grin server.

I'm running the grin server on usernet.
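One plausible reading of why no peers appear, offered as a hedged aside: grin client unban only lifts a ban on the given peer address, it doesn't dial anyone, and host/port above are just the listen address. On a closed network the nodes usually have to be pointed at each other explicitly. If the stock grin-server.toml seeding keys apply here, that would look something like:

[server.p2p_config]
host = "0.0.0.0"
port = 3414
# Seed from an explicit list instead of DNS; the address below is an
# example placeholder for the other local node.
seeding_type = "List"
seeds = ["127.0.0.1:3414"]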

yeastplume commented 1 year ago

I've reproduced this after running a fresh node for a couple of weeks; we can see the file handles being left open:

[screenshot: open file handles in /proc/[pid]/fd]

The last message in the logs before this starts happening has to do with header sync being enabled. I've seen this on mainnet and on a testnet node that's starting to leak threads:

[screenshot: log excerpt around header sync being enabled]

In any case, debugging the process reveals plenty of stuck threads that correspond to the open file handles:

[screenshot: debugger view of stuck threads]

With backtrace as follows:

[screenshot: backtrace of a stuck thread]

yeastplume commented 1 year ago

Threads seem to be stuck trying to obtain a read lock of the header pmmr:

Chain::process_block_header, thread peer_read, stuck waiting for exclusive lock on either header_pmmr or txhashset

[screenshot: peer_read backtrace]

The peer_write thread appears to be parked and waiting. I think this is the only instance of a write lock on the header pmmr, so there's a good possibility this is what's blocking all the other threads:

[screenshot: peer_write backtrace]

chain::Chain::get_header_by_height, thread 'compactor', also appears to be awaiting a read lock on header_pmmr

[screenshot: compactor thread backtrace]

Chain::get_locator_hashes, thread sync, also appears to be waiting for a write lock on header_pmmr

[screenshot: sync thread backtrace]

NetToChainAdapter, thread peer_read, waiting for read lock on header_pmmr

[screenshot: peer_read backtrace]
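The convoy described above can be reduced to a self-contained sketch. This is illustrative only, not grin code; it assumes the parking_lot crate (which, if I recall correctly, grin's util locks re-export), whose task-fair RwLock makes new readers queue behind a waiting writer:

use parking_lot::RwLock;
use std::{sync::Arc, thread, time::Duration};

fn main() {
    // Stand-in for the header PMMR behind a RwLock.
    let header_pmmr = Arc::new(RwLock::new(0u64));

    // Reader 1: takes a read lock and never releases it (the stuck thread).
    let r1 = Arc::clone(&header_pmmr);
    thread::spawn(move || {
        let _guard = r1.read();
        thread::sleep(Duration::from_secs(3600));
    });
    thread::sleep(Duration::from_millis(100));

    // Writer: parks waiting for reader 1, like the peer_write thread above.
    let w = Arc::clone(&header_pmmr);
    thread::spawn(move || {
        *w.write() += 1; // blocks until every reader is gone
    });
    thread::sleep(Duration::from_millis(100));

    // Reader 2: would succeed on its own, but task-fairness queues it behind
    // the parked writer; this is the peer_read/compactor/sync situation.
    let r2 = Arc::clone(&header_pmmr);
    let reader2 = thread::spawn(move || {
        let _guard = r2.read();
    });

    thread::sleep(Duration::from_secs(1));
    // Reader 2 is still blocked: one stuck reader plus one waiting writer
    // strands every later thread that touches the lock.
    assert!(!reader2.is_finished());
    println!("reader 2 stranded behind the parked writer");
}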

yeastplume commented 1 year ago

Also, one instance of an exclusive lock waiting for all readers to complete, in grin_chain::chain::Chain::process_block_header:

[screenshot: process_block_header backtrace]

yeastplume commented 1 year ago

In normal operation, when a node falls 5 blocks behind, header sync kicks off, followed by a message that synchronization was successful:

20220824 12:49:20.586 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 1965578938893570, peer_difficulty 1965580349437861, threshold 872616680 (last 5 blocks), enabling sync
20220824 12:49:20.590 INFO grin_servers::common::adapters - Received 8 block headers from 172.26.14.223:3414
20220824 12:49:30.595 INFO grin_servers::common::adapters - Received 10 block headers from 172.26.12.78:3414
20220824 12:49:31.898 INFO grin_servers::grin::sync::syncer - synchronized at 1965580705767519 @ 1889250 [000379751ed9]

On nodes where the issue occurs, we see sync being enabled, followed by a 'no outbound peers, considering inbound' message, and then nothing until the 'too many open files' errors start. This would likely indicate that the header_pmmr is write-locked in the sync thread, which is then unable to complete header syncing.
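Reading the 12:49:20.586 log line above: sync is enabled because the peer's total difficulty exceeds ours by more than the threshold (roughly the work of the last 5 blocks). A hypothetical restatement of that check with the logged values plugged in (the names here are illustrative, not grin's actual API):

// Illustrative only: sync kicks off once a peer is more than ~5 blocks'
// worth of work ahead of us.
fn should_enable_header_sync(total_difficulty: u64, peer_difficulty: u64, threshold: u64) -> bool {
    peer_difficulty > total_difficulty.saturating_add(threshold)
}

fn main() {
    // Values from the log line above.
    let total = 1_965_578_938_893_570_u64;
    let peer = 1_965_580_349_437_861_u64;
    let threshold = 872_616_680_u64; // work of the last 5 blocks
    // The gap is 1_410_544_291, which exceeds the threshold: enable sync.
    assert!(should_enable_header_sync(total, peer, threshold));
}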

droid192 commented 1 year ago

same issue, more log data:

20230215 16:09:58.375 ERROR grin_util::logger - 
thread 'peer_connect' panicked at 'clone conn for reader failed: Os { code: 24, kind: Uncategorized, message: "Too many open files" }': p2p/src/conn.rs:224
   0: grin_util::logger::send_panic_to_log::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at /rustc/abc...abc/library/std/src/panicking.rs:610
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/abc...abc/library/std/src/panicking.rs:502
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/abc...abc/library/std/src/sys_common/backtrace.rs:139
   4: rust_begin_unwind
             at /rustc/abc...abc/library/std/src/panicking.rs:498
   5: core::panicking::panic_fmt
             at /rustc/abc...abc/library/core/src/panicking.rs:116
   6: core::result::unwrap_failed
             at /rustc/abc...abc/library/core/src/result.rs:1690
   7: grin_p2p::conn::listen
   8: grin_p2p::peer::Peer::new
   9: grin_p2p::peer::Peer::connect
  10: grin_p2p::serv::Server::connect
  11: std::sys_common::backtrace::__rust_begin_short_backtrace
  12: core::ops::function::FnOnce::call_once{{vtable.shim}}
  13: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/abc...abc/library/alloc/src/boxed.rs:1854
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/abc...abc/library/alloc/src/boxed.rs:1854
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/abc...abc/library/std/src/sys/unix/thread.rs:108
  14: start_thread
  15: clone
phyro commented 1 year ago

@droid192 I believe the fix for it was the last commit in https://github.com/mimblewimble/grin/pull/3695 which is https://github.com/mimblewimble/grin/pull/3695/commits/3524b7021158da068e0f247a90c30c5f2eeb8eb6. Are you using the latest version https://github.com/mimblewimble/grin/releases/tag/v5.2.0-alpha.2 ?

droid192 commented 1 year ago

Updated 5.1 to 5.2.0-alpha.2. So far so good.