near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Localnet node leaks 1GB of RAM per day #4570

Closed frol closed 5 months ago

frol commented 3 years ago

Describe the bug

![image](https://user-images.githubusercontent.com/304265/127002515-3297c5aa-d698-48e6-b2ec-b1f74eee4f6b.png)

To Reproduce

According to our incident monitoring, the node has been going down every week since April (when the monitoring for CI nodes was configured).

![image](https://user-images.githubusercontent.com/304265/127003086-1fb1728e-10b0-404d-8f86-3a8f9492c507.png)

I cannot say for sure whether all of those outages were caused by OOM, but at least two of the events were OOM kills:

![image](https://user-images.githubusercontent.com/304265/127003839-06a30bce-65c1-4572-8d20-b425a922331b.png)

Also, the node currently consumes 4.3GB of RAM, so it will get killed a few days from now.

Expected behavior

The node should not leak more than 100MB of RAM per day.

frol commented 3 years ago

@chefsale @mhalambek @khorolets Do we observe such severe memory leaks on other nodes (RPCs, validators, indexers)?

cc @janewang

khorolets commented 3 years ago

@frol I haven't run a localnet node long enough to observe it, and I haven't heard any complaints.

chefsale commented 3 years ago

No problems on other nodes observed.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

nikurt commented 2 years ago

Looking into it

nikurt commented 2 years ago

Left localnet running since Jan 07.

After 3 days, the memory usage is indeed 3GB

Jan 10 16:14:14.918  INFO stats: #  467322 5drWgBQc4uKWbikazGSxJfsArDgvcRYxb3znxud1NiQM V/4  3/3/40 peers ⬇ 6.6kiB/s ⬆ 6.5kiB/s 1.70 bps 0 gas/s CPU: 19%, Mem: 3.2 GiB
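
To put a number on the growth, the `Mem:` field can be pulled out of `stats` lines like the one above and turned into a per-day rate. The following is only a rough sketch: the helper names are hypothetical, the only unit confirmed by the logs in this thread is `GiB` (the other units are assumptions), and the baseline of roughly 0 MiB at startup is assumed rather than measured.

```rust
/// Extract the `Mem:` value from a neard `stats` log line, converted to MiB.
/// Returns None if the field is missing or the unit is not recognised.
/// Only `GiB` appears in the logs above; `MiB` and `KiB`/`kiB` are assumptions.
fn parse_mem_mib(line: &str) -> Option<f64> {
    let rest = line.split("Mem: ").nth(1)?;
    let mut parts = rest.split_whitespace();
    let value: f64 = parts.next()?.parse().ok()?;
    let unit = parts.next()?;
    // `starts_with` tolerates trailing terminal colour codes after the unit.
    if unit.starts_with("GiB") {
        Some(value * 1024.0)
    } else if unit.starts_with("MiB") {
        Some(value)
    } else if unit.starts_with("KiB") || unit.starts_with("kiB") {
        Some(value / 1024.0)
    } else {
        None
    }
}

/// Average growth in MiB per day between two samples taken `days` apart.
fn growth_per_day(start_mib: f64, end_mib: f64, days: f64) -> f64 {
    (end_mib - start_mib) / days
}

fn main() {
    let line = "Jan 10 16:14:14.918  INFO stats: #  467322 5drWgBQc4uKWbikazGSxJfsArDgvcRYxb3znxud1NiQM V/4  3/3/40 peers 1.70 bps 0 gas/s CPU: 19%, Mem: 3.2 GiB";
    let mem = parse_mem_mib(line).expect("stats line should contain a Mem field");
    // Assumed baseline: the node started near 0 MiB three days earlier.
    let rate = growth_per_day(0.0, mem, 3.0);
    println!("{mem:.0} MiB now, roughly {rate:.0} MiB/day");
}
```

Comparing two real samples taken a known time apart would give a better estimate than the assumed zero baseline.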

According to memory_stats, the usage grew to 150GB:

Jan 10 16:06:22.801  WARN near_rust_allocator_proxy::allocator: Thread 1570375 reached new record of memory usage 131677MiB
   0: <near_rust_allocator_proxy::allocator::MyAllocator<A> as core::alloc::global::GlobalAlloc>::alloc
   1: cached::lru_list::LRUList<T>::with_capacity
   2: near_client::sync::StateSync::new
   3: near_client::client::Client::run_catchup
   4: near_client::client_actor::ClientActor::catchup
   5: <actix::utils::TimerFunc<A> as actix::fut::ActorFuture>::poll
   6: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll
   7: tokio::runtime::task::harness::Harness<T,S>::poll
   8: tokio::task::local::LocalSet::tick
   9: tokio::macros::scoped_tls::ScopedKey<T>::set
  10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
  11: tokio::macros::scoped_tls::ScopedKey<T>::set
  12: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on
  13: tokio::runtime::Runtime::block_on
  14: neard::cli::NeardCmd::parse_and_run
  15: std::sys_common::backtrace::__rust_begin_short_backtrace
  16: std::rt::lang_start::{{closure}}
  17: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/ops/function.rs:259:13
      std::panicking::try::do_call
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:403:40
      std::panicking::try
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:367:19
      std::panic::catch_unwind
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panic.rs:133:14
      std::rt::lang_start_internal::{{closure}}
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/rt.rs:128:48
      std::panicking::try::do_call
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:403:40
      std::panicking::try
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:367:19
      std::panic::catch_unwind
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panic.rs:133:14
      std::rt::lang_start_internal
             at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/rt.rs:128:20
  18: main
  19: __libc_start_main
  20: _start
 added: 14

nikurt commented 2 years ago

Seems to be a duplicate of https://github.com/near/nearcore/issues/3970

pmnoxx commented 2 years ago

@nikurt My advice is to try the fix: https://github.com/near/nearcore/pull/5902

nikurt commented 2 years ago

@pmnoxx Same issue with the fix. After 2 days, each node uses 2.3GB of resident memory:

Jan 13 11:21:25.641  INFO stats: #  308245 GHCj5bHaEKb5tDCXpQ5VacGbX2DoimNosDTukGJ7NifN V/4  3/3/40 peers ⬇ 7.0kiB/s ⬆ 7.1kiB/s 1.90 bps 0 gas/s CPU: 3%, Mem: 2.3 GiB    

and 100GB of virtual memory:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
1901133 nikurt    20   0  103.8g   2.3g  21304 S   3.0  15.0 596:00.68 neard
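
The same two numbers that `top` shows as VIRT and RES can be sampled programmatically on Linux from the `VmSize` and `VmRSS` fields of `/proc/<pid>/status`, which makes it easier to log both over time. A minimal sketch, with hypothetical helper names and reading the current process rather than neard:

```rust
use std::fs;

/// Virtual and resident sizes of a process, in kB as reported by the kernel.
#[derive(Debug, PartialEq)]
struct MemUsage {
    virt_kb: u64,
    rss_kb: u64,
}

/// Parse the `VmSize` and `VmRSS` fields out of `/proc/<pid>/status` text.
/// Returns None if either field is missing or not a number.
fn parse_status(status: &str) -> Option<MemUsage> {
    let field = |name: &str| -> Option<u64> {
        status
            .lines()
            .find_map(|l| l.strip_prefix(name))?
            .split_whitespace()
            .next()?
            .parse()
            .ok()
    };
    Some(MemUsage {
        virt_kb: field("VmSize:")?,
        rss_kb: field("VmRSS:")?,
    })
}

fn main() {
    // Inspects the current process; substitute a real PID to inspect neard.
    match fs::read_to_string("/proc/self/status") {
        Ok(text) => println!("{:?}", parse_status(&text)),
        Err(err) => eprintln!("could not read /proc/self/status: {err}"),
    }
}
```

Logging both values periodically shows whether resident memory keeps growing or only the virtual reservation does.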

I confirmed that I'm running a binary with the fix applied:

% cat /proc/1901133/cmdline 
./target/release/neard --home /home/nikurt/.near/localnet/node0 run
% realpath /proc/1901133/cwd
/home/nikurt/nearcore-leak
% git log -1
commit 5b2db783e3d7a73b63be35c935c967779328b6b9
Author: Piotr Mikulski <piotr@near.org>
Date:   Sun Dec 19 05:12:57 2021 -0800

    Fix memory leak in `near-network` affecting `indexer`

bowenwang1996 commented 2 years ago

> 2: near_client::sync::StateSync::new

Hmm, it looks like this may be related to https://github.com/near/nearcore/issues/3509. cc @mm-near @mzhangmzz @pmnoxx

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

mzhangmzz commented 2 years ago

Sorry, I just saw this. I will take a look this week.