Closed. frol closed this issue 5 months ago.
@chefsale @mhalambek @khorolets Do we observe such severe memory leaks on other nodes (RPCs, validators, indexers)?
cc @janewang
@frol I haven't been running a localnet node long enough to observe it, and I haven't heard any complaints.
No problems on other nodes observed.
This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Looking into it
Left a localnet node running since Jan 07.
After 3 days, the memory usage is indeed around 3GB:
Jan 10 16:14:14.918 INFO stats: # 467322 5drWgBQc4uKWbikazGSxJfsArDgvcRYxb3znxud1NiQM V/4 3/3/40 peers ⬇ 6.6kiB/s ⬆ 6.5kiB/s 1.70 bps 0 gas/s CPU: 19%, Mem: 3.2 GiB
According to memory_stats, the usage grew to 150GB:
Jan 10 16:06:22.801 WARN near_rust_allocator_proxy::allocator: Thread 1570375 reached new record of memory usage 131677MiB
0: <near_rust_allocator_proxy::allocator::MyAllocator<A> as core::alloc::global::GlobalAlloc>::alloc
1: cached::lru_list::LRUList<T>::with_capacity
2: near_client::sync::StateSync::new
3: near_client::client::Client::run_catchup
4: near_client::client_actor::ClientActor::catchup
5: <actix::utils::TimerFunc<A> as actix::fut::ActorFuture>::poll
6: <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll
7: tokio::runtime::task::harness::Harness<T,S>::poll
8: tokio::task::local::LocalSet::tick
9: tokio::macros::scoped_tls::ScopedKey<T>::set
10: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
11: tokio::macros::scoped_tls::ScopedKey<T>::set
12: tokio::runtime::basic_scheduler::BasicScheduler<P>::block_on
13: tokio::runtime::Runtime::block_on
14: neard::cli::NeardCmd::parse_and_run
15: std::sys_common::backtrace::__rust_begin_short_backtrace
16: std::rt::lang_start::{{closure}}
17: core::ops::function::impls::<impl core::ops::function::FnOnce<A> for &F>::call_once
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/core/src/ops/function.rs:259:13
std::panicking::try::do_call
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:403:40
std::panicking::try
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:367:19
std::panic::catch_unwind
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panic.rs:133:14
std::rt::lang_start_internal::{{closure}}
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/rt.rs:128:48
std::panicking::try::do_call
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:403:40
std::panicking::try
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panicking.rs:367:19
std::panic::catch_unwind
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/panic.rs:133:14
std::rt::lang_start_internal
at /rustc/f1edd0429582dd29cccacaf50fd134b05593bd9c/library/std/src/rt.rs:128:20
18: main
19: __libc_start_main
20: _start
added: 14
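For context on where that warning comes from: near_rust_allocator_proxy wraps the global allocator (frame 0 in the trace) and logs when a thread's tracked usage reaches a new record. Below is a minimal sketch of that general idea; the type names, the 100 MiB reporting step, and the per-thread bookkeeping details are assumptions for illustration, not the crate's actual implementation.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

// Tracking allocator sketch: count live bytes per thread and report whenever a
// thread crosses a new high-water mark (here in 100 MiB steps).
// Per-thread accounting is approximate: frees that happen on other threads are not matched.
struct TrackingAlloc;

thread_local! {
    static LIVE_BYTES: Cell<usize> = const { Cell::new(0) };
    static HIGH_WATER: Cell<usize> = const { Cell::new(0) };
    static REPORTING: Cell<bool> = const { Cell::new(false) }; // guards against re-entrant reporting
}

unsafe impl GlobalAlloc for TrackingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = System.alloc(layout);
        if !ptr.is_null() {
            let live = LIVE_BYTES.with(|b| {
                b.set(b.get() + layout.size());
                b.get()
            });
            let new_record = HIGH_WATER.with(|hw| {
                if live >= hw.get() + 100 * 1024 * 1024 {
                    hw.set(live);
                    true
                } else {
                    false
                }
            });
            // Skip reporting if we are already inside a report (eprintln! may allocate).
            if new_record && !REPORTING.with(|g| g.replace(true)) {
                eprintln!("thread reached new record of memory usage {}MiB", live / (1024 * 1024));
                REPORTING.with(|g| g.set(false));
            }
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        LIVE_BYTES.with(|b| b.set(b.get().saturating_sub(layout.size())));
        System.dealloc(ptr, layout);
    }
}

#[global_allocator]
static GLOBAL: TrackingAlloc = TrackingAlloc;

fn main() {
    // Allocate ~512 MiB in chunks so the tracking allocator reports a few new records.
    let mut blocks: Vec<Vec<u8>> = Vec::new();
    for _ in 0..4 {
        blocks.push(vec![0u8; 128 * 1024 * 1024]);
    }
    eprintln!("holding {} blocks", blocks.len());
}
```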
Seems to be a duplicate of https://github.com/near/nearcore/issues/3970
@nikurt My advice is to try the fix: https://github.com/near/nearcore/pull/5902
@pmnoxx Same issue with the fix. After 2 days, each node uses 2.3GB of resident memory:
Jan 13 11:21:25.641 INFO stats: # 308245 GHCj5bHaEKb5tDCXpQ5VacGbX2DoimNosDTukGJ7NifN V/4 3/3/40 peers ⬇ 7.0kiB/s ⬆ 7.1kiB/s 1.90 bps 0 gas/s CPU: 3%, Mem: 2.3 GiB
and 100GB of virtual memory:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1901133 nikurt 20 0 103.8g 2.3g 21304 S 3.0 15.0 596:00.68 neard
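For reference, top's VIRT and RES columns correspond to VmSize and VmRSS in /proc/&lt;pid&gt;/status, so a very large VIRT alongside a modest RES means most of the mapped address space is never touched. A minimal sketch (Linux-only; the helper name is hypothetical and not part of nearcore) that reads the same two numbers for the current process:

```rust
use std::fs;

/// Read VmSize (VIRT) and VmRSS (RES) in KiB from /proc/self/status (Linux only).
fn vm_size_and_rss_kib() -> Option<(u64, u64)> {
    let status = fs::read_to_string("/proc/self/status").ok()?;
    let field = |name: &str| -> Option<u64> {
        status
            .lines()
            .find(|line| line.starts_with(name))?
            .split_whitespace()
            .nth(1)?
            .parse()
            .ok()
    };
    Some((field("VmSize:")?, field("VmRSS:")?))
}

fn main() {
    if let Some((virt_kib, res_kib)) = vm_size_and_rss_kib() {
        // Same numbers that top reports in its VIRT and RES columns.
        println!(
            "VIRT: {:.1} GiB, RES: {:.1} GiB",
            virt_kib as f64 / (1024.0 * 1024.0),
            res_kib as f64 / (1024.0 * 1024.0)
        );
    }
}
```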
Confirmed that I'm running a binary with the fix applied:
% cat /proc/1901133/cmdline
./target/release/neard --home /home/nikurt/.near/localnet/node0 run
% realpath /proc/1901133/cwd
/home/nikurt/nearcore-leak
nikurt@nikurt-1 Thu 11:27:49 ~/nearcore-leak % git log -1
commit 5b2db783e3d7a73b63be35c935c967779328b6b9
Author: Piotr Mikulski <piotr@near.org>
Date: Sun Dec 19 05:12:57 2021 -0800
Fix memory leak in `near-network` affecting `indexer`
> 2: near_client::sync::StateSync::new
Hmm, looks like it may be related to https://github.com/near/nearcore/issues/3509. cc @mm-near @mzhangmzz @pmnoxx
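To illustrate the failure mode the StateSync::new frame and #3509 point at: if each catchup round creates a fresh state-sync object that pre-reserves sizeable caches and parks it in a map that is never pruned, memory grows with every new sync target. The sketch below is schematic, assuming that shape; the names (StateSyncLike, ClientLike, the run_catchup signature) are stand-ins rather than the actual nearcore types.

```rust
use std::collections::HashMap;

// Stand-in for a per-catchup sync object that pre-reserves sizeable buffers,
// loosely mirroring StateSync::new building an LRU list with_capacity.
struct StateSyncLike {
    #[allow(dead_code)]
    cache: Vec<[u8; 1024]>, // never touched, so it mostly shows up as virtual memory
}

impl StateSyncLike {
    fn new() -> Self {
        // Roughly 10 MB of address space reserved per instance.
        StateSyncLike { cache: Vec::with_capacity(10_000) }
    }
}

#[derive(Default)]
struct ClientLike {
    // Keyed by the sync target (e.g. an epoch sync hash). If entries are only
    // inserted and never removed once catchup finishes, this map leaks.
    catchup_state_syncs: HashMap<u64, StateSyncLike>,
}

impl ClientLike {
    fn run_catchup(&mut self, sync_hash: u64) {
        // Leaky pattern: a new sync object per unseen hash, retained forever.
        self.catchup_state_syncs
            .entry(sync_hash)
            .or_insert_with(StateSyncLike::new);
        // The fix is to drop the entry when catchup for `sync_hash` completes,
        // e.g. self.catchup_state_syncs.remove(&sync_hash);
    }
}

fn main() {
    let mut client = ClientLike::default();
    // Every new sync hash permanently retains another pre-reserved cache,
    // so memory usage grows with the number of catchup rounds.
    for sync_hash in 0..100 {
        client.run_catchup(sync_hash);
    }
    println!("retained {} state syncs", client.catchup_state_syncs.len());
}
```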
This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
Sorry, I just saw this. Will take a look this week.
Describe the bug
![image](https://user-images.githubusercontent.com/304265/127002515-3297c5aa-d698-48e6-b2ec-b1f74eee4f6b.png)
To Reproduce
According to our incident monitoring, the node has been going down every week since April (when monitoring for CI nodes was configured).
![image](https://user-images.githubusercontent.com/304265/127003086-1fb1728e-10b0-404d-8f86-3a8f9492c507.png)
I cannot say for sure whether all of those were caused by OOM, but at least two of the events definitely were OOM kills:
![image](https://user-images.githubusercontent.com/304265/127003839-06a30bce-65c1-4572-8d20-b425a922331b.png)
Also, the node currently consumes 4.3GB of RAM, so it will get killed in a few days.
Expected behavior
The node should not leak more than 100MB of RAM per day.
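One way to check a budget like this is to sample the node's resident memory periodically and extrapolate the growth to MB per day. A rough sketch of such a watcher (Linux-only, hypothetical helper, not part of nearcore), pointed at neard's pid:

```rust
use std::env;
use std::fs;
use std::thread;
use std::time::Duration;

/// Resident set size (bytes) of a given pid, read from /proc/<pid>/statm (Linux only).
fn rss_bytes(pid: u32) -> Option<u64> {
    let statm = fs::read_to_string(format!("/proc/{}/statm", pid)).ok()?;
    let resident_pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    Some(resident_pages * 4096) // assumes 4 KiB pages, typical on x86-64 Linux
}

fn main() {
    // Usage: rss-watch <pid-of-neard>
    let pid: u32 = env::args()
        .nth(1)
        .and_then(|arg| arg.parse().ok())
        .expect("pass the pid to watch");
    let interval = Duration::from_secs(600);
    let mut prev = rss_bytes(pid).expect("process not found");
    loop {
        thread::sleep(interval);
        let now = match rss_bytes(pid) {
            Some(bytes) => bytes,
            None => break, // process exited
        };
        // Extrapolate the growth over one interval to MB/day and compare it
        // against a budget such as 100 MB/day.
        let per_day_mb = (now as f64 - prev as f64) / 1e6 * (86_400.0 / interval.as_secs_f64());
        println!("RSS {:.0} MB, growth ≈ {:.1} MB/day", now as f64 / 1e6, per_day_mb);
        prev = now;
    }
}
```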
Version (please complete the following information):