near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0

Very high idle CPU usage due to peer Arbiter threads #3854

Closed: leoluk closed this issue 3 years ago

leoluk commented 3 years ago

Near starts one Arbiter per peer, which means a dedicated event loop running in a dedicated thread:

https://github.com/near/nearcore/blob/2dbae5d83038b1724f45f19bcbe4baa58ca8e391/chain/network/src/peer_manager.rs#L426
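For illustration, the pattern looks roughly like this; this is only a minimal sketch assuming the actix 0.9-era Arbiter::new / start_in_arbiter API, with Peer as a stand-in for the real actor, not the actual nearcore code:

use actix::{Actor, Addr, Arbiter, Context};

// Placeholder standing in for nearcore's actual Peer actor.
struct Peer;

impl Actor for Peer {
    type Context = Context<Self>;
}

fn spawn_peer() -> Addr<Peer> {
    // Arbiter::new() spawns a dedicated OS thread running its own
    // single-threaded event loop; the actor created inside runs entirely
    // on that thread. One such thread exists per connected peer.
    let arbiter = Arbiter::new();
    Peer::start_in_arbiter(&arbiter, |_ctx| Peer)
}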

Each of these threads consumes a large amount of CPU time even if its peer is idle. Most of this CPU time is spent in the following call stack:

Thread 53 (Thread 0x7f4505400700 (LWP 250063)):
#0  0x00007f452a371001 in __memset_avx2_erms () from /lib64/libc.so.6
#1  0x000055e641523ed9 in <actix::stream::ActorStream<A,M,S> as actix::fut::ActorFuture>::poll ()
#2  0x000055e6415216d8 in <actix::contextimpl::ContextFut<A,C> as core::future::future::Future>::poll ()
#3  0x000055e64151f061 in tokio::runtime::task::raw::poll ()
#4  0x000055e6419e23ae in tokio::task::local::LocalSet::tick ()
#5  0x000055e640f55c89 in <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll ()
#6  0x000055e640f52284 in actix_rt::runtime::Runtime::block_on ()
#7  0x000055e640f5051b in std::sys_common::backtrace::__rust_begin_short_backtrace ()
#8  0x000055e640f4fef6 in core::ops::function::FnOnce::call_once{{vtable-shim}} ()
#9  0x000055e64199feda in <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/a74d1862d4d87a56244958416fd05976c58ca1a8/src/liballoc/boxed.rs:1034
#10 <alloc::boxed::Box<F> as core::ops::function::FnOnce<A>>::call_once () at /rustc/a74d1862d4d87a56244958416fd05976c58ca1a8/src/liballoc/boxed.rs:1034
#11 std::sys::unix::thread::Thread::new::thread_start () at src/libstd/sys/unix/thread.rs:87
#12 0x00007f452a7f52de in start_thread () from /lib64/libpthread.so.0
#13 0x00007f452a30ee83 in clone () from /lib64/libc.so.6

...making up a large percentage of the total CPU usage:

[flame graph screenshot]

It appears that <actix::stream::ActorStream<A,M,S> as actix::fut::ActorFuture>::poll() is a busy poll. This code path was touched in actix 0.10, and the inner loop is gone: https://github.com/actix/actix/issues/360

Perhaps that helps? But I guess we'd just busy poll here instead:

https://github.com/actix/actix/blob/fdaa5d50e25ffc892f5c1c6fcc51097796debecf/src/contextimpl.rs#L345

Related discussion: https://github.com/near/nearcore/issues/2754

To my understanding, Actix does not support M:N scheduling with a thread pool the way the Go runtime (or Tokio) does; instead, dedicated OS threads are used to prevent CPU-bound actors from blocking each other. With a small number of long-lived peers, the per-thread overhead shouldn't really matter, but the event loop implementation in the current Actix release appears to be way too expensive to run once per peer.

pmnoxx commented 3 years ago

@bowenwang1996 I'll take a look.

bowenwang1996 commented 3 years ago

@leoluk Actually, I don't think peer threads are burning CPU when there are no messages. I just took a look at our validators on testnet. They have about 35 peers (same as nodes on mainnet), and the CPU consumption is around 30% of one core, which is much lower than what is observed on mainnet.

pmnoxx commented 3 years ago

> Near starts one Arbiter per peer, which means a dedicated event loop running in a dedicated thread. Each of these threads consumes a large amount of CPU time even if its peer is idle. [...] With a small number of long-lived peers, the per-thread overhead shouldn't really matter, but the event loop implementation in the current Actix release appears to be way too expensive to run once per peer.

To explain why poll() isn't responsible for the CPU usage even though we get the results seen above, I need to explain how flame graphs are created.

Those flame graphs are computed by random sampling. At random intervals, you freeze the whole process and capture a stack trace of each thread. Then you aggregate the samples, which produces a tree that gets displayed as a flame graph.

Each thread runs a loop; when it has nothing to do, it sleeps in poll. This is system specific; for example, on Linux the epoll system call would be used internally (https://man7.org/linux/man-pages/man7/epoll.7.html). Epoll allows you to sleep on a file descriptor until it becomes ready; it can be any pollable descriptor, such as a network socket.
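For illustration, a minimal sketch of that kind of sleeping (Linux only, assuming the libc crate): a thread blocked in epoll_wait burns no CPU, yet a wall-clock stack sampler will still attribute essentially all of that thread's samples to the poll frame, because that is where the thread sits whenever it is sampled.

use std::os::unix::io::AsRawFd;

fn main() -> std::io::Result<()> {
    let listener = std::net::TcpListener::bind("127.0.0.1:0")?;
    unsafe {
        let epfd = libc::epoll_create1(0);
        let mut ev: libc::epoll_event = std::mem::zeroed();
        ev.events = libc::EPOLLIN as u32;
        libc::epoll_ctl(epfd, libc::EPOLL_CTL_ADD, listener.as_raw_fd(), &mut ev);

        let mut events: [libc::epoll_event; 8] = std::mem::zeroed();
        // Sleeps in the kernel until the socket becomes readable; with a
        // timeout of -1, no CPU is consumed while waiting.
        libc::epoll_wait(epfd, events.as_mut_ptr(), events.len() as i32, -1);
        libc::close(epfd);
    }
    Ok(())
}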

If you sample what the thread is doing at a random time, you will see that its stack points into the poll method. This kind of sampling doesn't tell you how much processor time each function uses; it tells you the percentage of time spent in each function. Those are not the same thing. I looked at our Peer stats, and I see that peers spend over 98% of their time idling.

In our example, we have up to 40 peers, each running on its own thread, plus, say, 5-10 other threads. Peers spend 98% of their time idling, so it's expected that the ::poll() method represents over 90% of the time spent by threads. That's not CPU time, and that's fine.

To measure the time accurately, we could modify the tool used for creating flame graphs to exclude calls to __memset_avx2_erms() from /lib64/libc.so.6.
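Alternatively, actual per-thread CPU time can be read directly from procfs instead of being inferred from samples. A rough Linux-only sketch (standard library only, field positions per proc(5)):

use std::fs;

// Returns (tid, utime + stime in clock ticks) for each thread of `pid`.
// utime and stime are fields 14 and 15 of /proc/<pid>/task/<tid>/stat;
// we skip past the ")" that closes the comm field before splitting,
// since the thread name may itself contain spaces.
fn thread_cpu_ticks(pid: u32) -> std::io::Result<Vec<(u32, u64)>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(format!("/proc/{}/task", pid))? {
        let entry = entry?;
        let tid: u32 = entry.file_name().to_string_lossy().parse().unwrap_or(0);
        let stat = fs::read_to_string(entry.path().join("stat"))?;
        let rest = &stat[stat.rfind(')').map_or(0, |i| i + 1)..];
        let fields: Vec<&str> = rest.split_whitespace().collect();
        // After pid and comm, utime and stime sit at indices 11 and 12.
        let utime: u64 = fields.get(11).and_then(|s| s.parse().ok()).unwrap_or(0);
        let stime: u64 = fields.get(12).and_then(|s| s.parse().ok()).unwrap_or(0);
        out.push((tid, utime + stime));
    }
    Ok(out)
}

Taking two such snapshots a few seconds apart and diffing them gives per-thread CPU usage, which is the number a sampled flame graph by itself cannot provide.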

leoluk commented 3 years ago

I'm very well aware of how a sampling profiler works and successfully replicated the results by manually sampling the stack a few times using gdb (that's how the stacktrace above is generated).

> Each thread runs a loop; when it has nothing to do, it sleeps in poll. This is system specific; for example, on Linux the epoll system call would be used internally (man7.org/linux/man-pages/man7/epoll.7.html).

Spending its time waiting in epoll is what it should be doing, but it's not: only a small fraction of the time is spent in epoll/syscalls; instead, it spends a lot of time burning through CPU cycles in __memset_avx2_erms.

Here's another flamegraph from a different thread that illustrates it better:

[flame graph screenshot]

> This kind of sampling doesn't tell you how much processor time each function uses; it tells you the percentage of time spent in each function. Those are not the same thing.

The problem is that each of these arbiter threads uses a lot of absolute processor time.

leoluk commented 3 years ago

After changing max_peers to 10, the number of actix-rt:worker threads and the overall CPU usage have gone down.

The remaining threads still use a lot of CPU each.

pmnoxx commented 3 years ago

> Spending its time waiting in epoll is what it should be doing, but it's not: only a small fraction of the time is spent in epoll/syscalls; instead, it spends a lot of time burning through CPU cycles in __memset_avx2_erms. [...] The problem is that each of these arbiter threads uses a lot of absolute processor time.

You are right, I'll investigate. Which operating system are you using?

leoluk commented 3 years ago

> You are right, I'll investigate. Which operating system are you using?

CentOS 8. Custom-built --release binary. i7-6700 in case CPU generation matters:

flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds

pmnoxx commented 3 years ago

@leoluk I figured out what the issue is. It turns out that tokio zero-initializes the whole read buffer before doing a poll read. That buffer can have a size of up to 12 MB, so even if you only want to read a few bytes, you still do work proportional to the size of the whole buffer. The buffer grows and stays big, which explains why we see the issue on mainnet but not on testnet.

tokio-0.2.23/src/io/async_read.rs:121:

fn poll_read_buf<B: BufMut>(
    self: Pin<&mut Self>,
    cx: &mut Context<'_>,
    buf: &mut B,
) -> Poll<io::Result<usize>>
where
    Self: Sized,
{
    if !buf.has_remaining_mut() {
        return Poll::Ready(Ok(0));
    }
    unsafe {
        let n = {
            let b = buf.bytes_mut();
            // The default implementation of this call zeroes the entire
            // remaining capacity of the buffer (hence the __memset_avx2_erms
            // frames) on every poll, even when only a few bytes are read.
            self.prepare_uninitialized_buffer(b);
            // Convert to `&mut [u8]`
            let b = &mut *(b as *mut [MaybeUninit<u8>] as *mut [u8]);
            let n = ready!(self.poll_read(cx, b))?;
            assert!(n <= b.len(), "Bad AsyncRead implementation, more bytes were reported as read than the buffer can hold");
            n
        };
        buf.advance_mut(n);
        Poll::Ready(Ok(n))
    }
}

We are using tokio 0.2.23. The issue is fixed in tokio 0.3.0. Actix 0.9.0 uses tokio 0.2.6, while Actix 0.10.0 uses tokio 1.

So upgrading actix from 0.9.0 to 0.10.0 should fix the issue. By commenting out self.prepare_uninitialized_buffer(b); I noticed an improvement in CPU usage.
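For reference, a rough illustration of why newer tokio avoids this memset (based on the tokio 1.x API, not nearcore code): poll_read now takes a ReadBuf, which tracks how much of the underlying buffer is initialized and filled, so nothing forces zeroing the entire remaining capacity before every read the way prepare_uninitialized_buffer did.

use std::pin::Pin;
use std::task::{Context, Poll};
use tokio::io::{AsyncRead, ReadBuf};

// Poll a reader once into `storage` and report how many bytes were filled.
fn read_once<R: AsyncRead + Unpin>(
    reader: &mut R,
    cx: &mut Context<'_>,
    storage: &mut [u8],
) -> Poll<std::io::Result<usize>> {
    // ReadBuf wraps the storage and tracks the filled/initialized regions;
    // the reader writes through it without the caller having to memset the
    // whole buffer on every call.
    let mut buf = ReadBuf::new(storage);
    match Pin::new(reader).poll_read(cx, &mut buf) {
        Poll::Ready(Ok(())) => Poll::Ready(Ok(buf.filled().len())),
        Poll::Ready(Err(e)) => Poll::Ready(Err(e)),
        Poll::Pending => Poll::Pending,
    }
}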

leoluk commented 3 years ago

That would explain all the time spent in memset! :D

Nice catch - let me know if we can help test a fix.

pmnoxx commented 3 years ago

I included a hotfix here: https://github.com/near/nearcore/pull/3858

pmnoxx commented 3 years ago

> That would explain all the time spent in memset! :D Nice catch - let me know if we can help test a fix.

Can you test my hotfix to see if it improves CPU usage?

leoluk commented 3 years ago

Is it safe to cherry pick the full PR for mainnet?

bowenwang1996 commented 3 years ago

@leoluk I suggest that you wait until we land #3869

pmnoxx commented 3 years ago

@leoluk We merged the fix to master. Can you check stats again?

leoluk commented 3 years ago

Is master safe to deploy on mainnet?

pmnoxx commented 3 years ago

@leoluk It is a big change; we upgraded tokio from 0.2 to 1.0, as well as actix. If possible, I would roll out the change slowly.

bowenwang1996 commented 3 years ago

@leoluk you can test it with this pre-release https://github.com/near/nearcore/releases/tag/1.18.0-rc.3. From our testing it completely addresses the issue.

bowenwang1996 commented 3 years ago

Fixed in the 1.18.0 release

NeilZhy commented 4 months ago

@leoluk May I ask what tool you used for the function stack shown in the picture?

leoluk commented 4 months ago

That's just standard perf