near / nearcore

Reference client for NEAR Protocol
https://near.org
GNU General Public License v3.0
2.31k stars 613 forks source link

Flesh out concurrency and parallelism story #4882

Open matklad opened 2 years ago

matklad commented 2 years ago

There issue is prompted by https://github.com/near/nearcore/pull/4866#discussion_r716625127. There, it was observed that we essentially use two different CPU pool for CPU-intensitev tasks, which could lead to oversubscription. We should fix that particular issue, but I think it's a symptom of a bigger one. We don't have a well-understood model of concurrency to use in near.

So, as a deliverable for this issue, I suggest that we add "Concurrency" subsection to the cross-cutting concerns in our [architecture]((https://github.com/near/nearcore/blob/master/docs/architecture.md)) docs.

matklad commented 2 years ago

My thoughts on the topic:

We need to spell out what are the logical bits of concurrency (each shard, each peer we communicate with, JSON RPC), and what are CPU-heavy computations which we want to parallelise or pipeline (ibf construction in network, catch up in chain, compiling many contracts during apply chunks in transaction runtime, compiling individual contract in contract runtime).

Logical things should not be CPU heavy, so it should be fine to burn a thread per each one, and this I think is what we are doing. CPU heavy tasks though compete globally for the available CPU cores, and we shouldn't be spawning n cpus * n subsystems threads (which we are doing right now). An additional consideration here is that we want to artificially limit parallelism. We want validators to be usable with a handful of cores. If we take advantage of 64 CPUs, we might actually start to accidentally require 64 CPUs for validators.

In terms of the diff to the current code-base, I think we should standardize on rayon as the way to do CPU-heavy computations. In particular, we should be using rayons global, default thread pool (that avoids over subscription). We should set-up this pool with a limited number of threads in neard's main.

It also seems to be that we sometimes use actix arbiters (wrapper around threads) for CPU-heavy work. I am not sure, but I think that's what we did for ibf computation. If that's the case, we probably should replace that with calls to rayon::spawn. Rather than greedily a whole new thread just for yourself, it makes sense to responsibly use community's shared thread pool.

matklad commented 2 years ago

cc @EgorKulikov @mzhangmzz @pmnoxx @bowenwang1996

bowenwang1996 commented 2 years ago

Thanks for creating this issue @matklad ! I can work on documenting this

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity in the last 2 months. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.