rwdavies / QUILT

GNU General Public License v3.0

Multicore (in)efficiency #22

Open moravveji opened 1 year ago

moravveji commented 1 year ago

Dear developers,

I am an HPC sysadmin, and together with a QUILT user we are observing strange multi-core behavior of the software on a cluster with 128-core AMD nodes. There, QUILT is installed and launched from a conda environment. We block a full node with about 120GB of memory and try to impute 60 samples, but using different core counts, controlled by the ncore= command line argument.

Apparently (please correct me if I am mistaken), QUILT spawns multiple processes (not threads), so I expected each process to be pinned to one physical core on the system (hyperthreading is disabled on the target machine).

What we observe on the compute node is that all "active" processes are packed onto the first physical core of the node, with inefficient CPU activity/usage. As a result, the runtime of the test example does not scale with the number of cores used, i.e. ncore.

The attached screenshot shows the output of the htop command.

We would like to know whether this is standard behavior, or whether there are options for launching a job that we are not yet aware of.

Any feedback is welcome.

Kind regards, Ehsan & @sarabecelaere

[screenshot: QUILT-8core (htop output)]
Zilong-Li commented 1 year ago

Hi,

Regarding the parallelization, I think QUILT and STITCH use a basic mclapply call, which forks multiple processes, with each core handling several samples. If the number of samples (60) is not a multiple of nCores (8), the cores with more samples will be the bottleneck. However, the behavior in the screenshot you showed does look unexpected to me.
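As a rough illustration (a minimal sketch, not QUILT's actual code), this shows how mclapply forks worker processes and how 60 elements split unevenly over 8 cores:

```r
# Minimal sketch (not QUILT's actual code): fork 8 workers over 60 dummy
# "samples" and count how many elements each forked process handles.
library(parallel)
res <- mclapply(1:60, function(i) Sys.getpid(), mc.cores = 8)
table(unlist(res))   # with 60 %% 8 != 0, some workers get one extra element
```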

moravveji commented 1 year ago

Thanks @Zilong-Li for your swift reaction to my post. I took a look at mclapply in the R docs. Apparently (I guess), QUILT does not expose all of mclapply's runtime arguments to the user, so it is not possible to control the process forking by spawning/pinning the processes across multiple cores (rather than letting them all pile onto a single core, which is what happens by default, at least in our case). Please correct me if I am wrong. This leads to a drastic performance drop on the target cluster we use. I am not sure how R does process pinning under the hood, or whether there are ways to control it from the user side.
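(One way to check the pinning question from the R side, assuming a Linux node, is parallel::mcaffinity(), which reports the set of CPUs a forked worker is allowed to run on; a small diagnostic sketch, not QUILT code:)

```r
# Diagnostic sketch: report which CPUs each forked worker may run on.
# mcaffinity() works on Linux and returns NULL where affinity is unsupported.
library(parallel)
aff <- mclapply(1:8, function(i) mcaffinity(), mc.cores = 8)
str(aff)   # one CPU vector per worker; a single repeated CPU would explain the pile-up
```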

We will repeat the tests keeping in mind that the number of samples should be a multiple of nCores, for more even load balancing. We can share our findings here if that is useful to other users.

In case you'd like to reproduce our results, please let us know.

rwdavies commented 1 year ago

Hi,

Minor point: the argument is nCores, but R's partial argument matching should sort that out.

This is an interesting one. QUILT (and STITCH) indeed use pretty straightforward parallel::mclapply functionality.

So for instance, you could test things out more generally by trying the following from the command line, varying mc.cores = 2 (which QUILT sets using nCores) and 1:2 (choosing a number, here 2, and making sure you put it in both spots).

R -e "parallel::mclapply(1:2, mc.cores = 2, function(x) { while(TRUE) { a = mean(runif(1e6))}})"

Hopefully this should replicate behaviour like what you are seeing. If you figure out additional arguments to mclapply that are beneficial and think should be exposed, let me know.
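For what it's worth, mclapply does have an affinity.list argument that can pin each job to a given CPU (it requires mc.preschedule = FALSE, and CPUs are numbered from 1). A minimal sketch of the kind of thing that could be exposed, not something QUILT currently does:

```r
# Sketch: pin each of 4 jobs to its own CPU via mclapply's affinity.list.
library(parallel)
res <- mclapply(1:4,
                function(x) mean(runif(1e6)),
                mc.cores = 4,
                mc.preschedule = FALSE,       # required for affinity.list
                affinity.list = as.list(1:4)) # job i allowed only on CPU i
```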

As a more minor point, QUILT imputes each sample independently (unlike STITCH), so in general on an HPC I would recommend splitting samples into small batches and running with nCores = 1, then combining the results at the end.
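A minimal sketch of that batching idea, with hypothetical sample names and an arbitrary batch size of 5 (the file naming here is illustrative only):

```r
# Sketch: split 60 hypothetical sample names into batches of 5, one file per
# batch, each intended to be run as its own nCores = 1 QUILT job.
samples <- sprintf("sample%03d", 1:60)
batches <- split(samples, ceiling(seq_along(samples) / 5))
for (b in names(batches)) {
  writeLines(batches[[b]], sprintf("samples_batch_%s.txt", b))
}
```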

Best, Robbie

moravveji commented 1 year ago

Thanks @rwdavies for your comments. As a matter of fact, mclapply is not an efficient/scalable tool for concurrent tasks. Judging by the scaling results on the official CRAN pages, I also concluded that the best way out of this is using nCores=1, as you proposed.

With that, I guess we can close this thread, unless someone else still has a remark.

rwdavies commented 1 year ago

I don't think the reference you're citing is particularly definitive. But I'll admit I'm having trouble finding something exhaustive.

It usually works fine in my hands, e.g. setting nCores = 4 on one of our standard HPC machines makes things run in ~1/4 of the time for large sample numbers, but at the cost of being harder to schedule. I just tried using nCores = 3 with the test data on the website and indeed I see roughly a threefold speedup. This is on an Intel(R) Xeon(R) Gold 6126 CPU @ 2.60GHz.

[image: screenshot of timing results]