Slow startup with `make_cluster()` on Apple aarch64 #123

Closed mtmorgan closed 2 months ago

mtmorgan commented 2 months ago

A simple example shows that parallel::makePSOCKcluster() is roughly 5.6x faster than mirai's make_cluster() on some platforms (0.586 s vs 3.282 s elapsed below).

> system.time(cl <- parallel::makePSOCKcluster(10)); parallel::stopCluster(cl)
   user  system elapsed
  0.008   0.006   0.586
> system.time(m <- mirai::make_cluster(10))
   user  system elapsed
  0.020   0.080   3.282

I think this is because the PSOCK processes are launched asynchronously and then collected, whereas mirai's daemons are created strictly synchronously.

> sessionInfo()
R version 4.4.1 Patched (2024-06-20 r86819)
Platform: aarch64-apple-darwin23.5.0
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /Users/ma38727/bin/R-4-4-branch/lib/libRblas.dylib
LAPACK: /Users/ma38727/bin/R-4-4-branch/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/New_York
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
 [1] processx_3.8.4      BiocManager_1.30.23 compiler_4.4.1
 [4] R6_2.5.1            cli_3.6.3           parallel_4.4.1
 [7] tools_4.4.1         curl_5.2.1          remotes_2.5.0.9000
[10] desc_1.4.3          callr_3.7.6         ps_1.7.6
[13] pkgbuild_1.4.4      mirai_1.1.1.9000    nanonext_1.1.1

shikokuchuo commented 2 months ago

The parallel interface currently requires synchronization of each daemon on startup for additional safety to ensure that computations can begin immediately (e.g. in a batch script). I have prioritised robustness and not looked to particularly optimise here (as daemon startup is a one-off rather than a recurring operation).

Having said that, for the same R version on Linux on x86_64 (Ubuntu jammy) there is practically no difference in speed when I test your reprex above, so I can only assume that (as you mention) this is only for 'some platforms' i.e. aarch64-apple-darwin23.5.0. I have no particular insight as to this platform.

EDIT: I've tested on Windows x86_64 and mirai is actually faster here as I guess the underlying implementation is more optimal than base R's on this platform.

It may be possible to avoid this startup overhead by using the native mirai interface with mirai_map() instead of the parallel cluster interface, if that's an option.
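As a rough sketch of what the native interface might look like (not taken from this thread): the daemon count of 4 and the squaring function are arbitrary choices for illustration, and this assumes a mirai version that provides mirai_map().

```r
library(mirai)

# Start 4 local daemons (the count is an arbitrary choice for this sketch).
daemons(4)

# Map a function over the inputs; `[]` waits for and collects all results.
res <- mirai_map(1:10, function(x) x^2)[]

# Reset daemons when finished.
daemons(0)

print(unlist(res))
```

Whether this is actually faster to start than make_cluster() on a given platform would need to be timed in the same way as the reprex above.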

shikokuchuo commented 2 months ago

I see that I had a naive implementation that really did do launches synchronously... So now in #124 I launch all daemons asynchronously, and synchronize afterwards.
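To illustrate the general difference between the two launch strategies, here is a toy sketch in base R. It is not mirai's or parallel's actual implementation: the "workers" are plain Rscript processes, the 0.5 s sleep stands in for startup cost, and the flag-file handshake (`launch()`, `wait_for()`) is a hypothetical stand-in for real daemon synchronization.

```r
# Toy model only: each "worker" is an Rscript process that sleeps briefly
# (standing in for startup cost) and then creates a flag file to signal readiness.
n <- 4
rscript <- file.path(R.home("bin"), "Rscript")
flag_dir <- tempfile("flags")
dir.create(flag_dir)

# Worker code receives the flag path as a trailing command-line argument.
worker_code <- "Sys.sleep(0.5); invisible(file.create(commandArgs(TRUE)[1]))"

launch <- function(flag, wait) {
  system2(rscript, c("-e", shQuote(worker_code), shQuote(flag)), wait = wait)
}

# Block until every flag file exists, or fail after `timeout` seconds.
wait_for <- function(flags, timeout = 60) {
  deadline <- Sys.time() + timeout
  while (!all(file.exists(flags))) {
    if (Sys.time() > deadline) stop("workers did not signal in time")
    Sys.sleep(0.01)
  }
  invisible(TRUE)
}

sync_flags  <- file.path(flag_dir, paste0("sync", seq_len(n)))
async_flags <- file.path(flag_dir, paste0("async", seq_len(n)))

# Strictly synchronous: each launch waits for the previous worker to finish.
sync_time <- system.time(
  for (f in sync_flags) launch(f, wait = TRUE)
)[["elapsed"]]

# Asynchronous: start every worker first, then synchronize once at the end.
async_time <- system.time({
  for (f in async_flags) launch(f, wait = FALSE)
  wait_for(async_flags)
})[["elapsed"]]

cat(sprintf("synchronous: %.2fs, asynchronous: %.2fs\n", sync_time, async_time))
```

In the synchronous loop the per-worker costs add up, whereas in the asynchronous version they overlap, so the second timing should typically be the shorter one.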

On Linux, mirai::make_cluster(10) now only takes 1/3 of the time of parallel::makePSOCKcluster(10).

Thanks @mtmorgan for pointing this out, much appreciated - this will help not just those using the parallel interface. I expect this should also fix the slowness you were experiencing on apple-aarch64.

mtmorgan commented 2 months ago

These are now approximately comparable on my machine:

> system.time(cl <- parallel::makePSOCKcluster(10)); parallel::stopCluster(cl)
   user  system elapsed
  0.008   0.010   0.601
> system.time(m <- mirai::make_cluster(10))
   user  system elapsed
  0.039   0.051   0.642

Thanks for looking into this.

shikokuchuo commented 2 months ago

You're welcome. Thanks for posting an issue!