@aksarkar Can you please try out the better-multithread branch and let me know what you think? Note this does not address the mclapply memory issue.
@aksarkar Just checking to see if you have a few minutes to review these changes.
@pcarbo Some testing indicates that limiting OpenBLAS parallelism is also needed for fit_poisson_nmf.
On commit c71461e, running
devtools::load_all()
data(pbmc_facs)
num_physical_cores <- RhpcBLASctl::get_num_cores()
num_logical_cores <- RhpcBLASctl::get_num_procs()
blas_default_threads <- RhpcBLASctl::blas_get_num_procs()
my_init <- function(nc) {
  fit_poisson_nmf(pbmc_facs$counts, k=6, method='em', init.method='random',
                  verbose='none', numiter=10, control=list(nc=nc))
}
my_fit <- function(fit0, nc) {
  fit_poisson_nmf(pbmc_facs$counts, fit0=fit0, method='scd',
                  verbose='none', numiter=1e3,
                  control=list(nc=nc, extrapolate=TRUE, min.res=1e-2))
}
for (n in c(num_physical_cores, num_logical_cores)) {
  for (m in c(blas_default_threads, 1)) {
    print(sprintf('%d RcppParallel workers, %d OpenBLAS threads', n, m))
    RhpcBLASctl::blas_set_num_threads(m)
    set.seed(1)
    print(system.time(fit0 <- my_init(n)))
    print(system.time(my_fit(fit0, n)))
  }
}
yields
[1] "16 RcppParallel workers, 32 OpenBLAS threads"
user system elapsed
73.165 141.964 11.883
Stopping condition is met at iteration 120: max. KKT residual < 1.00e-02
user system elapsed
1225.925 2637.232 160.461
[1] "16 RcppParallel workers, 1 OpenBLAS threads"
user system elapsed
25.081 0.126 8.246
Stopping condition is met at iteration 120: max. KKT residual < 1.00e-02
user system elapsed
354.448 1.007 96.804
[1] "32 RcppParallel workers, 32 OpenBLAS threads"
user system elapsed
80.329 142.079 10.556
Stopping condition is met at iteration 120: max. KKT residual < 1.00e-02
user system elapsed
1158.539 2309.982 147.589
[1] "32 RcppParallel workers, 1 OpenBLAS threads"
user system elapsed
36.314 0.437 7.999
Stopping condition is met at iteration 120: max. KKT residual < 1.00e-02
user system elapsed
595.774 2.235 95.764
@aksarkar I'm not sure I follow. It appears that the best runtimes are with 1 OpenBLAS thread. Can you clarify?
The results are similar to those from my benchmarking of de_analysis: OpenBLAS threads should be set to 1 by default for all computations. Currently, there is only support for setting nc.blas in de_analysis (as of 02c10d44ac2ea2899549869505973d3b52a49dac). I suggest adding an option to set it in fit_poisson_nmf as well, and setting the default to nc.blas=1.
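A minimal sketch of what honoring such an option inside fit_poisson_nmf could look like, reusing the RhpcBLASctl calls from the benchmark above (the with_blas_threads helper and its placement are hypothetical, not the package's actual implementation):

# Hypothetical helper: temporarily limit the number of OpenBLAS threads for
# the duration of a computation, restoring the caller's setting on exit.
with_blas_threads <- function(nc.blas, expr) {
  old <- RhpcBLASctl::blas_get_num_procs()
  RhpcBLASctl::blas_set_num_threads(nc.blas)
  on.exit(RhpcBLASctl::blas_set_num_threads(old), add = TRUE)
  force(expr)  # evaluate the computation with the reduced thread count
}

# e.g., inside fit_poisson_nmf the main loop could be wrapped as
# with_blas_threads(control$nc.blas, <run the EM/SCD updates>)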
Sure, that makes sense; I'll make that change in the better-multithread branch.
@aksarkar I've merged the better-multithread branch into master; this now implements the nc.blas control setting for fit_poisson_nmf() and de_analysis() (with a default setting of 1), and incorporates most of the suggested changes from your pull request. Can you please test it out and let me know if additional improvements are needed?
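For reference, an illustrative call with the new control setting, mirroring the benchmark above (nc.blas=1 is the default, so it is spelled out here only for clarity):

fit <- fit_poisson_nmf(pbmc_facs$counts, k=6, method='scd', numiter=100,
                       control=list(nc=16, nc.blas=1))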
@aksarkar Same question after merging your latest pull request (PR #54).
I think the documentation issue is resolved.
Great, thanks!
There are several types of parallel computing going on in the package:

1. Multi-threading within BLAS. Based on some simple benchmarking (https://gist.github.com/aksarkar/664b8ce987f7b6be8934cb2a0216d909), this appears to require being disabled, e.g., via RhpcBLASctl::blas_set_num_threads(1).

2. Multi-threading within RcppParallel, which is used for parallelFor in the EM/SCD updates.

3. Multi-processing using mclapply, which is used in de_analysis.

Currently, control$nc is used to specify both (2) and (3), which is a bit confusing and should be clarified in the relevant parts of the documentation and verbose output. Compare https://www.rdocumentation.org/packages/parallel/versions/3.4.1/topics/mclapply to https://rcppcore.github.io/RcppParallel/#threads_used; a brief illustration of the difference is sketched at the end of this comment.

(3) leads to high memory usage, which appears to be essentially unavoidable, since every subprocess gets a copy of the entire R environment at the time mclapply is called. This should be mentioned in the docs, because it also leads to unexpected behavior (hanging).

And (1) is not currently taken care of by the package, which leads to confusing behavior (#37). It might be better to take care of that in this package rather than documenting it and expecting the user to take care of it.
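To make the distinction between (2) and (3) concrete, here is an illustrative comparison of how the two mechanisms are configured in general (not the package's internal code; the numbers of threads/processes are arbitrary):

# (2) RcppParallel: worker threads are set globally for the session, and
#     parallelFor splits work across threads within a single R process.
RcppParallel::setThreadOptions(numThreads = 4)

# (3) mclapply: forks separate R processes; each child starts with a copy of
#     the parent R environment at the time of the call, which is where the
#     high memory usage comes from.
res <- parallel::mclapply(1:8, function(i) i^2, mc.cores = 4)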