stephenslab / fastTopics

Fast algorithms for fitting topic models and non-negative matrix factorizations to count data.
https://stephenslab.github.io/fastTopics
Other
77 stars 7 forks source link

Improve documentation for parallel computing options #39

Closed aksarkar closed 7 months ago

aksarkar commented 1 year ago

There are several types of parallel computing going on in the package:

  1. multi-threading within BLAS. Based on some simple benchmarking (https://gist.github.com/aksarkar/664b8ce987f7b6be8934cb2a0216d909), this appears to require being disabled like

    RhpcBLASctl::blas_set_num_threads(1)
  2. multi-threading within RcppParallel, which is used for parallelFor in EM/SCD updates

  3. multi-processing using mclapply, which is used in de_analysis.

Currently, control$nc is used to specify both (2) and (3), which is a bit confusing and should be clarified in the relevant parts of the documentation and verbose output. Compare https://www.rdocumentation.org/packages/parallel/versions/3.4.1/topics/mclapply to https://rcppcore.github.io/RcppParallel/#threads_used

(3) leads to high memory usage which appears to be essentially unavoidable, since every subprocess gets a copy of the entire R environment at the time mclapply is called. This should be mentioned in the docs, because it also leads to unexpected behavior (hanging).

And, (1) is not currently taken care of by the package, which leads to confusing behavior (#37). It might be better to take care of that in this package rather than documenting and expecting the user to take care of it.

pcarbo commented 1 year ago

@aksarkar Can you please try out the better-multithread branch and let me know what you think? Note this does not address the mclapply memory issue.

pcarbo commented 1 year ago

@aksarkar Just checking to see if you have a few minutes to review these changes.

aksarkar commented 8 months ago

@pcarbo Some testing indicates that limiting OpenBLAS parallelism is also needed for fit_poisson_nmf.

On c71461e

devtools::load_all()
data(pbmc_facs)

num_physical_cores <- RhpcBLASctl::get_num_cores()
num_logical_cores <- RhpcBLASctl::get_num_procs()
blas_default_threads <- RhpcBLASctl::blas_get_num_procs()

my_init <- function(nc) {
  fit_poisson_nmf(pbmc_facs$counts, k=6, method='em', init.method='random',
                  verbose='none', numiter=10, control=list(nc=nc))
}

my_fit <- function(fit0, nc) {
  fit_poisson_nmf(pbmc_facs$counts, fit0=fit0, method='scd',
                  verbose='none', numiter=1e3,
                  control=list(nc=nc, extrapolate=TRUE, min.res=1e-2))
}

for (n in c(num_physical_cores, num_logical_cores)) {
  for (m in c(blas_default_threads, 1)) {
    print(sprintf('%d RcppParallel workers, %d OpenBLAS threads', n, m))
    RhpcBLASctl::blas_set_num_threads(m)
    set.seed(1)
    print(system.time(fit0 <- my_init(n)))
    print(system.time(my_fit(fit0, n)))
  }
}

yields

[1] "16 RcppParallel workers, 32 OpenBLAS threads"
   user  system elapsed
 73.165 141.964  11.883
Stopping condition is met at iteration 120: max. KKT residual < 1.00e-02
    user   system  elapsed
1225.925 2637.232  160.461
[1] "16 RcppParallel workers, 1 OpenBLAS threads"
   user  system elapsed
 25.081   0.126   8.246
Stopping condition is met at iteration 120: max. KKT residual < 1.00e-02
   user  system elapsed
354.448   1.007  96.804
[1] "32 RcppParallel workers, 32 OpenBLAS threads"
   user  system elapsed
 80.329 142.079  10.556
Stopping condition is met at iteration 120: max. KKT residual < 1.00e-02
    user   system  elapsed
1158.539 2309.982  147.589
[1] "32 RcppParallel workers, 1 OpenBLAS threads"
   user  system elapsed
 36.314   0.437   7.999
Stopping condition is met at iteration 120: max. KKT residual < 1.00e-02
   user  system elapsed
595.774   2.235  95.764
pcarbo commented 8 months ago

@aksarkar I'm not sure I follow. It appears that the best runtimes are with 1 OpenBLAS thread. Can you clarify?

aksarkar commented 8 months ago

The results are similar as for my benchmarking of de_analysis: OpenBLAS threads should be set to 1 by default for all computations.

Currently, there is only support for setting nc.blas in de_analysis as of 02c10d44ac2ea2899549869505973d3b52a49dac. I suggest adding an option to set it in fit_poisson_nmf also, and setting the default to nc.blas=1.

pcarbo commented 8 months ago

Sure, that makes sense — I'll make that change in the better-multithread branch.

pcarbo commented 8 months ago

@aksarkar I've merged the better-multithread into master; this now implements the nc.blas control setting for fit_poisson_nmf() and de_analysis() (with a default setting of 1), and incorporates most of the suggested changes from your pull request. Can you please test it out and let me know if additional improvements are needed?

pcarbo commented 7 months ago

@aksarkar Same question after merging your latest pull request (PR #54).

aksarkar commented 7 months ago

I think the documentation issue is resolved.

pcarbo commented 7 months ago

Great, thanks!