aksarkar opened this issue 8 months ago
@aksarkar I tried your first example with the latest version of fastTopics
(0.6-172), and did not get a segmentation fault (on my MacBook Pro Apple M2, R 4.3.3).
Segfaults are difficult to debug.
Okay I was able to generate a segfault with the following code:
library(Matrix)
library(fastTopics)
set.seed(1)
X <- simulate_count_data(80,100,k = 3,sparse = TRUE)$X
X <- X[,colSums(X > 0) > 0]
fit0 <- fit_poisson_nmf(X,k = 3,numiter = 10,method = "em",
                        init.method = "random")
fit_scd <- fit_poisson_nmf(X,fit0 = fit0,numiter = 10,method = "scd",
                           control = list(extrapolate = TRUE,nc = 2,
                                          nc.blas = 1),
                           verbose = "detailed")
If I set sparse = FALSE in the above code, the segfault goes away. Also, rerunning the code with sparse = TRUE afterward in the same session does not produce the segfault.
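For reference, here is a sketch of the non-crashing variant (the only intended change from the reproduction code above is the sparse flag):

library(Matrix)
library(fastTopics)
set.seed(1)
# Same simulation as above, but returning a dense count matrix.
X <- simulate_count_data(80,100,k = 3,sparse = FALSE)$X
X <- X[,colSums(X > 0) > 0]
fit0 <- fit_poisson_nmf(X,k = 3,numiter = 10,method = "em",
                        init.method = "random")
# With a dense X, this call runs to completion without a segfault.
fit_scd <- fit_poisson_nmf(X,fit0 = fit0,numiter = 10,method = "scd",
                           control = list(extrapolate = TRUE,nc = 2,
                                          nc.blas = 1),
                           verbose = "detailed")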
At this stage, I'm not sure how to debug this — it could be an issue caused by a recent update to one of the Rcpp packages, for example.
@aksarkar If you get any more potentially useful information about this bug, please post here.
@pcarbo https://github.com/RcppCore/RcppParallel/issues/110 appears to be relevant.
Interesting discussion, but what I find odd is that the segfault only occurred for me when the input matrix was sparse.
If I build fastTopics with -DRCPP_PARALLEL_USE_TBB=0, then I don't get segfaults in either my example or yours. I also get much faster performance with multiple threads in fit_poisson_nmf. I will do some more tests before recommending switching away from TBB.
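For concreteness, here is a rough sketch of one way to pass that flag when building from a local clone of the package (the clone path is hypothetical, and appending to src/Makevars is just one place the flag could go):

# Append the preprocessor flag to the package's src/Makevars, then
# reinstall from source. The path to the local clone is an assumption.
makevars <- "~/git/fastTopics/src/Makevars"
cat("\nPKG_CPPFLAGS += -DRCPP_PARALLEL_USE_TBB=0\n",
    file = makevars,append = TRUE)
install.packages("~/git/fastTopics",repos = NULL,type = "source")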
My understanding is that RCPP_PARALLEL_USE_TBB=0 turns off multithreading (based on this), so I don't see how you can get faster performance with multiple cores when RCPP_PARALLEL_USE_TBB=0.
If RcppParallel does not use TBB, then it uses TinyThreads instead (https://github.com/RcppCore/RcppParallel/blob/5aa08f8b546d2fa99372d02d7b6a50344fb09ff3/inst/include/RcppParallel.h#L46). TinyThreads uses OS threads and relies on OS scheduling.
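As a side note, the number of worker threads RcppParallel uses can be inspected and set from R regardless of which backend was compiled in; a small sketch follows (whether the nc control in fit_poisson_nmf maps onto setThreadOptions is an assumption on my part):

library(RcppParallel)
# Number of hardware threads RcppParallel detects on this machine.
defaultNumThreads()
# Ask RcppParallel to use two worker threads for subsequent parallel code.
setThreadOptions(numThreads = 2)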
Another thing I noticed is that almost all of the compute time is spent in apparently serial code running on one core. I think addressing that will address the lack of performance gain I observed.
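One rough way to quantify that, as a sketch reusing X and fit0 from the reproduction code above (the timings themselves are not from this thread):

# Compare elapsed times with 1 versus 4 threads; nearly identical elapsed
# times would suggest the bottleneck is in the serial part of the code.
t1 <- system.time(fit_poisson_nmf(X,fit0 = fit0,numiter = 10,method = "scd",
                                  control = list(extrapolate = TRUE,nc = 1)))
t4 <- system.time(fit_poisson_nmf(X,fit0 = fit0,numiter = 10,method = "scd",
                                  control = list(extrapolate = TRUE,nc = 4)))
rbind(nc.1 = t1,nc.4 = t4)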
The following example leads to a segfault when run like this. The same occurs for any other choice of method. Interestingly, the following example does not: