aksarkar opened this issue 8 months ago
@aksarkar I tried your first example with the latest version of fastTopics
(0.6-172), and did not get a segmentation fault (on my MacBook Pro Apple M2, R 4.3.3).
Segfaults are difficult to debug.
Okay I was able to generate a segfault with the following code:
library(Matrix)
library(fastTopics)
set.seed(1)
X <- simulate_count_data(80,100,k = 3,sparse = TRUE)$X
X <- X[,colSums(X > 0) > 0]
fit0 <- fit_poisson_nmf(X,k = 3,numiter = 10,method = "em",
                        init.method = "random")
fit_scd <- fit_poisson_nmf(X,fit0 = fit0,numiter = 10,method = "scd",
                           control = list(extrapolate = TRUE,nc = 2,
                                          nc.blas = 1),
                           verbose = "detailed")
If I set sparse = FALSE in the above code, the segfault goes away. Also, rerunning the code with sparse = TRUE afterward in the same session does not produce the segfault.
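For reference, here is a sketch of the non-crashing variant (the only intended change from the reproduction code above is the sparse flag):

library(Matrix)
library(fastTopics)
set.seed(1)
# Same simulation as above, but returning a dense count matrix.
X <- simulate_count_data(80,100,k = 3,sparse = FALSE)$X
X <- X[,colSums(X > 0) > 0]
fit0 <- fit_poisson_nmf(X,k = 3,numiter = 10,method = "em",
                        init.method = "random")
# With a dense X, this call runs to completion without a segfault.
fit_scd <- fit_poisson_nmf(X,fit0 = fit0,numiter = 10,method = "scd",
                           control = list(extrapolate = TRUE,nc = 2,
                                          nc.blas = 1),
                           verbose = "detailed")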
At this stage, I'm not sure how to debug this — it could be an issue caused by a recent update to one of the Rcpp packages, for example.
@aksarkar If you get any more potentially useful information about this bug, please post here.
@pcarbo https://github.com/RcppCore/RcppParallel/issues/110 appears to be relevant.
Interesting discussion, but what I find odd is that the segfault only occurred for me when the input matrix was sparse.
If I build fastTopics with -DRCPP_PARALLEL_USE_TBB=0, then I don't get segfaults in either my example or yours. I also get much faster performance with multiple threads in fit_poisson_nmf. I will do some more tests before recommending switching away from TBB.
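For concreteness, here is a rough sketch of one way to pass that flag when building from a local clone of the package (the clone path is hypothetical, and appending to src/Makevars is just one place the flag could go):

# Append the preprocessor flag to the package's src/Makevars, then
# reinstall from source. The path to the local clone is an assumption.
makevars <- "~/git/fastTopics/src/Makevars"
cat("\nPKG_CPPFLAGS += -DRCPP_PARALLEL_USE_TBB=0\n",
    file = makevars,append = TRUE)
install.packages("~/git/fastTopics",repos = NULL,type = "source")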
My understanding is that RCPP_PARALLEL_USE_TBB=0 turns off multithreading (based on this), so I don't see how you can get faster performance with multiple cores when RCPP_PARALLEL_USE_TBB=0.
If RcppParallel does not use TBB, then it uses TinyThreads instead (https://github.com/RcppCore/RcppParallel/blob/5aa08f8b546d2fa99372d02d7b6a50344fb09ff3/inst/include/RcppParallel.h#L46). TinyThreads uses OS threads and relies on OS scheduling.
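As a side note, the number of worker threads RcppParallel uses can be inspected and set from R regardless of which backend was compiled in; a small sketch follows (whether the nc control in fit_poisson_nmf maps onto setThreadOptions is an assumption on my part):

library(RcppParallel)
# Number of hardware threads RcppParallel detects on this machine.
defaultNumThreads()
# Ask RcppParallel to use two worker threads for subsequent parallel code.
setThreadOptions(numThreads = 2)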
Another thing I noticed is that almost all of the compute time is spent in apparently serial code running on one core. I think addressing that will address the lack of performance gain I observed.
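One rough way to quantify that, as a sketch reusing X and fit0 from the reproduction code above (the timings themselves are not from this thread):

# Compare elapsed times with 1 versus 4 threads; nearly identical elapsed
# times would suggest the bottleneck is in the serial part of the code.
t1 <- system.time(fit_poisson_nmf(X,fit0 = fit0,numiter = 10,method = "scd",
                                  control = list(extrapolate = TRUE,nc = 1)))
t4 <- system.time(fit_poisson_nmf(X,fit0 = fit0,numiter = 10,method = "scd",
                                  control = list(extrapolate = TRUE,nc = 4)))
rbind(nc.1 = t1,nc.4 = t4)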
The following example leads to a segfault when run like this. The same occurs for any other choice of method. Interestingly, the following example does not: