stephenslab / gbcd

R package for generalized binary covariance decomposition.
https://stephenslab.github.io/gbcd/

Parallelization #4

Open ruiye88 opened 1 week ago

ruiye88 commented 1 week ago

Hi,

Thank you for developing this wonderful tool. Just curious, are you planning to add parallel computing options for the function?

Rui

pcarbo commented 1 week ago

Thanks for your interest @ruiye88. Have you tried the current implementation on your data set? Is it too slow? Could you tell us a little bit more about the size of your data set?

ruiye88 commented 1 week ago

Hi Peter, thanks for the quick response. I tried a test run on a subset of my dataset (~1500 cells, maxiter1 = 100, maxiter2 = 50, maxiter3 = 50) and it took about 20-30 minutes. My full dataset has ~50K cells. Do you have an estimate of how long it might take to run the full dataset? Also, I'm assuming most users will want to run multiple Kmax values to compare the results, so it would be really helpful if parallel computation could be implemented.

pcarbo commented 1 week ago

@ruiye88 In the paper, we ran on a dataset containing ~35,000 cells, which is quite comparable to your dataset. Does your counts matrix have a high proportion of zeros and, if so, is it encoded as a sparse matrix? My understanding is that if your Y matrix has many rows and is sparse, gbcd will run faster. In particular, it runs the more efficient method that does not compute the (dense) N x N covariance matrix if this condition is satisfied:

2 * ncol(Y) * mean(Y > 0) < nrow(Y)

For us, the more efficient implementation ran on the dataset with ~35,000 cells in about 20 hours.
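Here is a small sketch of how you might check that condition and make sure Y is stored in sparse format, assuming Y is your cells-by-genes counts matrix (the variable names here are just for illustration):

```r
library(Matrix)

## Proportion of nonzero entries in the counts matrix.
prop_nonzero <- mean(Y > 0)

## Condition under which gbcd should use the more efficient method
## that avoids forming the dense N x N covariance matrix.
uses_fast_path <- 2 * ncol(Y) * prop_nonzero < nrow(Y)
print(uses_fast_path)

## Store Y as a sparse matrix if it is not one already.
if (!is(Y, "sparseMatrix"))
  Y <- as(Y, "CsparseMatrix")
```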

You could potentially also run on multiple Kmax values in parallel (e.g., using mclapply), although it may use a lot of memory.
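For example, something along these lines (a rough sketch only; it assumes the fitting function is fit_gbcd() with a Kmax argument, so adjust to the actual interface, and keep in mind that each parallel fit holds its own intermediate results in memory):

```r
library(parallel)
library(gbcd)

## Hypothetical choices of Kmax to compare.
kmax_values <- c(10, 20, 30)

## Fit gbcd once per Kmax value, one fit per core.
fits <- mclapply(kmax_values, function(k) fit_gbcd(Y = Y, Kmax = k),
                 mc.cores = length(kmax_values))
names(fits) <- paste0("Kmax_", kmax_values)
```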

There is some support for parallel computation in the current implementation if you have installed R with a version of the BLAS library that supports multithreading, such as OpenBLAS or Intel MKL; that should speed things up a bit, although it is more important to make sure your data are encoded properly as a sparse matrix.
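You can check which BLAS/LAPACK your R installation is linked against with:

```r
## Paths to the BLAS and LAPACK libraries R is using.
sessionInfo()$BLAS
sessionInfo()$LAPACK
```

If it is a multithreaded BLAS, the number of threads is typically controlled via an environment variable such as OPENBLAS_NUM_THREADS or MKL_NUM_THREADS (set before starting R), or from within R using the optional RhpcBLASctl package.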

Hope this helps.