ggloor / ALDEx_bioc

ALDEx_bioc is the working directory for updating bioconductor

How to run ALDEx2 on multiple cores? #32

Closed: MonicaSteffi closed this issue 3 years ago

MonicaSteffi commented 3 years ago

Dear all, I have PICRUSt output for a large dataset with 4000 KOs and 4570 samples. I would like to run differential abundance testing on it. Since this will take a lot of time, I would like to run it on multiple cores under Slurm.

This is my slurm code:

#!/bin/bash
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=20
#SBATCH --output=picrust.txt
#SBATCH --job-name=aldex2
#SBATCH --mem-per-cpu=10000
sbatch r_container.sh

I know there is an option useMC=TRUE in aldex to enable multicore. In my R script, I gave the following command:

Region_pathabun <- aldex(round(Region_path), conds, effect=TRUE, useMC=TRUE)

But it was still running on a single core. How do I specify the number of cores in the aldex function?

ggloor commented 3 years ago

Hi Monica, what a great dataset. Not all functions in ALDEx2 have multicore support.
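For the functions that do accept useMC=TRUE, the worker count is not an argument of the call itself. The following is a minimal sketch, under the assumption that the multicore path dispatches through BiocParallel and picks up whatever default backend has been registered. The variable name n_workers is illustrative only; SLURM_CPUS_PER_TASK is the environment variable Slurm sets when --cpus-per-task is given.

```r
# Read the core count granted by Slurm; fall back to 1 outside a Slurm job.
n_workers <- suppressWarnings(
  as.integer(Sys.getenv("SLURM_CPUS_PER_TASK", unset = "1"))
)
if (is.na(n_workers) || n_workers < 1L) n_workers <- 1L

# Assumption: useMC=TRUE uses the default BiocParallel backend, so
# registering a MulticoreParam here controls how many workers are used.
if (requireNamespace("BiocParallel", quietly = TRUE)) {
  BiocParallel::register(BiocParallel::MulticoreParam(workers = n_workers))
}
```

With #SBATCH --cpus-per-task=20 this would register 20 workers. It would need to run before any call that passes useMC=TRUE, and whether it speeds up a given step depends on whether that step is parallelized at all.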

With a dataset that has about as many samples as features, I would suggest reducing the number of Monte Carlo replicates to 8 or 16, since the biological samples already capture much of the variance. You could try mc.samples=16 to reduce the run time.
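A minimal sketch of what passing mc.samples=16 might look like, using a small simulated count table rather than the real data. The names toy_counts, toy_conds, toy_clr and toy_tt are hypothetical, and the ALDEx2 calls only run if the package is installed.

```r
set.seed(1)

# Hypothetical toy data: 50 features (rows) by 20 samples (columns).
toy_counts <- as.data.frame(
  matrix(rpois(50 * 20, lambda = 30), nrow = 50)
)
toy_conds <- rep(c("A", "B"), each = 10)

if (requireNamespace("ALDEx2", quietly = TRUE)) {
  # 16 Monte Carlo instances per sample instead of the default 128.
  toy_clr <- ALDEx2::aldex.clr(toy_counts, toy_conds,
                               mc.samples = 16, denom = "all",
                               verbose = FALSE)
  toy_tt <- ALDEx2::aldex.ttest(toy_clr, paired.test = FALSE)
}
```

Since the work scales with the number of Monte Carlo instances, dropping from 128 to 16 should cut that part of the run time substantially, at the cost of noisier per-feature estimates.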

I will work on including a multicore environment for all functions.

MonicaSteffi commented 3 years ago

Thank you.

ggloor commented 3 years ago

Hi Monica. The bad news: looking through the code a bit more, it seems that most functions that can be run in parallel are already set up to use multicore. The real issue is that the bottlenecks in the code are not in the parts that are trivial to parallelize. The good news: there are a number of bottlenecks that I think I can address. Thanks for making me look through the code again; I think I can make some speed enhancements.

MonicaSteffi commented 3 years ago

Hi @ggloor

Thank you so much.

Somehow I managed to run ALDEx2 on my dataset. First I ran aldex.clr(), followed by aldex.ttest(). However, I am now facing a different problem: I get only NaN for the wi (Wilcoxon) test, but valid values for the we (Welch's t) test.

pathway  | we.ep    | we.eBH   | wi.ep | wi.eBH
PWY-7013 | 7.47E-01 | 7.50E-01 | NaN   | NaN
PWY-5971 | 6.91E-01 | 6.95E-01 | NaN   | NaN
PWY-6630 | 6.21E-01 | 6.26E-01 | NaN   | NaN
PWY-5531 | 6.15E-01 | 6.19E-01 | NaN   | NaN
PWY-7159 | 3.87E-01 | 3.91E-01 | NaN   | NaN
PWY-7090 | 2.85E-01 | 2.89E-01 | NaN   | NaN

I have attached my output RDS file. How do I trace the error? https://www.dropbox.com/s/v2lo6b0g84q8odq/Region_South_North_KOabun_in_filt_aldex_1.rds?dl=0

str() of my data:

'data.frame':    4000 obs. of  4527 variables:
 $ run5.326. : num  6623.02 2.05 0 33.59 3836.14 ...
 $ run5.327. : num  7293.73 1.41 0 19.89 4295.43 ...
 $ run5.328. : num  6929.47 2.69 0 0 3538.15 ...
 $ run5.329. : num  6311.76 1.12 0 0 3208.59 ...
 $ run5.330. : num  6533.48 1.94 0 0 4170.34 ...

I executed the following command

Region_clr <- aldex.clr(round(Region_pathabun_in), conds, denom = "all", verbose = TRUE, useMC = TRUE)
Region_aldex <- aldex.ttest(Region_clr, paired.test = FALSE, hist.plot = FALSE, verbose = TRUE)
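As a first step in tracing the NaN values, it may help to check whether they affect every feature or only some. The sketch below uses only base R and rebuilds the six rows shown in the table above as a stand-in for the real result; the names res, nan_per_column, wi_cols and all_nan_features are illustrative. It only locates the NaN values and does not explain their cause.

```r
# Stand-in for the aldex.ttest() result: the six rows quoted above.
res <- data.frame(
  we.ep  = c(0.747, 0.691, 0.621, 0.615, 0.387, 0.285),
  we.eBH = c(0.750, 0.695, 0.626, 0.619, 0.391, 0.289),
  wi.ep  = rep(NaN, 6),
  wi.eBH = rep(NaN, 6),
  row.names = c("PWY-7013", "PWY-5971", "PWY-6630",
                "PWY-5531", "PWY-7159", "PWY-7090")
)

# Count NaN entries in each output column.
nan_per_column <- vapply(res, function(x) sum(is.nan(x)), integer(1))

# List the features whose Wilcoxon columns are both NaN.
wi_cols <- c("wi.ep", "wi.eBH")
all_nan_features <- rownames(res)[
  rowSums(is.nan(as.matrix(res[, wi_cols]))) == length(wi_cols)
]

print(nan_per_column)
```

On the real output, res would instead come from readRDS() on the saved file. If every feature is NaN in the wi columns, the problem is more likely in the inputs shared by all features (for example the conds vector) than in individual pathways.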