rprops / Phenoflow_package

R package offering functionality for the advanced analysis of microbial flow cytometry data
GNU General Public License v2.0

Parallelize Diversity_16S #26

Closed: FMKerckhof closed this issue 7 years ago

FMKerckhof commented 7 years ago

Consider using parallel::mclapply to speed up the computation by distributing the samples over different cores. It could replace the for loop at https://github.com/rprops/Phenoflow_package/blob/master/R/Diversity_16S.R#L44

However, in the long run, using Rcpp would be the more efficient solution here.
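
A minimal sketch of the mclapply suggestion, assuming a plain count matrix with samples as rows instead of the phyloseq object the real function takes. `hill_numbers()` and the toy matrix are hypothetical, and Hill numbers of order 0, 1 and 2 are only an assumed stand-in for the metrics computed in the package. Because `mclapply` relies on forking, it can only run with `mc.cores = 1` on Windows.

```r
library(parallel)

# Hypothetical per-sample worker: Hill numbers of order 0, 1 and 2
hill_numbers <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  c(D0 = length(p), D1 = exp(-sum(p * log(p))), D2 = 1 / sum(p^2))
}

# Toy data (made up): 4 samples x 3 taxa, samples as rows
otu <- matrix(c(10, 10, 10,
                30,  0,  0,
                 5, 15, 10,
                 1,  1, 28),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("sample", 1:4), paste0("taxon", 1:3)))

# Forking is unavailable on Windows, so fall back to a single core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# for-loop version, kept for comparison
serial <- vector("list", nrow(otu))
for (i in seq_len(nrow(otu))) serial[[i]] <- hill_numbers(otu[i, ])

# mclapply version: one sample per task
par_res <- mclapply(seq_len(nrow(otu)),
                    function(i) hill_numbers(otu[i, ]),
                    mc.cores = n_cores)

# Collect the per-sample vectors into one matrix
div <- do.call(rbind, par_res)
rownames(div) <- rownames(otu)
```

The loop body moves unchanged into the function passed to `mclapply`, so the two versions should return the same values; only the scheduling differs.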

rprops commented 7 years ago

The current bottleneck is the resampling of individual samples. We should look into using apply functions to speed this up.

FMKerckhof commented 7 years ago

I agree. However, as long as rarefy_even_depth is in use, it will be slow. I think vegan's rrarefy is actually faster, although judging from the code it shouldn't be. Perhaps all the phyloseq checks and constructors in rarefy_even_depth account for the extra time?

rprops commented 7 years ago

The problem is probably the data frame structure used in phyloseq; data tables or matrices should be much faster to access.
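
To make the idea concrete, here is a rough base-R sketch of rarefying on a plain count matrix, with none of the phyloseq validity checks or constructors. It is not the code of `rarefy_even_depth` or `rrarefy`; `rarefy_counts()` and the toy matrix are hypothetical, and whether this is actually faster on real data would still need to be benchmarked.

```r
# Hypothetical helper: subsample one sample's counts down to `depth`
# individuals, drawing without replacement
rarefy_counts <- function(counts, depth) {
  stopifnot(depth <= sum(counts))
  # One entry per individual, labelled by taxon index
  individuals <- rep.int(seq_along(counts), counts)
  # Index-based sampling avoids the sample(x) pitfall when x has length 1
  drawn <- individuals[sample.int(length(individuals), depth)]
  # Count how many individuals of each taxon were drawn
  tabulate(drawn, nbins = length(counts))
}

set.seed(1)
# Made-up counts: 2 samples x 3 taxa, samples as rows
otu <- matrix(c(50, 30, 20,
                 5, 40, 15),
              nrow = 2, byrow = TRUE)

# Rarefy every sample to the smallest library size
depth <- min(rowSums(otu))
rarefied <- t(apply(otu, 1, rarefy_counts, depth = depth))
```

Every row of `rarefied` sums to `depth`, and no taxon count can exceed its original value because the draw is without replacement.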

FMKerckhof commented 7 years ago

Is there a way to give the DIV matrix attributes such as the options selected? E.g. make sure the number of bootstraps is "included" in the data structure (as a matrix attribute)?

rprops commented 7 years ago

Yes, will commit for next release

rprops commented 7 years ago

@FMKerckhof I have added attributes to the output data frame. These might be removed by certain processing steps, though.

> Diversity.clean <- Diversity_rf(flowData_transformed_all, param = param, R = 3, R.b = 3, cleanFCS = TRUE, cleanparam = c(9,11))
Tue Apr 11 09:10:51 2017 --- Using the following parameters for removing errant collection events
 in samples with > 30,000 cells: FL1-H FL3-H
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_0_0h_SYBR_START_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_0_10h_SYBR_START_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_0_4h_SYBR_START_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_0_6h_SYBR_START_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_1_A_SYBR_OPERATION_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_17_A_SYBR_OPERATION_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_18_A_SYBR_OPERATION_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_19_A_SYBR_OPERATION_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_2_A_SYBR_OPERATION_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_24_A_SYBR_POSTCYCLE_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_25_A_SYBR_POSTCYCLE_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_26_A_SYBR_POSTCYCLE_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_3_A_SYBR_OPERATION_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_4_A_SYBR_OPERATION_rep1.fcs."
[1] "flowClean detected no problems in test_data/CYCLUS1_BEKKEN_5_A_SYBR_OPERATION_rep1.fcs."
Tue Apr 11 09:14:01 2017 --- Done with cleaning data
Tue Apr 11 09:14:01 2017 --- Starting resample run 1
Tue Apr 11 09:14:09 2017 --- Starting resample run 2
Tue Apr 11 09:14:17 2017 --- Starting resample run 3
Tue Apr 11 09:14:25 2017 --- Alpha diversity metrics (D0,D1,D2) have been computed after 3 bootstraps
There were 15 warnings (use warnings() to see them)
> attributes( Diversity.clean )
$names
[1] "Sample_names" "D0"           "D1"           "D2"           "sd.D0"        "sd.D1"        "sd.D2"       

$row.names
 [1] "CYCLUS1_BEKKEN_-1_A_SYBR_PRECYCLE_rep1.fcs"   "CYCLUS1_BEKKEN_-3_A_SYBR_PRECYCLE_rep1.fcs"  
 [3] "CYCLUS1_BEKKEN_-5_A_SYBR_PRECYCLE_rep1.fcs"   "CYCLUS1_BEKKEN_-7_A_SYBR_PRECYCLE_rep1.fcs"  
 [5] "CYCLUS1_BEKKEN_0_0h_SYBR_START_rep1.fcs"      "CYCLUS1_BEKKEN_0_10h_SYBR_START_rep1.fcs"    
 [7] "CYCLUS1_BEKKEN_0_2h_SYBR_START_rep1.fcs"      "CYCLUS1_BEKKEN_0_4h_SYBR_START_rep1.fcs"     
 [9] "CYCLUS1_BEKKEN_0_6h_SYBR_START_rep1.fcs"      "CYCLUS1_BEKKEN_1_A_SYBR_OPERATION_rep1.fcs"  
[11] "CYCLUS1_BEKKEN_10_A_SYBR_OPERATION_rep1.fcs"  "CYCLUS1_BEKKEN_11_A_SYBR_OPERATION_rep1.fcs" 
[13] "CYCLUS1_BEKKEN_12_A_SYBR_OPERATION_rep1.fcs"  "CYCLUS1_BEKKEN_13_A_SYBR_OPERATION_rep1.fcs" 
[15] "CYCLUS1_BEKKEN_14_A_SYBR_OPERATION_rep1.fcs"  "CYCLUS1_BEKKEN_15_A_SYBR_OPERATION_rep1.fcs" 
[17] "CYCLUS1_BEKKEN_16_A_SYBR_OPERATION_rep1.fcs"  "CYCLUS1_BEKKEN_17_A_SYBR_OPERATION_rep1.fcs" 
[19] "CYCLUS1_BEKKEN_18_A_SYBR_OPERATION_rep1.fcs"  "CYCLUS1_BEKKEN_19_A_SYBR_OPERATION_rep1.fcs" 
[21] "CYCLUS1_BEKKEN_2_A_SYBR_OPERATION_rep1.fcs"   "CYCLUS1_BEKKEN_20_A_SYBR_OPERATION_rep1.fcs" 
[23] "CYCLUS1_BEKKEN_21_0h_SYBR_SHUTDOWN_rep1.fcs"  "CYCLUS1_BEKKEN_21_10h_SYBR_SHUTDOWN_rep1.fcs"
[25] "CYCLUS1_BEKKEN_21_1h_SYBR_SHUTDOWN_rep1.fcs"  "CYCLUS1_BEKKEN_21_2h_SYBR_SHUTDOWN_rep1.fcs" 
[27] "CYCLUS1_BEKKEN_21_4h_SYBR_SHUTDOWN_rep1.fcs"  "CYCLUS1_BEKKEN_21_6h_SYBR_SHUTDOWN_rep1.fcs" 
[29] "CYCLUS1_BEKKEN_21_8h_SYBR_SHUTDOWN_rep1.fcs"  "CYCLUS1_BEKKEN_22_A_SYBR_POSTCYCLE_rep1.fcs" 
[31] "CYCLUS1_BEKKEN_23_A_SYBR_POSTCYCLE_rep1.fcs"  "CYCLUS1_BEKKEN_24_A_SYBR_POSTCYCLE_rep1.fcs" 
[33] "CYCLUS1_BEKKEN_25_A_SYBR_POSTCYCLE_rep1.fcs"  "CYCLUS1_BEKKEN_26_A_SYBR_POSTCYCLE_rep1.fcs" 
[35] "CYCLUS1_BEKKEN_3_A_SYBR_OPERATION_rep1.fcs"   "CYCLUS1_BEKKEN_4_A_SYBR_OPERATION_rep1.fcs"  
[37] "CYCLUS1_BEKKEN_5_A_SYBR_OPERATION_rep1.fcs"   "CYCLUS1_BEKKEN_6_A_SYBR_OPERATION_rep1.fcs"  
[39] "CYCLUS1_BEKKEN_7_A_SYBR_OPERATION_rep1.fcs"   "CYCLUS1_BEKKEN_8_A_SYBR_OPERATION_rep1.fcs"  
[41] "CYCLUS1_BEKKEN_9_A_SYBR_OPERATION_rep1.fcs"  

$class
[1] "data.frame"

$R
[1] 3

$R.b
[1] 3

$bw
[1] 0.01

$nbin
[1] 128

$d
[1] 4

$cleanFCS
[1] TRUE

$cleanparam
[1] "FL1-H" "FL3-H"
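
Since the attributes may be lost in later processing, one option is to copy them back afterwards. A minimal sketch, assuming a toy data frame in place of the real output; `copy_settings()` is a hypothetical helper that is not part of the package, and only the attribute names `R` and `bw` are taken from the listing above.

```r
# Toy stand-in for the diversity output (values are made up)
div <- data.frame(Sample_names = c("a.fcs", "b.fcs"), D2 = c(1500, 1720))
attr(div, "R")  <- 3
attr(div, "bw") <- 0.01

# Hypothetical helper: copy the named attributes from one object to another
copy_settings <- function(from, to, settings) {
  for (nm in settings) attr(to, nm) <- attr(from, nm, exact = TRUE)
  to
}

# A processing step that builds a new object (here: merging in metadata)
meta <- data.frame(Sample_names = c("a.fcs", "b.fcs"), Reactor = c("A", "B"))
processed <- merge(div, meta)

# Re-attach the settings, whether or not the step above kept them
processed <- copy_settings(div, processed, c("R", "bw"))

attr(processed, "R", exact = TRUE)
```

Using `exact = TRUE` avoids the partial matching that `attr()` otherwise applies to attribute names.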

rprops commented 7 years ago

@FMKerckhof Just an FYI: I am implementing parallel and foreach in all Diversity functions, with Linux and Windows compatibility. Should be done soon.
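
For illustration, a cross-platform pattern with the base `parallel` package alone might look like the sketch below; foreach is left out only to keep the example dependency-free. `boot_d2()` and the toy samples are hypothetical and do not reproduce the package's actual resampling; the multinomial bootstrap of an inverse Simpson index is an assumption made for the example.

```r
library(parallel)

# Hypothetical per-sample worker: bootstrap the inverse Simpson index R times.
# Kept self-contained because PSOCK workers start with an empty workspace.
boot_d2 <- function(counts, R) {
  d2 <- replicate(R, {
    # Resample individuals with replacement from the observed proportions
    resampled <- as.vector(stats::rmultinom(1, size = sum(counts), prob = counts))
    p <- resampled[resampled > 0] / sum(resampled)
    1 / sum(p^2)
  })
  c(D2 = mean(d2), sd.D2 = stats::sd(d2))
}

# Made-up counts for three samples
samples <- list(s1 = c(10, 10, 10), s2 = c(1, 1, 28), s3 = c(5, 15, 10))

cl <- makeCluster(2)           # PSOCK cluster: works on Linux and Windows
clusterSetRNGStream(cl, 123)   # independent random streams per worker
res <- tryCatch(
  parLapply(cl, samples, boot_d2, R = 20),
  finally = stopCluster(cl)    # always release the workers
)

# One row per sample, columns D2 and sd.D2
div <- do.call(rbind, res)
```

A socket cluster costs more to start than forking, but it is the variant that behaves the same on both operating systems, which is why the sketch uses it.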

rprops commented 7 years ago

@FMKerckhof Done for the Diversity_16S function. It takes around 5 to 10 s per sample with R = 100 and ncores = 20 (the run below averages about 10 s per sample: 805 s elapsed for 79 samples). Tested on:

> system.time(Diversity_16S(x, parallel = TRUE, ncores = 20, R = 100))
        **WARNING** this functions assumes that rows are samples and columns
        are taxa in your phyloseq object, please verify.
Thu Apr 20 05:47:00 2017        Using 20 cores for calculations
...
Thu Apr 20 06:00:21 2017        Closing connection to cores
Thu Apr 20 06:00:21 2017        Done with all 79 samples
   user  system elapsed
 63.252   7.844 804.937

rprops commented 7 years ago

Phenotypic diversity calculations can now also be parallelized. Will close this issue.