skembel / picante

R tools for integrating phylogenies and ecology
33 stars 17 forks source link

Modifications to pd() function to improve speed #7

Closed mmuscarella closed 7 years ago

mmuscarella commented 8 years ago

I modified the pd function and drastically increased speed. Improvements are based on replacing values in a vector instead of concatenating inside a loop, and I reduced the number of times node.age() was run by putting a global node.age function outside of the loop. Performance increases are shown below:

> # pd = original code; pd2 = modified code
> 
> # Phylomom Data via picante
> dim(phylocom$sample)
[1]  6 25
> 
> microbenchmark(pd(phylocom$sample, phylocom$phylo, include.root = F), 
+                pd2(phylocom$sample, phylocom$phylo, include.root = F), times = 3)
Unit: milliseconds
                                                   expr      min       lq     mean   median
  pd(phylocom$sample, phylocom$phylo, include.root = F) 3.486191 3.549235 3.642661 3.612280
 pd2(phylocom$sample, phylocom$phylo, include.root = F) 4.365821 4.527785 4.898558 4.689748
       uq      max neval
 3.720897 3.829513     3
 5.164926 5.640104     3
> 
> microbenchmark(pd(phylocom$sample, phylocom$phylo, include.root = T), 
+                pd2(phylocom$sample, phylocom$phylo, include.root = T), times = 3)
Unit: milliseconds
                                                   expr      min        lq      mean
  pd(phylocom$sample, phylocom$phylo, include.root = T) 10.48036 10.513766 10.639581
 pd2(phylocom$sample, phylocom$phylo, include.root = T)  6.07949  6.163944  6.301157
    median       uq       max neval
 10.547176 10.71919 10.891209     3
  6.248397  6.41199  6.575584     3
> 
> # OTU table and phylogeny rarefied to 3,000 obs per site
> dim(OTU.2)
[1]    58 17314
> 
> microbenchmark(pd(OTU.2, OTU.tre.2, include.root = F), 
+                pd2(OTU.2, OTU.tre.2, include.root = F), times = 3)
Unit: seconds
                                    expr      min       lq     mean   median       uq
  pd(OTU.2, OTU.tre.2, include.root = F) 33.23346 33.28569 33.71187 33.33791 33.95107
 pd2(OTU.2, OTU.tre.2, include.root = F) 51.13831 51.23522 51.27740 51.33214 51.34694
      max neval
 34.56423     3
 51.36174     3
> 
> 
> microbenchmark(pd(OTU.2, OTU.tre.2, include.root = T), 
+                pd2(OTU.2, OTU.tre.2, include.root = T), times = 3)
Unit: seconds
                                    expr        min         lq       mean    median
  pd(OTU.2, OTU.tre.2, include.root = T) 1102.04643 1109.72308 1116.85816 1117.3997
 pd2(OTU.2, OTU.tre.2, include.root = T)   54.08965   56.48848   58.33705   58.8873
         uq       max neval
 1124.26402 1131.1283     3
   60.46075   62.0342     3
mmuscarella commented 8 years ago

I identified one issue that I caused because I used node.age in the beginning of the function. I now put it inside an if statement so that it only runs when include.root == TRUE and the tree is rooted.

> microbenchmark(pd(phylocom$sample, phylocom$phylo, include.root = F), 
+                pd2(phylocom$sample, phylocom$phylo, include.root = F), times = 3)
Unit: milliseconds
                                                   expr      min       lq     mean   median
  pd(phylocom$sample, phylocom$phylo, include.root = F) 3.912658 3.962499 4.064210 4.012341
 pd2(phylocom$sample, phylocom$phylo, include.root = F) 3.669715 3.730180 4.435009 3.790644
       uq      max neval
 4.139987 4.267632     3
 4.817657 5.844669     3
> 
> microbenchmark(pd(phylocom$sample, phylocom$phylo, include.root = T), 
+                pd2(phylocom$sample, phylocom$phylo, include.root = T), times = 3)
Unit: milliseconds
                                                   expr       min        lq      mean
  pd(phylocom$sample, phylocom$phylo, include.root = T) 10.193854 10.223595 11.001160
 pd2(phylocom$sample, phylocom$phylo, include.root = T)  5.891472  5.945265  6.284063
    median        uq       max neval
 10.253336 11.404812 12.556289     3
  5.999057  6.480358  6.961659     3