lamho86 / phylolm

GNU General Public License v2.0

Implement parallel bootstrap to lm (#10). #11

Closed · Ax3man closed this 6 years ago

Ax3man commented 6 years ago

Following up on issue #10, I finally had some time to implement this.

Could you please have a look at whether you think this is worthwhile, and whether you agree with the implementation? Specifically:

If you think this is useful, I can do the same for phyloglm.

It was mostly straightforward, using a function with lapply rather than a for loop. Additionally, minus2llh now takes y as an argument, since otherwise R's scoping rules dictate the use of the value of y from the environment where that function was created (during the normal model fitting).
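To illustrate the scoping issue, here's a minimal toy sketch (illustrative names, not the actual phylolm internals): a closure looks up free variables in the environment where it was created, so a bootstrap y that exists only inside the lapply function is invisible to the likelihood function.

fit <- function() {
  y <- rnorm(5)                                   # original response
  minus2llh <- function(pars) sum((y - pars)^2)   # free `y` resolves in fit()'s frame
  lapply(1:3, function(i) {
    y <- rnorm(5)   # bootstrap replicate, local to this anonymous function
    minus2llh(0)    # still sees fit()'s original y, not the replicate
  })
}

Passing y as an explicit argument avoids this.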

Results between this and the original function remain the same:

# create sample data for tests (based on the example in ?phylolm)
library(ape)      # rcoal, for simulating the trees
library(phylolm)  # rTrait and phylolm itself
set.seed(123456)
b0 = 0; b1 = 1

tre_small = rcoal(50)
taxa_small = sort(tre_small$tip.label)
x_small <- rTrait(n=1, phy=tre_small,model="BM",
                  parameters=list(ancestral.state=0,sigma2=10))
y_small <- b0 + b1*x_small + 
  rTrait(n=1,phy=tre_small,model="lambda",parameters=list(
    ancestral.state=0,sigma2=1,lambda=0.5))
dat_small = data.frame(trait=y_small[taxa_small],pred=x_small[taxa_small])

tre_large = rcoal(500)
taxa_large = sort(tre_large$tip.label)
x_large <- rTrait(n=1, phy=tre_large,model="BM",
                  parameters=list(ancestral.state=0,sigma2=10))
y_large <- b0 + b1*x_large + 
  rTrait(n=1,phy=tre_large,model="lambda",parameters=list(
    ancestral.state=0,sigma2=1,lambda=0.5))
dat_large = data.frame(trait=y_large[taxa_large],pred=x_large[taxa_large])

## check results
library(dplyr)
library(tidyr)
library(ggplot2)
# phylolm_current is the previous (for-loop) version of phylolm, loaded separately for comparison
fit_old <- phylolm_current(trait~pred, data = dat_small, phy = tre_small, model = "lambda", boot = 1000)
fit_new <- phylolm(trait~pred, data = dat_small, phy = tre_small, model = "lambda", boot = 1000)
result <- bind_rows(old = as.data.frame(fit_old$bootstrap), new = as.data.frame(fit_new$bootstrap), 
                    .id = 'version')
ggplot(gather(result, 'parameter', 'value', -version), aes(version, value)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75), fill = 1, alpha = 0.1) + 
  facet_wrap(~parameter, scales = 'free_y', nrow = 1) +
  theme_minimal()

(Figure: result_comparison, violin plots of the bootstrap distributions for each parameter, old vs. new version)

For larger bootstrap samples, and for larger trees in general, parallelization yields a substantial performance benefit (as expected):

library(microbenchmark)
mb <- microbenchmark(
  old_small_100 = phylolm_current(trait~pred,data=dat_small,phy=tre_small,model="lambda", boot = 100),
  new_small_100 = phylolm(trait~pred,data=dat_small,phy=tre_small,model="lambda", boot = 100),
  par_small_100 = phylolm(trait~pred,data=dat_small,phy=tre_small,model="lambda", boot = 100, parallel = "SOCK"),

  new_small_1000 = phylolm(trait~pred,data=dat_small,phy=tre_small,model="lambda", boot = 1000),
  par_small_1000 = phylolm(trait~pred,data=dat_small,phy=tre_small,model="lambda", boot = 1000, parallel = "SOCK"),

  new_small_10000 = phylolm(trait~pred,data=dat_small,phy=tre_small,model="lambda", boot = 10000),
  par_small_10000 = phylolm(trait~pred,data=dat_small,phy=tre_small,model="lambda", boot = 10000, parallel = "SOCK"),

  new_large_100 = phylolm(trait~pred,data=dat_large,phy=tre_large,model="lambda", boot = 100),
  par_large_100 = phylolm(trait~pred,data=dat_large,phy=tre_large,model="lambda", boot = 100, parallel = "SOCK"),

  new_large_1000 = phylolm(trait~pred,data=dat_large,phy=tre_large,model="lambda", boot = 1000),
  par_large_1000 = phylolm(trait~pred,data=dat_large,phy=tre_large,model="lambda", boot = 1000, parallel = "SOCK"),

  new_large_10000 = phylolm(trait~pred,data=dat_large,phy=tre_large,model="lambda", boot = 10000),
  par_large_10000 = phylolm(trait~pred,data=dat_large,phy=tre_large,model="lambda", boot = 10000, parallel = "SOCK"),
  times = 5
)
summary(mb) %>% 
  separate(expr, c('version', 'tree', 'boot_sample')) %>% 
  ggplot(aes(factor(boot_sample), median, ymin = min, ymax = max, color = version)) +
  geom_pointrange(position = position_dodge(0.3), size = 2/3) +
  geom_line(aes(group = version), position = position_dodge(0.3)) +
  scale_y_log10() +
  labs(title = 'runtime comparison', x = 'size of bootstrap sample', color = 'code version',
       y = 'time (seconds)') +
  facet_wrap(~tree) +
  theme_minimal()

(Figure: runtime, benchmark timings by bootstrap sample size, tree size, and code version)

par denotes the parallel runs (7 cores) and new the non-parallel runs. The right panel is the small tree (50 species), the left panel the larger tree (500 species).

(I included a run of the old for-loop code to demonstrate that its performance matches the non-parallel default of the new function.)

lamho86 commented 6 years ago

Hi Wouter,

I think this is very useful. I have merged your commit and added you as a contributor. It would be great if you could do the same thing for phyloglm.

Best, Lam

HenrikBengtsson commented 6 years ago

May I tease you with the following alternative (disclaimer: I'm the author of future, future.apply, ...):

    bootmatrix <- future.apply::future_lapply(as.data.frame(booty), boot_model)
    bootmatrix <- do.call(rbind, bootmatrix)

i.e. replace lapply() with future_lapply(). Global variables should be taken care of automatically.
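For example, here is a self-contained toy illustrating the automatic handling of globals (illustrative code, not the phylolm internals):

library(future.apply)
plan(multisession, workers = 2)

a <- 10
res <- future_lapply(1:4, function(i) i + a)  # `a` is identified and exported to the workers automatically
unlist(res)  # 11 12 13 14

plan(sequential)  # back to sequential processing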

Note also that the user doesn't have to pass a parallel argument; instead, users choose what kind of backend they wish to use (the default is sequential). For instance,

library(future)
plan(multiprocess)

will use a PSOCK cluster on Windows and forked processes on Linux/macOS (utilizing the parallel package). If the user has access to more machines, they can use:

plan(cluster, workers = c("n1", "n2", "n2", "n2"))

or an HPC scheduler:

plan(future.batchtools::batchtools_sge)

Ax3man commented 6 years ago

Sure, it's easy to do it that way instead. And if external dependencies are a concern, future.apply can go in Suggests rather than Imports.
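As a rough sketch of that optional-dependency pattern (hypothetical wrapper, not the actual implementation):

boot_apply <- function(X, FUN) {
  # use future.apply when it is installed, otherwise fall back to plain lapply()
  if (requireNamespace("future.apply", quietly = TRUE)) {
    future.apply::future_lapply(X, FUN)
  } else {
    lapply(X, FUN)
  }
}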

I can change the implementation to future, depending on what you prefer, @lamho86 & @cecileane.

lamho86 commented 6 years ago

@Ax3man I don't have a strong preference regarding this. It's up to you.