Can find the stan model here, too: https://github.com/rBatt/trawl/blob/stan/trawlDiversity/inst/stan/msomStatic.stan
But also in the files linked in original post of this issue
For reference,
ebs_msom <- stan(
  file=model_file,
  data=staticData,
  control=list(stepsize=0.05, adapt_delta=0.95),
  chains=4, iter=50, refresh=1, seed=1337, cores=4
)
ran in about 57 seconds on my laptop, with a few warnings (e.g., exceeded max tree depth).
I think the time to beat here is 1 minute. As it stands, the model takes about 1 second per iteration per chain (effectively about 4 iterations per second, since the 4 chains run in parallel).
It'd be nice to see this get down to 1 second or so; getting into the 20-second range would be great, though. Let's see what we can do!
The same code as above ran in 33.8 seconds on my home computer. I think the difference is a RAM issue; my laptop was probably using swap.
Made change:
The if () statements guarding against entering a loop when K = 0 aren't needed; in Stan, for (k in 1:0) { ... } skips the loop body anyway.
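A minimal sketch of the point (not the actual msomStatic.stan code; nK and the placeholder body are illustrative):

```stan
data {
  int<lower=0> nK;   // number of samples; may be 0
}
parameters {
  real mu;
}
model {
  mu ~ normal(0, 1);
  // before: if (nK > 0) { for (k in 1:nK) { ... } }
  // after: the guard is unnecessary; the range 1:nK is empty when nK == 0,
  // so the loop body simply never executes
  for (k in 1:nK) {
    target += normal_lpdf(0 | mu, 1);   // stand-in for the real per-sample terms
  }
}
```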
Time on home system: 34.4 sec (second run 34.0 sec)
Made change:
Move 1 ~ bernoulli(Omega[t]) out a few loop levels: rather than repeating the statement for every species and sample, increment the log probability once, scaled by the number of repeats (something like N * sum(nK[t]) * bernoulli_lpmf(1 | Omega[t]) might work).
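A minimal sketch of that idea (names like n_rep are illustrative; the real model indexes Omega by year and loops over species and sites):

```stan
data {
  int<lower=1> T;            // years
  int<lower=0> n_rep[T];     // number of identical 1 ~ bernoulli(Omega[t]) terms in year t
}
parameters {
  vector<lower=0, upper=1>[T] Omega;
}
model {
  Omega ~ beta(1, 1);
  for (t in 1:T) {
    // equivalent to: for (i in 1:n_rep[t]) 1 ~ bernoulli(Omega[t]);
    target += n_rep[t] * bernoulli_lpmf(1 | Omega[t]);
  }
}
```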
Time on home system: 33.4 sec (rerun in 34.3 sec)
> cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom)))
           warmup   sample
chain:1 14.796288 19.58026 34.37655
chain:2  6.187283 15.94436 22.13164
chain:3 10.091355 10.51002 20.60138
chain:4  6.839183 12.13200 18.97118
Effective sample size:
> mean(summary(ebs_msom)$summary[,"n_eff"])
[1] 31.6848
> sd(summary(ebs_msom)$summary[,"n_eff"])
[1] 36.25296
33.4 sec per 31.7 samples, or 1.05 seconds per sample
Vectorize over alpha and beta
It didn't require a transpose; I just set the prior for one row_vector at a time (vectorizing over species).
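A sketch of what that looks like (U, S, and the hyperparameter names are illustrative, not necessarily those in msomStatic.stan):

```stan
data {
  int<lower=1> U;   // number of covariates
  int<lower=1> S;   // number of species
}
parameters {
  matrix[U, S] alpha;            // one row per covariate, one column per species
  vector[U] alpha_mu;
  vector<lower=0>[U] alpha_sd;
}
model {
  alpha_mu ~ normal(0, 1);       // placeholder hyperpriors
  alpha_sd ~ normal(0, 1);
  // instead of: for (u in 1:U) for (s in 1:S) alpha[u, s] ~ normal(alpha_mu[u], alpha_sd[u]);
  for (u in 1:U)
    alpha[u] ~ normal(alpha_mu[u], alpha_sd[u]);   // alpha[u] is a row_vector over species
}
```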
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
           warmup    sample
chain:1 14.508140 17.141421 31.64956
chain:2 10.410128 17.737952 28.14808
chain:3 10.665738 19.064648 29.73039
chain:4  6.256854  6.462953 12.71981
> (max_time <- max(timing))
[1] 31.64956
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 35.43705
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 35.58054
> max_time/neff_mu
[1] 0.8931207
31.6 sec
0.89 sec per sample
Using cauchy(0,1) for BOTH mu and sd decreased performance (run time went down by ~1 sec, but seconds per sample increased threefold). So I tried a cauchy(0,2) on the sd and a cauchy(0,1) on the mean, and things improved slightly:
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
           warmup    sample
chain:1 13.369595 19.504919 32.87451
chain:2  6.808201  7.645603 14.45380
chain:3  6.479953 15.585501 22.06545
chain:4  6.024354  8.072957 14.09731
> (max_time <- max(timing))
[1] 32.87451
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 37.9292
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 32.75176
> max_time/neff_mu
[1] 0.8667337
32.8 sec
0.87 sec per sample
Note: I've moved several comments from this Issue to #100
I made Psi and Theta local variables. This improved n_eff, but not runtime; overall, the time per effective sample roughly halved.
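A sketch of the local-variable change, in a toy occupancy-style layout rather than the actual msomStatic.stan structure: declaring Psi in the model block instead of transformed parameters means it is recomputed but never stored in the output, so memory and I/O per iteration drop.

```stan
data {
  int<lower=1> J;                    // sites
  int<lower=1> S;                    // species
  int<lower=1> U;                    // covariates (including intercept)
  matrix[J, U] X;                    // site covariates
  int<lower=0, upper=1> Y[S, J];     // toy detection data
}
parameters {
  matrix[S, U] alpha;
}
model {
  // local variable: recomputed on every gradient evaluation, but never tracked or saved
  matrix[S, J] logit_Psi;
  logit_Psi = alpha * X';
  to_vector(alpha) ~ normal(0, 1);
  for (s in 1:S)
    Y[s] ~ bernoulli_logit(logit_Psi[s]);
}
```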
> # =========================
> # = Timing and Efficiency =
> # =========================
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
warmup sample
chain:1 12.543997 19.05509 31.59908
chain:2 7.660934 17.34197 25.00291
chain:3 4.994482 11.18956 16.18404
chain:4 6.681883 10.12041 16.80229
> (max_time <- max(timing))
[1] 31.59908
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 60.41214
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 41.35934
> max_time/neff_mu
[1] 0.5230585
31.6 sec; 0.52 sec per sample
Again, it didn't need a transpose, but I'm now able to still use matrix algebra for calculating each psi and theta. See the commit message above. In short, I adapt the size of the covariate matrix to the number of samples (K) for that site-year, so I'm not multiplying out a bunch of empty values and doing expensive computations to create unused/empty elements of Psi and Theta. Pretty great compromise here.
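A rough sketch of that compromise under simplified assumptions (single year, detection only; V, nK, K_max, and beta are illustrative names, and every site is assumed to have at least one sample): only the first nK[j] rows of the padded covariate matrix get multiplied out.

```stan
data {
  int<lower=1> J;                        // site-years
  int<lower=1> U;                        // detection covariates
  int<lower=1> K_max;                    // padded max number of samples
  int<lower=1, upper=K_max> nK[J];       // samples actually taken at each site-year
  matrix[K_max, U] V[J];                 // padded covariate matrices
  int<lower=0, upper=1> Y[J, K_max];     // padded detection histories
}
parameters {
  vector[U] beta;
}
model {
  beta ~ normal(0, 1);
  for (j in 1:J) {
    // use only the rows that correspond to real samples, so no wasted algebra
    vector[nK[j]] logit_theta;
    logit_theta = block(V[j], 1, 1, nK[j], U) * beta;
    Y[j, 1:nK[j]] ~ bernoulli_logit(logit_theta);
  }
}
```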
> # =========================
> # = Timing and Efficiency =
> # =========================
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
warmup sample
chain:1 11.553115 17.458463 29.01158
chain:2 7.164168 16.078000 23.24217
chain:3 4.595384 10.303367 14.89875
chain:4 6.261785 9.526897 15.78868
> (max_time <- max(timing))
[1] 29.01158
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 60.41214
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 41.35934
> max_time/neff_mu
[1] 0.4802276
29.0 sec; 0.48 sec per sample
9271c9d0f31f3f049f5c8225c6e41ff4b0fa39cb
Tried vectorizing the lp for observed and unobserved species (excludes never-observed species). It was a little faster before I tried fiddling with the functions (I think seconds per effective sample was around 0.56, and total time around 21 sec).
This run takes ~70% of the previous time, but is slightly worse per effective sample.
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
warmup sample
chain:1 10.782976 12.161932 22.944908
chain:2 7.041526 4.151737 11.193263
chain:3 2.299043 3.452396 5.751439
chain:4 1.762490 2.797118 4.559608
> (max_time <- max(timing))
[1] 22.94491
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 31.77707
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 36.31423
> max_time/neff_mu
[1] 0.7220587
22.9 sec; 0.72 sec per sample
I used non-centered distributions for alpha and beta. This means I can do alpha_raw ~ normal(0, 1), then use alpha <- alpha_mu + alpha_sd*alpha_raw wherever I had previously used alpha. I was very surprised and pleased with the speed-up.
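A stripped-down sketch of the non-centered parameterization for alpha (names as in the earlier sketches; beta would be handled the same way, and the cauchy hyperpriors follow the earlier comment):

```stan
data {
  int<lower=1> U;   // covariates
  int<lower=1> S;   // species
}
parameters {
  matrix[U, S] alpha_raw;           // standard-normal "raw" parameters
  vector[U] alpha_mu;
  vector<lower=0>[U] alpha_sd;
}
transformed parameters {
  matrix[U, S] alpha;
  for (u in 1:U)
    alpha[u] = alpha_mu[u] + alpha_sd[u] * alpha_raw[u];   // shift and scale
}
model {
  to_vector(alpha_raw) ~ normal(0, 1);   // replaces alpha[u] ~ normal(alpha_mu[u], alpha_sd[u])
  alpha_mu ~ cauchy(0, 1);
  alpha_sd ~ cauchy(0, 2);
  // ... alpha is then used in the likelihood exactly as before ...
}
```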
Oh, I also decreased step size to 0.01. This helped a bit, too.
> # =========================
> # = Timing and Efficiency =
> # =========================
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
warmup sample
chain:1 0.698877 0.676139 1.375016
chain:2 1.025372 1.382666 2.408038
chain:3 3.863502 1.006282 4.869784
chain:4 2.263946 1.204742 3.468688
> (max_time <- max(timing))
[1] 4.869784
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 66.93933
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 37.45098
> max_time/neff_mu
[1] 0.07274922
4.9 sec; 0.07 sec per sample
Note: I think the timings are being slowed down by the heavy use of my processor as I run the full-EBS model (it's taking ~1.5 days so far, and I want to let it finish). When I run this non-centered version with only 1 chain and 1 core, it takes ~1 second.
Making Omega hierarchical was really slow (roughly an order of magnitude slower), because now there are hyperparameters to fit, too. The only difference is that the hierarchical version doesn't fix the hyperdistribution's parameters; they are estimated instead.
I noticed that I made an error in my most recent version of the model: after aggregating the presence process to the site level, I was still incrementing log_inv_logit(logit_psi) for each sample, instead of just once per site.
I aggregated the covariates among samples within a site-year, even for the detection covariates, and then modelled the covariates as random variables using the sample means and standard deviations. This made it easy to fix the errors I had before (statistical, not programming errors), where I was iterating over the term lil_lp (which is log_inv_logit(logit_psi)) for each sample ... yikes! Presence is only defined at the site-year level in this model (the error came from forgetting to update this formulation from back when presence was modelled down to the sample level).
All said, this was a huge speed-up.
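For reference, a minimal single-species sketch of the site-level likelihood I'm describing (illustrative names; in the real model logit_psi and logit_theta come from the covariates and species-level parameters): the presence term enters once per site-year, and the K samples collapse into one binomial detection term.

```stan
data {
  int<lower=1> J;                  // site-years
  int<lower=1> nK[J];              // samples (hauls) per site-year
  int<lower=0> y[J];               // detections per site-year, 0 <= y[j] <= nK[j]
}
parameters {
  vector[J] logit_psi;             // presence, on the logit scale
  vector[J] logit_theta;           // detection, on the logit scale
}
model {
  logit_psi ~ normal(0, 1);
  logit_theta ~ normal(0, 1);
  for (j in 1:J) {
    if (y[j] > 0) {
      // detected at least once: certainly present; log Pr(present) added once per site-year
      target += log_inv_logit(logit_psi[j])
                + binomial_logit_lpmf(y[j] | nK[j], logit_theta[j]);
    } else {
      // never detected: marginalize over present-but-missed vs. absent
      target += log_sum_exp(
        log_inv_logit(logit_psi[j]) + binomial_logit_lpmf(0 | nK[j], logit_theta[j]),
        log1m_inv_logit(logit_psi[j]));
    }
  }
}
```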
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
warmup sample
chain:1 0.272825 0.198744 0.471569
chain:2 0.401100 0.259313 0.660413
chain:3 1.533286 0.451326 1.984612
chain:4 0.296010 0.412060 0.708070
> (max_time <- max(timing))
[1] 1.984612
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 77.53699
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 33.56467
> max_time/neff_mu
[1] 0.02559568
1.98 sec; 0.03 sec per sample
Note: Like in the previous timing, I think this set of stats might be misrepresenting the actual speed of this model. If I run 1 chain on 1 core, it finishes in 0.48 seconds. Also, I suspect this model will scale really well, especially for large numbers of species (the likelihoods are vectorized over species) or large numbers of samples (K; detection uses a binomial, so it handles larger K very efficiently, and increasing the number of hauls is almost trivial). I think this particular model will scale most poorly with sites (J; some of the matrix algebra involves matrices with J rows, and I also have to loop over J, so that could get slow if J gets huge), and it should scale roughly linearly with additional years (T; the slope of that line will probably be less than 1).
OK, I think the efficiency issue has been addressed. Everything left to play with now is basically just model structure, which could affect the speed, but that is outside the scope of this issue. I think the low-hanging fruit of optimization has been harvested.
https://groups.google.com/forum/?utm_source=digest&utm_medium=email#!topic/stan-users/xMDiFDQB0SU
@mtingley @mpinsky At some point it might be worth taking a look at my first draft of the new model. There are a few things that I'm not happy with yet, and I still haven't implemented the more dynamic version (colonization, persistence) that Morgan and I talked about a while ago. I was torn about whether or not to keep years separate. Also, the interpretation of "K" has changed a bit.
Once the code runs faster (if it does), it'll be easier to experiment.
The overarching to-do's for efficiency:
- Make logit_psi and logit_theta local variables, and don't track them all the way through.
- Vectorize the alpha[u,s] ~ normal( ... ) statements (and similar for beta); requires a transpose.
- Use matrix algebra for the logit_psi[t,j,k] <- U[t,j,k]*alpha; type statements. See the stan-users thread linked above.