rBatt / trawl

Analysis of scientific trawl surveys of bottom-dwelling marine organisms

Improve Stan MSOM Efficiency #99

Closed rBatt closed 8 years ago

rBatt commented 8 years ago

https://groups.google.com/forum/?utm_source=digest&utm_medium=email#!topic/stan-users/xMDiFDQB0SU

@mtingley @mpinsky At some point it might be worth taking a look at my first draft of the new model. There are a few things that I'm not happy with yet, and I still haven't implemented the more dynamic version (colonization, persistence) that Morgan and I talked about a while ago. I was torn about whether to keep years separate or not. Also, the interpretation of "K" has changed a bit.

Once the code runs faster (if it does), it'll be easier to experiment.

The overarching to-do's for efficiency:

    for (t in 1:nT) {
        for (j in 1:Jmax) {
            int K;
            K <- nK[t,j];
            if (K>0) { // if K is not 0
                // if the K matrix gets too large, 
                // it is in here that I should add
                // a loop, wherein I subset to U[t,j,k], and
                // vectorize over spp only
                for (k in 1:K) {
                    logit_psi[t,j,k] <- U[t,j,k]*alpha;
                    logit_theta[t,j,k] <- V[t,j,k]*beta;
                }
            }
        }
    }

As you can see from my comments, I changed my mind and decided to do the looping. Originally I wanted U[t,j] * alpha;. The downside is that, with the potential for many missing K, I wasn't sure the matrix algebra would be a net savings.

If I wanted to go back to the matrix algebra, I suppose I could do what you suggest, then also transpose the matrix portion of U and V (currently declared as matrix[Kmax,nU] U[nT,Jmax], switch to matrix[nU,Kmax] U[nT,Jmax]), then switch to alpha * U[t,j]. I'd have to check to make sure that would work out.

  • [x] Don't need the if () statements to avoid going into loop when K = 0; for (k in 1:0) { will skip the loop anyway
  • [x] Move 1 ~ Bernoulli(Omega[t]) out a few levels: maybe 1 ~ bernoulli(Omega[t]) * N * sum(nK[t]) will work
  • [x] Try less fat-tailed priors; maybe cauchy(0, 1) instead of cauchy(0, 2.5).
  • [x] See if lp_obs(X[t,j,k,n], logit_psi[t,j,k,n], logit_theta[t,j,k,n]) can be vectorized. See:

I think I'm running into at least 2 potential problems vectorizing these pieces. First, the lp_obs function can only be applied to the first N elements of that fourth dimension, i.e., to X[t,j,k, 1:N] ... which is R syntax; I'm not sure how to express the same thing in Stan. On page 353 of the manual I see the vector and array slicing operations, but I didn't understand them right away. Taking a second look, I see that maybe segment(X[t,j,k], 1, N) would give the desired result for vectorization. I've read elsewhere that in many cases the copy induced by segment is a small performance loss compared to the gain from vectorizing, so I see how this could be handy.
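If that reading is right, the call might end up looking something like this (just a sketch, assuming lp_obs can be given a signature that accepts vectors; not tested):

    // sketch: hand the first N elements of the fourth dimension to a
    // vectorized lp_obs, then add the summed log-probability once
    increment_log_prob(sum(lp_obs(segment(X[t,j,k], 1, N),
                                  segment(logit_psi[t,j,k], 1, N),
                                  segment(logit_theta[t,j,k], 1, N))));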

Second, this entire process could be greatly simplified if I could condense the lp_obs and lp_unobs functions into 1. These are your functions, which I simplified to my needs. After simplification, I didn't see how to easily condense them into 1 function. It's because of the log1m_inv_logit(logit_psi) term in the lp_unobs function: I have to cancel it out if x is 1, and I can't think of a way to do that, unless something like value * (x==0) can take advantage of TRUE being 1 and FALSE being 0 and then work in algebra (like in R). Any ideas?
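One way I can picture condensing them is to put the branch inside a single function instead of multiplying by an indicator, keeping the same three arguments as above (a rough sketch; this only merges the two existing branches, it doesn't change what either one computes):

    real lp_x(int x, real logit_psi, real logit_theta) {
        if (x > 0) // detected, so the species must be present
            return log_inv_logit(logit_psi)
                + bernoulli_logit_log(x, logit_theta);
        else // not detected: either present-but-missed, or absent
            return log_sum_exp(
                log_inv_logit(logit_psi) + bernoulli_logit_log(0, logit_theta),
                log1m_inv_logit(logit_psi));
    }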

rBatt commented 8 years ago

Can find the stan model here, too: https://github.com/rBatt/trawl/blob/stan/trawlDiversity/inst/stan/msomStatic.stan

It's also in the files linked in the original post of this issue.

rBatt commented 8 years ago

For reference,

ebs_msom <- stan(
    file=model_file, 
    data=staticData, 
    control=list(stepsize=0.05, adapt_delta=0.95),
    chains=4, iter=50, refresh=1, seed=1337, cores=4
)

ran in about 57 seconds on my laptop, with a few warnings (e.g., max tree depth exceeded).

I think the time to beat here is 1 minute. As it stands, the model takes about 1 second per iteration per chain (though with 4 chains running in parallel, that's really about 4 iterations per second of wall time).

It'd be nice to see this get down to 1 second or so. Getting in the 20 second range would be great though. Let's see what we can do!

rBatt commented 8 years ago

The same code as above ran in 33.8 seconds on my home computer.

I think the difference is a RAM issue; my laptop was probably using swap.

rBatt commented 8 years ago

Made change:

Don't need the if () statements to avoid going into loop when K = 0; for (k in 1:0) { will skip the loop anyway
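In other words, the original block simplifies to roughly this:

    for (t in 1:nT) {
        for (j in 1:Jmax) {
            // when nK[t,j] is 0, the range 1:0 is empty and the body is skipped
            for (k in 1:nK[t,j]) {
                logit_psi[t,j,k] <- U[t,j,k]*alpha;
                logit_theta[t,j,k] <- V[t,j,k]*beta;
            }
        }
    }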

Time on home system: 34.4 sec (second run 34.0 sec)

rBatt commented 8 years ago

Made change:

Move 1 ~ Bernoulli(Omega[t]) out a few levels: maybe 1 ~ bernoulli(Omega[t]) * N * sum(nK[t]) will work
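Concretely, the idea is to increment the same log-probability once, scaled by the number of times the sampling statement used to run (a sketch; the committed code may differ):

    // one increment per year t, instead of repeating 1 ~ bernoulli(Omega[t])
    // inside the species/site/sample loops
    increment_log_prob(bernoulli_log(1, Omega[t]) * N * sum(nK[t]));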

Time on home system: 33.4 sec (rerun in 34.3 sec)

> cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom)))
           warmup   sample         
chain:1 14.796288 19.58026 34.37655
chain:2  6.187283 15.94436 22.13164
chain:3 10.091355 10.51002 20.60138
chain:4  6.839183 12.13200 18.97118

Effective sample size:

> mean(summary(ebs_msom)$summary[,"n_eff"])
[1] 31.6848
> sd(summary(ebs_msom)$summary[,"n_eff"])
[1] 36.25296

Summary

33.4 sec per 31.7 effective samples, or 1.05 seconds per effective sample

rBatt commented 8 years ago

Vectorize over alpha and beta

It didn't require a transpose. I just set the prior for one row_vector at a time (vectorizing over species).
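Roughly (names and shapes assumed, e.g. alpha as matrix[nU,nS] with species in the columns):

    for (u in 1:nU) {
        // one row_vector at a time: vectorizes the prior over species
        alpha[u] ~ normal(alpha_mu[u], alpha_sd[u]); // and similarly for beta
    }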

> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
           warmup    sample         
chain:1 14.508140 17.141421 31.64956
chain:2 10.410128 17.737952 28.14808
chain:3 10.665738 19.064648 29.73039
chain:4  6.256854  6.462953 12.71981
> (max_time <- max(timing))
[1] 31.64956
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 35.43705
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 35.58054
> max_time/neff_mu
[1] 0.8931207

31.6 sec

0.89 sec per sample

rBatt commented 8 years ago

Doing cauchy(0, 1) for BOTH the mu and sd hyperpriors decreased performance (run time went down by ~1 sec, but seconds per sample increased threefold). So I tried cauchy(0, 2) on the sd and cauchy(0, 1) on the mean, and saw a small improvement (timing below).
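The priors in question look roughly like this (hyperparameter names assumed):

    alpha_mu ~ cauchy(0, 1); // tighter prior on the means
    alpha_sd ~ cauchy(0, 2); // keep a somewhat fatter tail on the scales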

> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
           warmup    sample         
chain:1 13.369595 19.504919 32.87451
chain:2  6.808201  7.645603 14.45380
chain:3  6.479953 15.585501 22.06545
chain:4  6.024354  8.072957 14.09731
> (max_time <- max(timing))
[1] 32.87451
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 37.9292
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 32.75176
> max_time/neff_mu
[1] 0.8667337

32.8 sec

0.87 sec per sample

rBatt commented 8 years ago

Note: I've moved several comments from this Issue to #100

rBatt commented 8 years ago

I made Psi and Theta local variables. This improved n_eff but not runtime; overall efficiency improved by about double.
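The change is basically to declare them in a local block inside model instead of in transformed parameters, so the big arrays are never written to the output (a sketch with assumed names and sizes):

    model {
        // ... priors ...
        {
            // local copies of Psi/Theta: recomputed each leapfrog step but
            // never stored, which cuts memory and I/O without changing the posterior
            matrix[Kmax,nS] Psi[nT,Jmax];
            matrix[Kmax,nS] Theta[nT,Jmax];
            // ... fill these and increment the likelihood as before ...
        }
    }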

> # =========================
> # = Timing and Efficiency =
> # =========================
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
           warmup   sample         
chain:1 12.543997 19.05509 31.59908
chain:2  7.660934 17.34197 25.00291
chain:3  4.994482 11.18956 16.18404
chain:4  6.681883 10.12041 16.80229
> (max_time <- max(timing))
[1] 31.59908
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 60.41214
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 41.35934
> max_time/neff_mu
[1] 0.5230585

31.6 sec

0.52 sec per sample

rBatt commented 8 years ago

Again, I didn't need a transpose, but I am now able to still use matrix algebra for calculating each psi and theta. See above commit message. In short, I adapt the size of the covariate matrix to the number of K for that site-year, so I'm not multiplying out a bunch of empty values and doing a bunch of expensive computation to create unused/empty elements of Psi and Theta. Pretty great compromise here.
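Roughly, the per-site-year multiply now only touches the first nK[t,j] rows of the covariate matrix, e.g. via block() (a sketch; names and the exact indexing differ in the commit):

    if (nK[t,j] > 0) {
        // only the first nK[t,j] rows of U[t,j] (a matrix[Kmax,nU]) enter the
        // product, so nothing is computed for the empty samples
        matrix[nK[t,j],nS] psi_tj;
        psi_tj <- block(U[t,j], 1, 1, nK[t,j], nU) * alpha; // [nK x nU] * [nU x nS]
        // ... same idea for theta, using V[t,j] and beta ...
    }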

> # =========================
> # = Timing and Efficiency =
> # =========================
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
           warmup    sample         
chain:1 11.553115 17.458463 29.01158
chain:2  7.164168 16.078000 23.24217
chain:3  4.595384 10.303367 14.89875
chain:4  6.261785  9.526897 15.78868
> (max_time <- max(timing))
[1] 29.01158
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 60.41214
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 41.35934
> max_time/neff_mu
[1] 0.4802276

29.0 sec

0.48 sec per sample

rBatt commented 8 years ago

9271c9d0f31f3f049f5c8225c6e41ff4b0fa39cb

Tried vectorizing the lp for obs and unobs spp (excluding never-observed species). It was a little bit faster before I tried fiddling with the functions (I think seconds per effective sample was around 0.56, and total time around 21 sec).

This is ~70% of the previous time, but slightly worse time per effective sample.

> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
           warmup    sample          
chain:1 10.782976 12.161932 22.944908
chain:2  7.041526  4.151737 11.193263
chain:3  2.299043  3.452396  5.751439
chain:4  1.762490  2.797118  4.559608
> (max_time <- max(timing))
[1] 22.94491
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 31.77707
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 36.31423
> max_time/neff_mu
[1] 0.7220587

22.9 sec

0.72 sec per sample

rBatt commented 8 years ago

I used non-centered parameterizations for alpha and beta. This means I can do alpha_raw ~ normal(0, 1) and then compute alpha <- alpha_mu + alpha_sd*alpha_raw, using alpha everywhere I had previously used it. I was very surprised and pleased with the speed-up.
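For reference, the pattern looks roughly like this (declarations sketched with assumed shapes; the same applies to beta):

    parameters {
        matrix[nU,nS] alpha_raw;      // standardized (non-centered) effects
        vector[nU] alpha_mu;          // hyper-means
        vector<lower=0>[nU] alpha_sd; // hyper-scales
    }
    transformed parameters {
        matrix[nU,nS] alpha;
        for (u in 1:nU)
            // implies alpha[u,s] ~ normal(alpha_mu[u], alpha_sd[u])
            alpha[u] <- alpha_mu[u] + alpha_sd[u]*alpha_raw[u];
    }
    model {
        to_vector(alpha_raw) ~ normal(0, 1); // the non-centered prior
    }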

Oh, I also decreased step size to 0.01. This helped a bit, too.

> # =========================
> # = Timing and Efficiency =
> # =========================
> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
          warmup   sample         
chain:1 0.698877 0.676139 1.375016
chain:2 1.025372 1.382666 2.408038
chain:3 3.863502 1.006282 4.869784
chain:4 2.263946 1.204742 3.468688
> (max_time <- max(timing))
[1] 4.869784
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 66.93933
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 37.45098
> max_time/neff_mu
[1] 0.07274922

4.9 sec

0.07 sec per sample

Note: I think these timings are being slowed down by heavy processor usage from the full-EBS model I'm running (it's taken ~1.5 days so far, and I want to let it finish). When I run this non-centered version with only 1 chain and 1 core, it takes ~1 second.

rBatt commented 8 years ago

Making Omega hierarchical was really slow (~1 order of magnitude slower), because it now has the hyperparameters, too. The only difference is that the hierarchical version doesn't specify what the hyperdistribution is, because its parameters are fit instead.

rBatt commented 8 years ago

I noticed that I made an error in my most recent version of the model, after aggregating the presence process to the site level: I was still incrementing log_inv_logit(logit_psi) for each sample, instead of just once per site.

rBatt commented 8 years ago

I aggregated the covariates among samples within a site-year, even for the detection covariates. I then modelled the covariates as random variables, using their sample means and standard deviations. This made it easy to fix the errors I had before (statistical, not programming), where I was incrementing the term lil_lp (which is log_inv_logit(logit_psi)) for each sample ... yikes! Presence is only defined at the site-year level in this model; that error came from forgetting to update this part of the formulation from back when presence was modelled down to the sample level.
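In sketch form, for an observed species the site-year contribution is now just (names illustrative; psi and theta on the logit scale as before):

    // presence enters once per site-year, and detection collapses to a single
    // binomial over the nK[t,j] hauls rather than one term per sample
    lp[s] <- log_inv_logit(logit_psi_tj[s])
        + binomial_logit_log(X[t,j,s], nK[t,j], logit_theta_tj[s]);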

All said, this was a huge speed-up.

> (timing <- cbind(get_elapsed_time(ebs_msom), rowSums(get_elapsed_time(ebs_msom))))
          warmup   sample         
chain:1 0.272825 0.198744 0.471569
chain:2 0.401100 0.259313 0.660413
chain:3 1.533286 0.451326 1.984612
chain:4 0.296010 0.412060 0.708070
> (max_time <- max(timing))
[1] 1.984612
> (neff_mu <- mean(summary(ebs_msom)$summary[,"n_eff"]))
[1] 77.53699
> (neff_sd <- sd(summary(ebs_msom)$summary[,"n_eff"]))
[1] 33.56467
> max_time/neff_mu
[1] 0.02559568

1.98 sec

0.03 sec per sample

Note: Like the previous timing, I think this set of stats might be misrepresenting the actual speed of this model. If I run 1 chain on 1 core, it finishes in 0.48 seconds. Also, I suspect this model will scale really well, especially for large numbers of species (the likelihoods are vectorized over species) or large numbers of samples (K; detection uses a binomial, so it handles larger K very efficiently, and increasing the number of hauls is almost trivial). I think it will scale most poorly with the number of sites (J; there is some matrix algebra where the matrices have J rows, which could get slow if J gets huge, plus a loop over J), and roughly linearly with the number of years (T; though the slope of that line will probably be less than 1).

rBatt commented 8 years ago

OK, I think the efficiency issue has been addressed. Everything left to play with is basically model structure, which could affect the speed, but that's outside the scope of this issue. I think the low-hanging fruit of optimization has been harvested.