stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Doing the decomposition for every momentum draw is inefficient #2881

Open bbbales2 opened 4 years ago

bbbales2 commented 4 years ago

Summary:

For N parameters, an NxN Cholesky decomposition is computed every time a new momentum is drawn. In reality it only needs to be recomputed when the metric changes.

Description:

Check the code here: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/hamiltonians/dense_e_metric.hpp#L54

That llt() only needs to be computed when z.inv_e_metric_ changes, not every time sample_p is called.
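For illustration, here is a minimal sketch of the idea, not the actual dense_e_metric implementation (the class and member names below are made up): the llt() is computed once when the metric is set, and sample_p reuses the cached factor so each draw is only a triangular solve.

// Sketch only: cache the Cholesky factor of the inverse metric and reuse it.
#include <Eigen/Dense>
#include <random>

class cached_dense_metric {
 public:
  // Recompute the factorization only when the metric changes (O(N^3), once).
  void set_inv_metric(const Eigen::MatrixXd& inv_e_metric) {
    inv_e_metric_ = inv_e_metric;
    llt_.compute(inv_e_metric_);
  }

  // Draw p with covariance inv_e_metric^{-1} using the cached factor (O(N^2) per draw).
  Eigen::VectorXd sample_p(std::mt19937& rng) const {
    std::normal_distribution<double> normal(0.0, 1.0);
    Eigen::VectorXd u(inv_e_metric_.rows());
    for (Eigen::Index i = 0; i < u.size(); ++i)
      u(i) = normal(rng);
    // inv_e_metric = L L^T, so solving L^T p = u gives p ~ N(0, inv_e_metric^{-1}).
    return llt_.matrixU().solve(u);
  }

 private:
  Eigen::MatrixXd inv_e_metric_;
  Eigen::LLT<Eigen::MatrixXd> llt_;
};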

Reproducible Steps:

Sampling a simple model with a large number of parameters should be sufficient.

parameters {
  real x[500];
}

model {
  x ~ normal(0, 1);
}

Should do the trick. Run that model with:

./test sample num_warmup=0 adapt engaged=0 algorithm=hmc metric=dense_e

And compare the time with:

./test sample num_warmup=0 adapt engaged=0 algorithm=hmc metric=diag_e

The difference should be noticeable once that Cholesky is precomputed.

Current Output:

Output is fine, just slow.

Expected Output:

Same output.

Current Version:

v2.21.0

bob-carpenter commented 4 years ago

Yikes. We had explicitly discussed coding it originally so it didn't have this inefficiency. I guess that didn't make it to the code.

SteveBronder commented 4 years ago

Should it only happen when set_metric is called? Could also move the metric to a protected or private member so it's not accessed outside of set_metric

bbbales2 commented 4 years ago

Should it only happen when set_metric is called? Could also move the metric to a protected or private member so it's not accessed outside of set_metric

Yes that makes sense. Right now it's accessed directly by other things: https://github.com/stan-dev/stan/blob/develop/src/stan/mcmc/hmc/nuts/adapt_diag_e_nuts.hpp#L32, but an accessor makes more sense.

yizhang-yiz commented 4 years ago

Is a fix under way? This slows things down quite a bit when I use the Radon model for testing.

betanalpha commented 4 years ago

The current design is intentional.

The Cholesky is needed only once per transition, which is a relatively small cost compared to the many gradient evaluations needed within each transition. Saving the Cholesky decomposition introduces an additional $\mathcal{O}(N^{2})$ memory burden, which is much more than the $\mathcal{O}(N)$ burden everywhere else, and becomes problematic for sufficiently large models. Without any explicit profiling demonstrating that the Cholesky is a substantial cost for typical models, I don't see any strong motivation for the change.

I am inclined to close this issue until less anecdotal evidence of hotspots in the current code is demonstrated.

SteveBronder commented 4 years ago

@bbbales2 would you mind running some benchmarks to see if #2894 makes things faster/slower for a large/small model?

bbbales2 commented 4 years ago

Running the example code included at the top of this issue:

diagonal (pull #2894): 0.26s
dense (pull #2894): 0.65s

diagonal (develop): 0.25s
dense (develop): 3.5s

bob-carpenter commented 4 years ago

That's a huge speedup for dense matrices, which is where we're already paying an O(N^2) memory penalty just for storing the dense metric.

Is there a big memory penalty for the diagonal case? That shouldn't need a Cholesky decomposition.

bbbales2 commented 4 years ago

Is there a big memory penalty for the diagonal case?

I just put it there for symmetry and it was easy to get. We changed how the rng code is written but that's just a syntax thing: https://github.com/stan-dev/stan/pull/2894/files#diff-6a891b298c8e18322df7f82a1a362732R48

Edit: Oh I missed the question sorry. This uses no extra memory for the diagonal case.

SteveBronder commented 4 years ago

Nice! @betanalpha you cool with that?

betanalpha commented 4 years ago

Unfortunately I am not, as this example isn't relevant to the cases where a dense metric would be necessary. In particular, the density, and hence the gradient, is artificially cheap due to the assumed independence, and the number of leapfrog steps is relatively small, which makes the overhead look more important than it is. A dense metric is useful when the target density has global correlations, and those typically cost at least O(N^{2}) to evaluate, not to mention requiring more leapfrog steps per numerical trajectory.

At the very least I would want to see how the overhead behaves for a correlated normal and student t, say with Sigma_{ij} = rho^{|i - j|}, for a few choices of rho like 0.25, 0.5, and 0.75 and a span of dimensions, say 50, 100, and 500. That wouldn't be exhaustive by any means but it would provide a much better picture of what a more realistic overhead would be.
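Purely as an illustration of that test setup (this helper is not part of Stan; the function name is made up), the suggested covariance can be built directly:

// Build the suggested test covariance Sigma_{ij} = rho^{|i - j|}.
#include <Eigen/Dense>
#include <cmath>

Eigen::MatrixXd ar1_covariance(int N, double rho) {
  Eigen::MatrixXd Sigma(N, N);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      Sigma(i, j) = std::pow(rho, std::abs(i - j));
  return Sigma;
}

// e.g. rho in {0.25, 0.5, 0.75} and N in {50, 100, 500}, as suggested above.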

Although it limits the practicality of testing, the choice to recompute was made with much higher dimensional models in mind, i.e. tens of thousands and hundreds of thousands. It's at those larger models where the memory starts to become problematic.

bbbales2 commented 4 years ago

Yeah I was trying to pick an example that exaggerated the differences.

Adaptation is turned off, but it's for a problem where the default will be fine. Just directly comparing the costs of doing the ideal thing (diagonal) to the expensive thing (dense).

I hit this problem originally with the same model Yi did: https://github.com/bbbales2/cmdstan-warmup/tree/develop/examples/radon . It's just a super simple model, and it makes sense to run it with diag, but it's artificially slow if you run it with dense because of this problem.

much higher dimensional models in mind

That's where you'd definitely want to compute the Cholesky only once, though. If N is the number of parameters, the Cholesky is O(N^3), right? So even if the memory is O(N^2) that's fine -- you're already paying that cost to have the metric to begin with.
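As a rough summary of the trade-off being argued here (standard operation counts, not measurements from Stan):

$$ \underbrace{\tfrac{1}{3} N^{3} \ \text{flops}}_{\text{Cholesky recomputed at every momentum draw}} \quad \text{vs.} \quad \underbrace{\mathcal{O}(N^{2}) \ \text{memory}}_{\text{one cached triangular factor}} $$

where the dense metric itself already takes $\mathcal{O}(N^{2})$ memory and each leapfrog step already pays an $\mathcal{O}(N^{2})$ metric-vector product.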