stan-dev / stan

Stan development repository. The master branch contains the current release. The develop branch contains the latest stable development. See the Developer Process Wiki for details.
https://mc-stan.org
BSD 3-Clause "New" or "Revised" License

Speculative dynamic HMC #2818

Open wds15 opened 5 years ago

wds15 commented 5 years ago

Summary:

The dynamic HMC algorithm can be parallelized by running the forward and backward sweeps independently. See the Discourse thread for a description and a decent evaluation:

https://discourse.mc-stan.org/t/parallel-dynamic-hmc-merits/10895

Using 2 cores, we can speed things up by about 35%.

Description:

A prototype implementation is here:

https://github.com/stan-dev/stan/blob/parallel-nuts-2/src/stan/mcmc/hmc/nuts/base_nuts.hpp#L159

That version is based on the 2.20.0 sampler.

The branch is not runnable without a special setup of Stan-math, since it requires the TBB. A variant of the branch https://github.com/stan-dev/math/tree/parallel-ad-tape-3 can make it run, but no attempt has been made yet to give clear instructions.

This issue documents that we can get a 35% speedup if we do speculative HMC.
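The idea in miniature: at each tree doubling the sampler needs an extension in a randomly drawn direction, and with a second core the other direction can be built speculatively at the same time. The following is a hedged C++ sketch of that structure only, not the prototype's actual code; `Subtree`, `build_subtree`, and the caching scheme are hypothetical stand-ins.

```cpp
#include <tbb/task_group.h>

// Hypothetical stand-ins for the sampler's real types and helpers.
struct Subtree { /* endpoints, proposal, multinomial weight, ... */ };
Subtree build_subtree(int direction, int depth) {
  /* 2^depth leapfrog steps in the given direction (elided) */
  return {};
}

// One doubling step: build the forward and the backward extension
// concurrently on two cores. The drawn direction is consumed; the other
// subtree need not be wasted, since it is the first half of a deeper
// extension in that direction and can be cached for later reuse.
Subtree speculative_doubling(int depth, bool go_forward, Subtree& cache) {
  Subtree fwd, bwd;
  tbb::task_group tg;
  tg.run([&] { fwd = build_subtree(+1, depth); });  // forward sweep
  tg.run([&] { bwd = build_subtree(-1, depth); });  // backward sweep
  tg.wait();
  cache = go_forward ? bwd : fwd;  // speculative half, kept for reuse
  return go_forward ? fwd : bwd;
}
```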

Reproducible Steps:

See the Discourse thread for all of this.

Current Output:

Expected Output:

Faster sampling.

Additional Information:


Current Version:

v2.20.0

yizhang-yiz commented 5 years ago

Is there a way to evaluate the memory footprint of pre-sampling to the maximum treedepth?

wds15 commented 5 years ago

More memory is needed, for sure... I don't know how much offhand.

SteveBronder commented 5 years ago

I've already mentioned this, but it may be worth throwing together a branch of Stan that combines a bigger DGP from the cmdstan benchmarks with this, so we can get an idea of how it works across a bunch of models.

SteveBronder commented 5 years ago

Tangential to this, but I liked Michael's suggestion of refactoring Stan a bit so that it can handle multiple chains and share data between them. That would also let us do weird things like adjusting the treedepth and max delta of the model based on how other chains are doing.

yizhang-yiz commented 5 years ago

Tangential to this, but I liked Michael's suggestion of refactoring Stan a bit so that it can handle multiple chains and share data between them. That would also let us do weird things like adjusting the treedepth and max delta of the model based on how other chains are doing.

+1 to this. Could you point me to where he made the suggestion?

wds15 commented 5 years ago

He made the suggestion in the parallel HMC thread... and I do agree.

I do want to look into that - imagine you fire up 6 chains and pool the warmup info. That would speed up sampling a lot. The crux is that Stan is not tailored at all to handle multiple chains at once. The one-chain assumption is hard-coded deep into the services (which is why speculative HMC is easier from this perspective, since it speeds up one chain).

My plan is to write a version which fires off many chains for warmup, pools that info, and then discards it to run just one chain. If you do that, then you can basically do one run for warmup and then use the restarting facilities we already have to get what you really want (multiple chains starting from a common warmup). At least this is a first quick way to test the merits of all of this.

yizhang-yiz commented 5 years ago

I still don't get why you said using 4 cores in this setting is not possible. Could you elaborate?

SteveBronder commented 5 years ago

I think the biggest issue with multiple chains in one Stan program is that our upstream services assume there is one chain.

It would look a bit odd, but essentially we would just want an 'algorithm' class that takes in N instances of a model, and every few iterations pulls some meta-information from them and sends some back.
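A hedged sketch of what such a coordinator class could look like; none of these names (`ChainState`, `SharedInfo`, `MultiChainAlgorithm`) exist in the Stan services today, they are hypothetical stand-ins for the shape of the idea.

```cpp
#include <utility>
#include <vector>

// Hypothetical types: per-chain sampler state and the pooled meta-info.
struct ChainState { /* sampler, adaptation state, draws, ... */ };
struct SharedInfo { /* pooled metric estimate, stepsize stats, ... */ };

class MultiChainAlgorithm {
 public:
  explicit MultiChainAlgorithm(std::vector<ChainState> chains)
      : chains_(std::move(chains)) {}

  void run(int num_iterations, int exchange_every) {
    for (int it = 1; it <= num_iterations; ++it) {
      for (auto& chain : chains_)
        step(chain);                    // one sampler transition per chain
      if (it % exchange_every == 0) {
        SharedInfo info = pull_info();  // gather meta-info from all chains
        push_info(info);                // send adjustments back to them
      }
    }
  }

 private:
  std::vector<ChainState> chains_;
  void step(ChainState& chain) { /* one HMC transition (elided) */ }
  SharedInfo pull_info() const { return SharedInfo{}; /* pooled stats */ }
  void push_info(const SharedInfo& info) { /* update chain settings */ }
};
```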

SteveBronder commented 5 years ago

I was thinking about that when I was reading over the HINTS paper; they seem to do something like that, but with the temperature of their annealing scheme (it wasn't exactly annealing, but similar).

SteveBronder commented 5 years ago

@yizhang-yiz see the bottom of this thread:

https://discourse.mc-stan.org/t/parallel-dynamic-hmc-merits/10895/13?u=stevo15025

wds15 commented 5 years ago

Why can't we have a multi-chain warmup that terminates once warmup is done? Then CmdStan or whichever other interface fires up multiple chains with the adapted sampler obtained that way. This way the services still handle only one chain at a time, and it's just the interface that makes this work... or some callback in the Stan services itself.

We should find a way there.

SteveBronder commented 5 years ago

^I also like that!

wds15 commented 5 years ago

@yizhang-yiz Our dynamic HMC implementation goes randomly forward and backward in time. Right now that is done serially. The speculative HMC allows going forward and backward in time simultaneously. Since there is only backward and forward, that's it: no more cores can be spent on that. Makes sense?

bob-carpenter commented 5 years ago

I started a discussion on Discourse about how to evaluate parallelization of the log density or the MCMC algorithm.

https://discourse.mc-stan.org/t/evaluating-parallelization-performance/10962

bob-carpenter commented 5 years ago

On Sep 12, 2019, wds15 wrote:

He made the suggestion in the parallel HMC thread... and I do agree.

I do want to look into that - imagine you fire up 6 chains and pool the warmup info. That would speed up sampling a lot. The crux is that Stan is not tailored at all to handle multiple chains at once. The one-chain assumption is hard-coded deep into the services (which is why speculative HMC is easier from this perspective, since it speeds up one chain).

The services code up specific algorithms, but it'd be trivial to write new ones or extend the existing ones for multiple chains.

My plan is to write a version which fires off many chains for warmup, pools that info, and then discards it to run just one chain. If you do that, then you can basically do one run for warmup and then use the restarting facilities we already have to get what you really want (multiple chains starting from a common warmup). At least this is a first quick way to test the merits of all of this.

This would be great.

bob-carpenter commented 5 years ago

On Sep 12, 2019, wds15 wrote:

Why can't we have a multi-chain warmup that terminates once warmup is done? Then CmdStan or whichever other interface fires up multiple chains with the adapted sampler obtained that way. This way the services still handle only one chain at a time, and it's just the interface that makes this work... or some callback in the Stan services itself.

We should find a way there.

Yes. This is exactly what I'm suggesting. If we can code it, I don't think writing a new service is going to be a bottleneck to launching it.

wds15 commented 5 years ago

And the coolest thing about the threaded warmup is that no changes to Stan-math's reverse-mode AD are needed. Everything will work with the existing infrastructure... though figuring out how to manage the TBB threads from Stan (instead of Stan-math) would be good, I think; let's see.

bob-carpenter commented 5 years ago

I think thread management will be super important if we're running multiple chains, as one direction is likely to consume half the resources or so of the other, and we don't want cores idling if we can find work for them to do.

wds15 commented 5 years ago

I agree. The TBB has the concept of task_arenas, which are a way to fence work off from one another. My intuition is that we should manage these task_arenas from within Stan and not from Stan-math.
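For reference, here is minimal standalone usage of the TBB facility under discussion (standard `tbb::task_arena` API, not Stan code): each chain could be handed its own arena with a fixed share of the cores, so its work is fenced off from the others.

```cpp
#include <tbb/parallel_for.h>
#include <tbb/task_arena.h>

int main() {
  // Two arenas, each limited to 2 worker threads: work submitted to one
  // arena cannot steal the threads reserved for the other.
  tbb::task_arena chain_a(2);
  tbb::task_arena chain_b(2);

  chain_a.execute([] {
    tbb::parallel_for(0, 1000, [](int i) { /* chain A's work */ });
  });
  chain_b.execute([] {
    tbb::parallel_for(0, 1000, [](int i) { /* chain B's work */ });
  });
}
```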

yizhang-yiz commented 5 years ago

I think thread management will be super important if we're running multiple chains, as one direction is likely to consume half the resources or so of the other, and we don't want cores idling if we can find work for them to do.

If we are willing to move from threading to distributed computing (MPI), problems like this would be much easier to solve. For the warmup-pooling concept, all we need is to run multiple chains, wait until convergence, and collect where the samplers are.

bob-carpenter commented 5 years ago

All the chains can share information, as they're all estimating a global covariance. So ideally they'd share as they go along, because better estimates of the covariance early on mean better sampling in the next stage. It's very simple aggregate sharing, as all we're doing is estimating the covariance from the collected draws. One issue is how to balance chains: by draw, by chain, or somewhere in between?
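To make "simple aggregate sharing" concrete, here is a hedged sketch of pooling running moments across chains with Eigen. It is not Stan's actual estimator code (Stan keeps similar Welford-style accumulators); merging in fixed chain-id order keeps the pooled estimate deterministic.

```cpp
#include <vector>
#include <Eigen/Dense>

// Per-chain running sums of draws and their outer products; a sketch,
// not Stan's actual estimator classes.
struct RunningMoments {
  double n = 0;            // number of draws accumulated
  Eigen::VectorXd sum;     // sum of draws
  Eigen::MatrixXd outer;   // sum of outer products of draws
  explicit RunningMoments(int dim)
      : sum(Eigen::VectorXd::Zero(dim)),
        outer(Eigen::MatrixXd::Zero(dim, dim)) {}
  void add(const Eigen::VectorXd& draw) {
    n += 1;
    sum += draw;
    outer += draw * draw.transpose();
  }
};

// Pool the per-chain moments; iterating in chain-id order (never in
// thread-completion order) keeps the result bit-reproducible.
Eigen::MatrixXd pooled_covariance(const std::vector<RunningMoments>& chains) {
  const auto dim = chains.front().sum.size();
  double n = 0;
  Eigen::VectorXd sum = Eigen::VectorXd::Zero(dim);
  Eigen::MatrixXd outer = Eigen::MatrixXd::Zero(dim, dim);
  for (const auto& c : chains) {  // fixed order over chain ids
    n += c.n;
    sum += c.sum;
    outer += c.outer;
  }
  const Eigen::VectorXd mean = sum / n;
  // Sample covariance: (sum of outer products - n * mean * mean^T) / (n - 1).
  return (outer - n * mean * mean.transpose()) / (n - 1.0);
}
```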


wds15 commented 5 years ago

I will start simple - which means by chain. If you do it by draw, then you instantly run into exact reproducibility hiccups unless you sort the data as you accumulate, which is a pain.

So with KISS in mind (keep it simple), I'll start with by-chain.

yizhang-yiz commented 5 years ago

Our dynamic HMC implementation goes randomly forward and backward in time. Right now that is done serially. The speculative HMC allows going forward and backward in time simultaneously. Since there is only backward and forward, that's it: no more cores can be spent on that. Makes sense?

You keep talking about parallel HMC but actually mean NUTS. This is where I got confused.

betanalpha commented 5 years ago

I think it will be helpful to clarify some of the current structure of the Stan services. Right now all calls to HMC methods (both with static and dynamic integration times) run an entire chain, including warmup and the main sampling iterations. Any sharing of information before the chains have completed will require modifying this pattern.

The most straightforward approach, and the one that I recommend, is to modify the main API routes to take in a num_chains argument that handles the running of multiple chains. First in serial, then in thread parallel. Once that has all been piped through we can consider sharing information at various stages of the Markov chain construction.

Given that the services will have to change one way or another, it would be worth starting with some initial design specs to ensure no conflicts with future plans for the services.
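A hedged sketch of what such a route could look like, loosely modeled on the shape of the existing single-chain `hmc_nuts_diag_e_adapt` route; the `num_chains` overload and the stand-in callback types below are hypothetical, not current API.

```cpp
#include <vector>

// Hypothetical stand-ins for the real stan::callbacks types.
struct interrupt;
struct logger;
struct writer;

// Sketch of a multi-chain route: the structural change to the existing
// single-chain signature is a num_chains argument plus one writer per
// chain, so the route itself owns the chain loop (first serially, later
// in thread parallel).
template <class Model>
int hmc_nuts_diag_e_adapt(Model& model, int num_chains, unsigned int seed,
                          /* ...sampler and adaptation config... */
                          interrupt& cb_interrupt, logger& cb_logger,
                          std::vector<writer*>& init_writers,
                          std::vector<writer*>& sample_writers,
                          std::vector<writer*>& diagnostic_writers);
```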

bob-carpenter commented 5 years ago

I agree that this is what we need to do going forward. It shouldn't be hard from the services side. The harder part's going to be evaluating alternative adaptation schemes. We really want to be monitoring adaptation for convergence here.


wds15 commented 5 years ago

... and we need to be careful to keep things deterministic... I don't want the workload on the machine to determine the exact numerical output. It's a non-trivial restriction in this setting, as I'm finding.

yizhang-yiz commented 5 years ago

Can we replace "speculative" with another term? Because to me, in MCMC parallelization it has the specific meaning of "prefetching", like the method proposed by Elaine Angelino et al. That method applies to HMC, not just NUTS, so it's really speculative HMC, and it scales better (at least in theory).

bob-carpenter commented 5 years ago

This is a much bigger issue about requiring bit-level numerical reproducibility for every possible Stan configuration. Has that been decided as a top-level requirement for every Stan algorithm?


yizhang-yiz commented 5 years ago

bit-level numerical reproducibility for every possible Stan configuration

Is that even possible?

bob-carpenter commented 5 years ago

You have to keep hardware and versions of software fixed, so no, not in full generality. Java tried to go down the route of reproducible numerics, but soon learned their lesson and went back to IEEE floating point.


wds15 commented 5 years ago

We have not yet decided that... but I think people have different expectations here. I would prefer things to be exactly reproducible by default. If we deviate from that, it should be documented and justified... and most importantly, there must be a way to make things exactly reproducible.

yizhang-yiz commented 5 years ago

Even if hardware and software are fixed, determinism is hard when it comes to multi-threading. The TBB specifically avoids promising it.


wds15 commented 5 years ago

I know... it's very hard...

betanalpha commented 5 years ago

For interleaving chains this is straightforward, especially as the chains need carefully chosen seeds anyway. A call to a chain-bundle route would handle assigning sequential chain ids to each chain, which then ensures independent PRNGs. Then all operations that share information between the chains are just processed in the order of those chain ids. The worst that happens is that no shared calculations can be computed until all chains finish each segment, which is fine given how cheap the adaptation calculations are relative to the chain evolutions.
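A sketch of the scheme described here, using standard-library RNGs as stand-ins for Stan's boost-based ones: one master seed plus sequential chain ids yields an independent deterministic stream per chain, and any cross-chain operation then runs in chain-id order.

```cpp
#include <cstdint>
#include <random>
#include <vector>

// One master seed plus a sequential chain id gives every chain its own
// deterministic stream. std::seed_seq is a stand-in for Stan's actual
// boost-based RNG derivation, not the real implementation.
std::mt19937_64 make_chain_rng(std::uint64_t seed, std::uint64_t chain_id) {
  std::seed_seq seq{seed, chain_id};
  return std::mt19937_64(seq);
}

int main() {
  const std::uint64_t seed = 1234;
  std::vector<std::mt19937_64> rngs;
  for (std::uint64_t id = 1; id <= 4; ++id)  // sequential chain ids
    rngs.push_back(make_chain_rng(seed, id));
  // Cross-chain aggregation then iterates the chains in id order, never
  // in thread-completion order, preserving bit-wise reproducibility.
}
```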


bob-carpenter commented 5 years ago

I don't think it's fine to have to synchronize by segment. The synch cost increases dramatically, as every thread waits for the last thread to complete. Total waiting time of the threads will increase as N * slowest-thread, where the slowest thread grows as a rank-N order statistic. This wouldn't be so bad with small N, or if there weren't such high time variance per iteration of our current samplers.

wds15 commented 5 years ago

I agree that synchronisation should be avoided. My idea was to start with synchronisation every now and then, maybe every 50 iterations or so - to start, I would just use the currently defined adaptation windows.

We really need a prototype first before making decisions on this...BTW, we should probably move this discussion to discourse.

betanalpha commented 5 years ago

The only place where bit-wise reproducibility is at risk is when computing summary statistics across chains, which requires waiting for all of the chains anyway. I'm just saying that when computing those summary statistics there's little benefit to aggregating partial computations while waiting for the other chains, so we might as well fully synchronize when we need to communicate, at which point we can ensure bit-wise reproducibility by enforcing a set order of the chains.


bob-carpenter commented 5 years ago

On Sep 19, 2019, Michael Betancourt wrote:

The only place where bit wise reproducibility is at risk

I'm imagining an adaptation algorithm with a lot of threads updating summary statistics for a joint covariance estimate. If you use asynchronous lock-free approaches like compare-and-swap, then you lose bit-level reproducibility.
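The underlying problem is that floating-point addition is not associative, so any reduction whose grouping depends on thread scheduling (as with compare-and-swap accumulation) can change the result bit-for-bit. A two-line illustration in C++:

```cpp
#include <cstdio>

int main() {
  double a = 1.0, b = 1e16, c = -1e16;
  // (a + b) rounds to 1e16, so the left grouping prints 0, the right 1:
  // the order in which partial sums are combined changes the result.
  std::printf("%g vs %g\n", (a + b) + c, a + (b + c));
}
```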