stan-dev / rstan

RStan, the R interface to Stan
https://mc-stan.org

Feature: Load balancing multi-core runs with more chains than cores #246

Closed davidmanheim closed 8 years ago

davidmanheim commented 8 years ago

Problem Description: When I run, say, 30 chains on 5 cores, RStan seems to pre-allocate the chains to the cores. This means, for instance, that if chain 1 is really slow and the others are fast, chains 6-30 can all complete while chains 2-5 are still waiting to run. This seems like bad behavior for some (read: my specific) uses.

Details: It's buried deep in the code, in the stanmodel's setMethod("sampling") definition, which uses parLapply instead of parLapplyLB on line 275. The load-balanced version ("...LB") does not produce reproducible results, which is an issue, but for some applications that is better than not load balancing.

I can fork and put in a pull request to make this single line change, but I assume the preference is for this to be a user-set option, not just a switch to a non-reproducible way of running multicores, and that requires a bit more extensive changes.
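To make the scheduling difference described above concrete, here is a minimal simulation sketch. It is not RStan or R code. The durations are hypothetical (one slow chain, 29 fast ones), and the static scheduler assumes tasks are pre-assigned in contiguous, nearly equal chunks, which is my understanding of how parLapply divides its input. The balanced scheduler hands the next task to whichever worker frees up first, which is the general idea behind parLapplyLB.

```python
import heapq


def static_makespan(durations, n_workers):
    """Total wall time when tasks are pre-allocated in contiguous chunks.

    Assumption: chunks are as equal in size as possible, one per worker.
    """
    base, extra = divmod(len(durations), n_workers)
    finish_times = []
    start = 0
    for worker in range(n_workers):
        size = base + (1 if worker < extra else 0)
        # A worker runs its whole chunk back to back.
        finish_times.append(sum(durations[start:start + size]))
        start += size
    return max(finish_times)


def balanced_makespan(durations, n_workers):
    """Total wall time when each task goes to the first worker to become free."""
    free_at = [0.0] * n_workers  # min-heap of the times at which workers are idle
    heapq.heapify(free_at)
    for duration in durations:
        earliest = heapq.heappop(free_at)
        heapq.heappush(free_at, earliest + duration)
    return max(free_at)


# Hypothetical workload: chain 1 takes 10 time units, the other 29 take 1 each.
durations = [10.0] + [1.0] * 29
static_total = static_makespan(durations, 5)
balanced_total = balanced_makespan(durations, 5)
print("static:", static_total, "balanced:", balanced_total)
```

Under these assumptions the statically scheduled run is held up by the fast chains queued behind the slow one, while the balanced run finishes as soon as the slow chain itself does.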

bob-carpenter commented 8 years ago

May I ask why you're running 30 chains?

If there's a way that makes it non-reproducible, there should probably be a flag to keep the current behavior.

davidmanheim commented 8 years ago

On the change, I think having the default remain reproducible would be good, but allow load balancing as an option. It could be valuable for running 10 chains on 2 cores just as much as my more extreme case.

Re: why, I was hoping to check whether there were any problems with the model.

Michael Betancourt pointed out in the forum that; "Markov chains are valid only when they are well-behaved from any starting point. Even one chain misbehaving diagnoses pathologies that affect the others — basically there may be a pathology region of parameter space but only one chain was lucky enough to find it. The more chains the more likely you are to identify such pathologies, which is why we recommend running as many chains as you can."

Since I have a decently large machine on which I could run a bunch of chains over a weekend, I thought it would be a reasonable thing to try. (We can move this part of the discussion to the forum.)

betanalpha commented 8 years ago

Yeah, but if one chain is stalling then you already know that there is a problem and you don’t need the others to finish.

davidmanheim commented 8 years ago

Hi Michael, that's true. I was thinking of slow chains, not stalled ones, which is what was happening: some chains run much faster than others.

(In fact, a later chain is now stalled, after 22 hours. I thought I fixed the model; I guess specifying a tighter prior or some initial values may help.)

bgoodri commented 8 years ago

I think we can do this because all the pseudo-random number generation is done either in the parent R process before the chains are executed or by Stan in the child processes, each of which has its own PRNG. But that is not going to help much if a chain gets stuck.
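The reasoning above can be illustrated with a small sketch, again in Python rather than R. It assumes each chain owns a generator seeded only from a base seed and the chain's id. The function names and the seeding formula are hypothetical, not Stan's actual scheme. If that assumption holds, the order in which a scheduler happens to run the chains cannot change any chain's draws.

```python
import random


def run_chain(base_seed, chain_id, n_draws=3):
    # Hypothetical stand-in for one chain: a private generator whose seed
    # depends only on the base seed and the chain id.
    rng = random.Random(base_seed * 1000 + chain_id)
    return [rng.random() for _ in range(n_draws)]


def run_all(base_seed, chain_ids, execution_order):
    # Run the chains in whatever order the scheduler picks, then report
    # the results in chain-id order.
    results = {}
    for chain_id in execution_order:
        results[chain_id] = run_chain(base_seed, chain_id)
    return [results[chain_id] for chain_id in chain_ids]


chain_ids = list(range(1, 7))
in_order = run_all(42, chain_ids, chain_ids)
out_of_order = run_all(42, chain_ids, chain_ids[::-1])  # e.g. a load-balanced run
print(in_order == out_of_order)
```

A design that instead drew every chain's numbers from one shared stream at run time would tie the results to execution order, which is where load balancing would break reproducibility.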