Closed mitzimorris closed 3 years ago
With 2.28 being released next week (assuming the release candidate is good to go), do we want to push this into 1.0 or should it wait for a 1.1?
push into 1.0 if possible, but then tracking progress becomes somewhat more complicated. doable, though.
We can look at this after #461 is where we want it to be and 2.28 is out/on conda. We can always decide at that point to punt it until later
I think we should start brainstorming what the interface for this looks like and how we will implement it. The first question I have is how we handle STAN_THREADS: currently our installation/building doesn't touch make/local at all. It would be great if we could use the rebuild command to both enable and (probably less useful, but good for consistency) disable this from within Python. It requires a little bit of trickery if we want it to be robust even if the user edits the file.
updating this to correct misconceptions:
CmdStanModel object instantiation choices:
info method on model - if that succeeds and STAN_THREADS=true, record this on the model object - if so, new property or add to cpp_options?
compile method do?
sample method choices:
NUM_CHAINS whenever possible because of increased memory efficiency?
sample method argument names, and instead handle the different cases correctly?
original comment:
The first question I have is how do we handle STAN_THREADS
My first question as well, but on further reflection, I think that we shouldn't try to tie these things together.
In CmdStan, NUM_CHAINS can also be used to run multiple chains (sequentially), without the need for shell scripting, which is what the CmdStanPy sample method arg chain gives you (although with option to run chains in parallel).
The CmdStan guide recommends using make/local - https://mc-stan.org/docs/2_28/cmdstan-guide/parallelization.html#compiling
we recommend writing the flag to the make/local file.
I think this recommendation is here because CmdStan is a very simple interface, thus this will provide consistency across compilation and runs for a local installation (and save typing!).
In CmdStanPy, the model object can track this via property https://mc-stan.org/cmdstanpy/api.html#cmdstanpy.CmdStanModel.cpp_options, so I don't think we should worry about make/local.
In CmdStanPy, if a model is compiled during one session, during that session the cpp_options property of the CmdStanModel object will reflect whether or not STAN_THREADS was specified at compile time. In all other cases, we don't know, nor does make/local help, because we don't know what state that file was in when the program was compiled.
@WardBrian @jgabry @rok-cesnovar - thoughts? discuss tomorrow?
I was under the impression that STAN_THREADS had to be set at cmdstan build time, not just model compile time. Is that correct?
I’ve talked with @wds15 about this and he said that for single-threaded performance, STAN_THREADS has minimal impact on MacOS/Linux but up to a 20% hit on Windows, depending on the setup. Setting it true by default is therefore a difficult call, IMO. For Conda builds, I have decided not to set it by default at the moment.
I was under the impression that STAN_THREADS had to be set at cmdstan build time, not just model compile time. Is that correct?
if so, this isn't properly addressed in the 2.28 CmdStan Guide. I wrote the above comments based solely on what's in the CmdStan Guide, thinking that this would reflect how things finally landed. (yes, lazy. about to test on my machine).
@Stevebronder @wds15 can you weigh in? I’m not sure where I got this impression
I was under the impression that STAN_THREADS had to be set at cmdstan build time, not just model compile time. Is that correct?
This has not been the case since, I believe, 2.25 or 2.26, after https://github.com/stan-dev/cmdstan/pull/882
Essentially, the first time STAN_THREADS is used with a given CmdStan install, main_threads.o is compiled, which takes a bit longer. But after that it's fine, the same as any compilation.
In all other cases, we don't know, nor does make/local help, because we don't know what state that file was in when the program was compiled.
You can now, by calling model info. See https://github.com/stan-dev/cmdstan/pull/1010 (this is already part of 2.27). Will respond on the arguments and proposal a bit later or tomorrow, have to think about it a bit more.
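A minimal sketch of how a wrapper could read that report, assuming the info output is a series of key=value lines that includes a STAN_THREADS entry. The function names and the sample text below are illustrative only, not CmdStanPy API; the real text would come from running the compiled executable with the info argument.

```python
from typing import Dict


def parse_model_info(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines into a dict, skipping lines without '='."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            info[key.strip()] = value.strip()
    return info


def compiled_with_threads(text: str) -> bool:
    """Return True only if the report explicitly says STAN_THREADS=true."""
    return parse_model_info(text).get("STAN_THREADS", "false").lower() == "true"


# Illustrative text only (assumed layout), standing in for real `info` output.
sample_output = """\
stan_version_major = 2
stan_version_minor = 28
STAN_THREADS=true
STAN_OPENCL=false
"""

print(compiled_with_threads(sample_output))
```

Treating a missing STAN_THREADS entry as false keeps the pre-2.27 case (no info support) on the existing one-process-per-chain path.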
what about the pre-compiled header file? cf: https://github.com/stan-dev/cmdstan/blob/e48eae9463ac7a90baee18cfe46ecf5c8491674f/makefile#L124-L125
so when model is compiled with STAN_THREADS=true, Makefile logic requires gch file named "stan/src/stan/model/model_header_threads.hpp.gch", and if that doesn't exist, will build it?
update: answered my own question - it all just works! we don't have to worry about make/local.
make STAN_THREADS=true build - that created header file model_header_threads.hpp.gch
make examples/bernoulli/bernoulli - that created header file model_header.hpp.gch
I have updated the issue description with all the details from above discussion.
my proposal is this: when the sample method is called, if we can determine that STAN_THREADS=true, then we run single-process multi-threaded using num_chains and num_threads.
this means that we can leave all arguments to the sample method as they already are. under the hood, dealing with the logic is a bit of a PITA, but it always was.
the remaining question is whether or not to compile with STAN_THREADS=true by default. if this doesn't negatively affect performance on any platform, then we could.
the biggest win with this feature is for running models on large datasets; we should call this out in the documentation.
I think doing it automatically is a good default setting, but we should still provide an argument to explicitly enable or disable this logic (something like stan_threaded=None, with None meaning do the logic, true meaning assume it was compiled with the threads option, false meaning treat as if it wasn't)
Compiling with STAN_THREADS on by default can have a performance impact on Windows. Here's what @wds15 told me:
Our tests were done for single-core, single-chain performance tests. Under these circumstances I would say that turning on threading for macos and Linux is really ok. For Windows we look at an up to 20% performance loss. This could have changed by the new g++ 8 from the new rtools and this could look different due to now having to consider running 4 chains in parallel.
Note that Conda ships an older version of g++ than version 8. Without testing on the default settings (e.g. 4 chains) it's difficult to say.
when the sample method is called, if we can determine that STAN_THREADS=true, then we run single-process multi-threaded using num_chains and num_threads.
+1 to this. Just pointing out that this currently works for NUTS + diag_e or dense_e not all sampling.
the remaining question is whether or not to compile with STAN_THREADS=true by default.
I am opposed to going threaded by default. Last time anyone did any performance testing on this it had an effect on non-Mac systems. On Windows it was bad.
But regardless if there is a negative effect or not, this should not be done in CmdStan wrappers or CmdStan even. If there is no penalty for threading, we should enable it in Math by default and get rid of the #ifdef-ed C++ code. If there is a penalty, we should probably also not default to it.
the biggest win with this feature is for running models on large datasets; we should call this out in the documentation.
+1
I have a question about naming though. I hate naming things with a passion, but alas we will have to do it :)
The issue comes from the fact that this now changes how resources are divided. Previously we had parallel_chains and threads_per_chain. Parallel chains meant how many chains will try to run in parallel and threads per chain the maximum number of threads used for a chain.
With num_threads and single-process-multichain though, we specify the total number of available threads for both chains and within-chain stuff (maximum number of threads available - not all will always be used though). num_threads=4 might mean 4 chains will run on 4 threads with 1 thread per chain, or all four threads will be scheduled for one chain if it has a reduce_sum call. It is all left to the TBB scheduler to decide.
Do we just say num_threads=parallel_chains*threads_per_chain for now? Do we expose a new threads argument, but what to do with the current ones then?
I hate naming things with a passion, but alas we will have to do it :)
agreed
Do we just say num_threads=parallel_chains*threads_per_chain for now?
this is my preference.
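As a rough sketch of that mapping: the wrapper's three arguments collapse into CmdStan's two, with num_chains taken from chains and num_threads computed as parallel_chains*threads_per_chain. The function name and the fallback defaults (all chains in parallel, one thread per chain) are assumptions for illustration, not what CmdStanPy does today.

```python
from typing import Optional, Tuple


def to_cmdstan_threading(
    chains: int,
    parallel_chains: Optional[int] = None,
    threads_per_chain: Optional[int] = None,
) -> Tuple[int, int]:
    """Map (chains, parallel_chains, threads_per_chain) to (num_chains, num_threads)."""
    if chains < 1:
        raise ValueError("chains must be a positive integer")
    # Assumed defaults: run every chain in parallel, one thread per chain.
    parallel = chains if parallel_chains is None else min(parallel_chains, chains)
    threads = 1 if threads_per_chain is None else threads_per_chain
    if parallel < 1 or threads < 1:
        raise ValueError("parallel_chains and threads_per_chain must be positive")
    # num_threads is the total pool handed to the TBB scheduler, not a per-chain cap.
    return chains, parallel * threads


print(to_cmdstan_threading(4, 2, 3))  # (4, 6)
```

Note that the product is only an upper bound on the thread pool: how those threads are split between chains and within-chain work is still up to the scheduler.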
I think doing it automatically is a good default setting, but we should still provide an argument to explicitly enable or disable this logic (something like stan_threaded=None, with None meaning do the logic, true meaning assume it was compiled with the threads option, false meaning treat as if it wasn't)
this seems like a good idea. again, the name issue is seriously problematic. for the wrapper interfaces, the difference is between spawning a single subprocess or per-chain subprocess. therefore: per-chain-process? because if the model uses map_rect or reduce_sum it would still be threaded.
Interface:
force_one_process_per_chain is a boolean flag for sample() with a default of None/Null. N[one|ull] means we will use info to determine if STAN_THREADS was enabled. If it is false, always assume STAN_THREADS/2.28. If it is true, do current (pre-2.28) behavior always.
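The three-way flag described above could be resolved roughly like this. It is a sketch only: the function name and the info_reports_threads parameter are hypothetical, and the latter stands for whatever the wrapper learned from the model's info output.

```python
from typing import Optional


def use_single_process(
    force_one_process_per_chain: Optional[bool],
    info_reports_threads: bool,
) -> bool:
    """Decide whether to launch one multi-chain process (True) or one process per chain (False)."""
    if force_one_process_per_chain is None:
        # Default: trust what the compiled model reports about STAN_THREADS.
        return info_reports_threads
    # Explicit True forces the pre-2.28 per-chain behavior;
    # explicit False assumes the model was built with STAN_THREADS.
    return not force_one_process_per_chain


for flag in (None, True, False):
    for threads in (True, False):
        print(flag, threads, use_single_process(flag, threads))
```

With an explicit False the wrapper skips the info check entirely, so a model that was not compiled with STAN_THREADS would presumably need a clear error at run time.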
Docs:
Vignette/notebook showing how to use both ways, explaining how you can pass STAN_THREADS in cpp_opts or put it in make/local to enable always.
Parity:
The files that we output should match how cmdstan does it, even if we are manually running multiple chains
@WardBrian - just pushed branch https://github.com/stan-dev/cmdstanpy/compare/feature/436-num-chains?expand=1
Summary:
Update both the CmdStanPy APIs and documentation to allow users to use CmdStan's new single-process multi-chain processing, introduced in 2.28.
As of 2.28, CmdStan provides two kinds of parallel processing:
reduce_sum and map_rect (since 2.23 and 2.18, respectively)
The CmdStan arguments num_chains and num_threads control the amount of parallelization.
num_chains specifies the number of chains, equivalent to CmdStanX sampler arg chains
num_threads specifies the total number of threads available to the TBB scheduler across all chains.
The CmdStanX interfaces use 3 arguments altogether: chains, parallel_chains, and threads_per_chain, which can be mapped into num_chains and num_threads.
However we need to decide what the default should be w/r/t when single-process multi-chain CmdStan is used/not used and whether or not we need to add a new argument to the compile and sample commands.
Description:
All threading is handled by the TBB scheduler. In order to take advantage of multi-threading, models must be compiled with C++ compiler option STAN_THREADS=true.
Since CmdStan 2.27, the compiled executable has option info which will report whether or not it was compiled with STAN_THREADS=true. See stan-dev/cmdstan#1010
The CmdStan Guide provides the following documentation for num_chains and num_threads
Prior to 2.28, in order to run multiple chains (highly recommended) directly in CmdStan, users were directed to use bash shell for loops, or else use any of CmdStanR, CmdStanPy, etc. interfaces.
From the current CmdStanPy documentation:
CmdStanPy's sample method arguments:
Additional Information:
see discussion in issue filed on CmdStanR: https://github.com/stan-dev/cmdstanr/issues/534
Current Version: