[FR] Use new num_chains arg in cmdstan

mitzimorris commented 3 years ago

Summary:

Update both the CmdStanPy APIs and documentation to allow users to use CmdStan's new single-process multi-chain processing, introduced in 2.28.

As of 2.28, CmdStan provides two kinds of parallel processing:

Running multiple sampler chains in a single CmdStan process (introduced in 2.28),
Within-chain parallelization using Stan functions reduce_sum and map_rect (since 2.23 and 2.18, respectively)

The CmdStan arguments num_chains and num_threads control the amount of parallelization.

num_chains specifies the number of chains, equivalent to CmdStanX sampler arg chains
num_threads specifies the total number of threads available to the TBB scheduler across all chains.
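To make the two arguments concrete, here is a minimal sketch of how a wrapper might assemble a single-process, multi-chain command line. The helper name `build_sample_command` is hypothetical, and the argument order follows the pattern in the CmdStan Guide as best I recall it (method args, then `data`, then `output`, then `num_threads`), so check it against your CmdStan version.

```python
from typing import List, Optional


def build_sample_command(
    exe: str,
    num_chains: int,
    num_threads: int,
    data_file: Optional[str] = None,
    output_file: Optional[str] = None,
) -> List[str]:
    """Assemble an argv list for one CmdStan process running several chains.

    Hypothetical helper for illustration; not part of CmdStanPy.
    """
    if num_chains < 1 or num_threads < 1:
        raise ValueError("num_chains and num_threads must be positive")
    cmd = [exe, "sample", f"num_chains={num_chains}"]
    if data_file is not None:
        cmd += ["data", f"file={data_file}"]
    if output_file is not None:
        cmd += ["output", f"file={output_file}"]
    # num_threads is the total thread budget shared by all chains.
    cmd.append(f"num_threads={num_threads}")
    return cmd
```

The resulting list could be handed to `subprocess.run`, giving one subprocess for all chains rather than one per chain.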

The CmdStanX interfaces use three arguments altogether: chains, parallel_chains, and threads_per_chain, which can be mapped onto num_chains and num_threads.

However, we need to decide what the default should be with respect to when single-process multi-chain CmdStan is used or not used, and whether we need to add a new argument to the compile and sample commands.

Description:

All threading is handled by the TBB scheduler. In order to take advantage of multi-threading, models must be compiled with C++ compiler option STAN_THREADS=true.

Since CmdStan 2.27, the compiled executable has option info which will report whether or not it was compiled with STAN_THREADS=true. See stan-dev/cmdstan#1010
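As a rough sketch of how a wrapper could consume that report: the code below parses text in a `KEY=value` / `KEY = value` layout and checks the `STAN_THREADS` entry. The exact output format of `info` is an assumption here, and the function names are hypothetical.

```python
from typing import Dict


def parse_info(text: str) -> Dict[str, str]:
    """Parse 'key = value' lines (spaces around '=' optional) into a dict."""
    info: Dict[str, str] = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank or unrelated lines
        key, _, value = line.partition("=")
        info[key.strip()] = value.strip()
    return info


def compiled_with_threads(text: str) -> bool:
    """True only if the report explicitly says STAN_THREADS is true."""
    return parse_info(text).get("STAN_THREADS", "false").lower() == "true"
```

The text itself could come from something like `subprocess.run([exe, "info"], capture_output=True, text=True).stdout`; executables built before 2.27 will not support this, so a failed call should probably be treated as "unknown" rather than as an error.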

The CmdStan Guide provides the following documentation for num_chains and num_threads

Prior to 2.28, in order to run multiple chains (highly recommended) directly in CmdStan, users were directed to use bash shell for loops, or else use any of CmdStanR, CmdStanPy, etc. interfaces.

From the current CmdStanPy documentation:

In order to evaluate the fit of the model to the data, it is necessary to run several Monte Carlo chains and compare the set of draws returned by each. By default, the sample command runs 4 sampler chains, i.e., CmdStanPy invokes CmdStan 4 times. CmdStanPy uses Python’s subprocess and multiprocessing libraries to run these chains in separate processes. This processing can be done in parallel, up to the number of processor cores available.

CmdStanPy's sample method arguments:

chains (Optional[int]) – Number of sampler chains, must be a positive integer.
parallel_chains (Optional[int]) – Number of processes to run in parallel. Must be a positive integer. Defaults to multiprocessing.cpu_count().
threads_per_chain (Optional[int]) – The number of threads to use in parallelized sections within an MCMC chain (e.g., when using the Stan functions reduce_sum() or map_rect()). This will only have an effect if the model was compiled with threading support. The total number of threads used will be parallel_chains * threads_per_chain.

Additional Information:

see discussion in issue filed on CmdStanR: https://github.com/stan-dev/cmdstanr/issues/534

Current Version:

WardBrian commented 3 years ago

With 2.28 being released next week (assuming the release candidate is good to go), do we want to push this into 1.0 or should it wait for a 1.1?

mitzimorris commented 3 years ago

push into 1.0 if possible, but then tracking progress becomes somewhat more complicated. doable, though.

WardBrian commented 3 years ago

We can look at this after #461 is where we want it to be and 2.28 is out/on conda. We can always decide at that point to punt it until later

WardBrian commented 3 years ago

I think we should start brainstorming how the interface of this looks and how we will implement it. The first question I have is how do we handle STAN_THREADS - currently our installation/building don't touch make/local at all. It would be great if we could use the rebuild command to both enable and (probably less useful but good for consistency) disable this from within python. It requires a little bit of trickery if we want it to be robust even if the user edits the file.

mitzimorris commented 3 years ago

updating this to correct misconceptions:

we don't need to worry about STAN_THREADS

CmdStanModel object instantiation choices:

if exe only or exe newer than Stan file, try to call info method on model - if that succeeds and STAN_THREADS=true, record this on the model object - if so, new property or add to cpp_options?
if no exe file, what should compile method do?
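For the first case, the "exe only or exe newer than Stan file" test could look something like the sketch below. The function name `can_query_exe` is hypothetical, and comparing file modification times is my assumption about how "newer" would be decided.

```python
import os
from typing import Optional


def can_query_exe(exe_file: Optional[str], stan_file: Optional[str]) -> bool:
    """Decide whether it makes sense to call `info` on an existing executable.

    True when the exe exists and either there is no Stan file or the exe
    was modified more recently than the Stan file.
    """
    if exe_file is None or not os.path.exists(exe_file):
        return False  # nothing to query; the compile step has to decide
    if stan_file is None or not os.path.exists(stan_file):
        return True  # exe-only model
    return os.path.getmtime(exe_file) > os.path.getmtime(stan_file)
```

When this returns False, the model would need to be (re)compiled, at which point cpp_options already tells us whether STAN_THREADS was requested.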

sample method choices:

should we try to use NUM_CHAINS whenever possible because of increased memory efficiency?
can we avoid changing sample method argument names, and instead handle the different cases correctly?

original comment:

The first question I have is how do we handle STAN_THREADS

My first question as well, but on further reflection, I think that we shouldn't try to tie these things together.

In CmdStan, NUM_CHAINS can also be used to run multiple chains (sequentially), without the need for shell scripting, which is what the CmdStanPy sample method arg chains gives you (although with the option to run chains in parallel).

The CmdStan guide recommends using make/local - https://mc-stan.org/docs/2_28/cmdstan-guide/parallelization.html#compiling

we recommend writing the flag to the make/local file.

I think this recommendation is here because CmdStan is a very simple interface, thus this will provide consistency across compilation and runs for a local installation (and save typing!).

In CmdStanPy, the model object can track this via property https://mc-stan.org/cmdstanpy/api.html#cmdstanpy.CmdStanModel.cpp_options, so I don't think we should worry about make/local.

In CmdStanPy, if a model is compiled during one session, during that session the cpp_options property of the CmdStanModel object will reflect whether or not STAN_THREADS was specified at compile time. In all other cases, we don't know, nor does make/local help, because we don't know what state that file was in when the program was compiled.

@WardBrian @jgabry @rok-cesnovar - thoughts? discuss tomorrow?

WardBrian commented 3 years ago

I was under the impression that STAN_THREADS had to be set at cmdstan build time, not just model compile time. Is that correct?

I’ve talked with @wds15 about this and he said that for single-threaded performance, STAN_THREADS has minimal impact on macOS/Linux but can cause up to a 20% hit on Windows. Setting it true by default is therefore a difficult call, IMO. For Conda builds, I have decided not to set it by default at the moment.

mitzimorris commented 3 years ago

I was under the impression that STAN_THREADS had to be set at cmdstan build time, not just model compile time. Is that correct?

if so, this isn't properly addressed in the 2.28 CmdStan Guide. I wrote the above comments based solely on what's in the CmdStan Guide, thinking that this would reflect how things finally landed. (yes, lazy. about to test on my machine).

WardBrian commented 3 years ago

@Stevebronder @wds15 can you weigh in? I’m not sure where I got this impression

rok-cesnovar commented 3 years ago

I was under the impression that STAN_THREADS had to be set at cmdstan build time, not just model compile time. Is that correct?

This has not been the case since, I believe, 2.25 or 2.26, after https://github.com/stan-dev/cmdstan/pull/882. Essentially, the first time STAN_THREADS is used with a given CmdStan install, main_threads.o is compiled, which takes a bit longer. But after that it's fine, the same as any compilation.

In all other cases, we don't know, nor does make/local help, because we don't know what state that file was in when the program was compiled.

You can now by calling model info. See https://github.com/stan-dev/cmdstan/pull/1010 (this is already part of 2.27). Will respond on the arguments and proposal a bit later or tomorrow, have to think about it a bit more.

mitzimorris commented 3 years ago

what about the pre-compiled header file? cf: https://github.com/stan-dev/cmdstan/blob/e48eae9463ac7a90baee18cfe46ecf5c8491674f/makefile#L124-L125

rok-cesnovar commented 3 years ago

Same thing: https://github.com/stan-dev/cmdstan/blob/develop/makefile#L125

mitzimorris commented 3 years ago

so when model is compiled with STAN_THREADS=true, Makefile logic requires gch file named "stan/src/stan/model/model_header_threads.hpp.gch", and if that doesn't exist, will build it?

update: answered my own question - it all just works ! we don't have to worry about make/local.

ran make STAN_THREADS=true build - that created header file model_header_threads.hpp.gch
ran make examples/bernoulli/bernoulli - that created header file model_header.hpp.gch

mitzimorris commented 3 years ago

I have updated the issue description with all the details from above discussion.

my proposal is this: when the sample method is called, if we can determine that STAN_THREADS=true, then we run single-process multi-threaded using num_chains and num_threads.

this means that we can leave all arguments to the sample method as they already are. under the hood, dealing with the logic is a bit of a PITA, but it always was.

the remaining question is whether or not to compile with STAN_THREADS=true by default. if this doesn't negatively affect performance on any platform, then we could.

the biggest win with this feature is for running models on large datasets; we should call this out in the documentation.

WardBrian commented 3 years ago

I think doing it automatically is a good default setting, but we should still provide an argument to explicitly enable/disable this logic (something like stan_threaded=None, with None meaning do the logic, true meaning assume it was compiled with the threads option, false meaning treat as if it wasn't).

WardBrian commented 3 years ago

Compiling with STAN_THREADS on by default can have a performance impact on Windows. Here's what @wds15 told me:

Our tests were done for single-core, single-chain performance tests. Under these circumstances I would say that turning on threading for macos and Linux is really ok. For Windows we look at an up to 20% performance loss. This could have changed by the new g++ 8 from the new rtools and this could look different due to now having to consider running 4 chains in parallel.

Note that Conda ships an older version of g++ than version 8. Without testing on the default settings (e.g. 4 chains) it's difficult to say.

rok-cesnovar commented 3 years ago

when the sample method is called, if we can determine that STAN_THREADS=true, then we run single-process multi-threaded using num_chains and num_threads.

+1 to this. Just pointing out that this currently works only for NUTS with diag_e or dense_e, not for all sampling.
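A guard for that restriction might be sketched as follows. The function name is hypothetical; the strings are the usual CmdStan argument values (`algorithm=hmc engine=nuts metric=...`), and the supported set simply mirrors the statement above rather than anything checked against the CmdStan source.

```python
def supports_single_process(
    algorithm: str = "hmc", engine: str = "nuts", metric: str = "diag_e"
) -> bool:
    """True if this sampler configuration can use num_chains > 1.

    Assumption taken from the comment above: only NUTS with a diag_e or
    dense_e metric is supported.
    """
    return (
        algorithm == "hmc"
        and engine == "nuts"
        and metric in {"diag_e", "dense_e"}
    )
```

Anything that fails this check would fall back to one subprocess per chain.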

the remaining question is whether or not to compile with STAN_THREADS=true by default.

I am opposed to going threaded by default. Last time anyone did any performance testing on this it had an effect on non-Mac systems. On Windows it was bad.

But regardless if there is a negative effect or not, this should not be done in CmdStan wrappers or CmdStan even. If there is no penalty for threading, we should enable it in Math by default and get rid of the #ifdef-ed C++ code. If there is a penalty, we should probably also not default to it.

the biggest win with this feature is for running models on large datasets; we should call this out in the documentation.

+1

rok-cesnovar commented 3 years ago

I have a question about naming though. I hate naming things with a passion, but alas we will have to do it :)

The issue comes from the fact that this now changes how resources are divided. Previously we had parallel_chains and threads_per_chain. Parallel chains meant how many chains will try to run in parallel and threads per chain the maximum number of threads used for a chain.

With num_threads and single-process-multichain though, we specify the total number of available threads for both chains and within-chain stuff (maximum number of threads available - not all will always be used though). num_threads=4 might mean 4 chains will each run with 1 thread, or all four threads will be scheduled for one chain if it has a reduce_sum call. All is left to the TBB scheduler to decide.

Do we just say num_threads=parallel_chains*threads_per_chain for now? Do we expose a new threads argument, but what to do with the current ones then?

mitzimorris commented 3 years ago

I hate naming things with a passion, but alas we will have to do it :)

agreed

mitzimorris commented 3 years ago

Do we just say num_threads=parallel_chains*threads_per_chain for now?

this is my preference.
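Spelled out, that mapping could look like the sketch below. The function name is hypothetical, and capping parallel_chains at chains is my own assumption (more parallel slots than chains shouldn't inflate the thread budget).

```python
from typing import Tuple


def to_cmdstan_args(
    chains: int, parallel_chains: int, threads_per_chain: int = 1
) -> Tuple[int, int]:
    """Map wrapper-level arguments to CmdStan's (num_chains, num_threads)."""
    for name, value in (
        ("chains", chains),
        ("parallel_chains", parallel_chains),
        ("threads_per_chain", threads_per_chain),
    ):
        if value < 1:
            raise ValueError(f"{name} must be a positive integer")
    # Assumption: never count more parallel slots than there are chains.
    parallel_chains = min(parallel_chains, chains)
    num_threads = parallel_chains * threads_per_chain
    return chains, num_threads
```

For example, chains=4, parallel_chains=2, threads_per_chain=3 gives num_chains=4 and num_threads=6; how those 6 threads are shared between chains is then up to the TBB scheduler.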

mitzimorris commented 3 years ago

I think doing it automatically is a good default setting, but we should still provide an argument to explicitly enable/disable this logic (something like stan_threaded=None, with None meaning do the logic, true meaning assume it was compiled with the threads option, false meaning treat as if it wasn't)

this seems like a good idea. again, the name issue is seriously problematic. for the wrapper interfaces, the difference is between spawning a single subprocess or per-chain subprocess. therefore: per-chain-process? because if the model uses map_rect or reduce_sum it would still be threaded.

WardBrian commented 3 years ago

Interface:

force_one_process_per_chain is a boolean flag for sample() with a default of None/Null. None/Null means we will use info to determine whether STAN_THREADS was enabled. If it is false, always assume STAN_THREADS/2.28 (single-process) behavior. If it is true, always use the current (pre-2.28) behavior.
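The three-way flag boils down to something like this sketch. The helper name `use_single_process` and the `exe_has_threads` parameter (standing in for whatever the info call reports) are hypothetical.

```python
from typing import Optional


def use_single_process(
    force_one_process_per_chain: Optional[bool], exe_has_threads: bool
) -> bool:
    """Decide between one process for all chains and one process per chain.

    None  -> defer to what `info` reported about STAN_THREADS
    False -> assume STAN_THREADS / 2.28 behavior (single process)
    True  -> keep the pre-2.28 behavior (one process per chain)
    """
    if force_one_process_per_chain is None:
        return exe_has_threads
    return not force_one_process_per_chain
```

Passing False against an executable that was not built with STAN_THREADS would presumably fail at run time, so we may want a clear error message for that case.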

Docs:

Vignette/notebook showing how to use both ways, explaining how you can pass STAN_THREADS in cpp_options or put it in make/local to enable it always.

Parity:

The files that we output should match how cmdstan does it, even if we are manually running multiple chains

mitzimorris commented 3 years ago

@WardBrian - just pushed branch https://github.com/stan-dev/cmdstanpy/compare/feature/436-num-chains?expand=1
