Closed: mitzimorris closed this issue 5 years ago
A neat usage on top of this is continuing a chain. In HPC situations, a 24 hour wall time is typical and continuation of a chain would be enormously helpful (especially on slower CPUs like Xeon Phi).
continuing a chain once it's reached convergence?
This is where we need to watch out for argument dependencies. Even if no initial metric is specified, there needs to be a way to specify whether to use a diagonal or dense version. Then that specification needs to be consistent with the specification of the metric. The services pull these options into two different functions (dense_e and diag_e for dense and diagonal Euclidean).
The other issue with allowing the argument to be either is that it should naturally be a strictly positive, finite vector for the diagonal case and positive definite matrix for the dense case. Are these the same type in Python? If not, how do we deal with what's essentially a variant type argument?
One thing we could do is not have a separate argument to indicate diagonal or dense, but rather always require an initial metric and just give it a default value of a vector of all 1s. Then, to get our current default for the dense case, the user would have to specify a unit matrix as the argument.
Then, I think we can get away without having adapt/no-adapt cases. This is important for restarting, as @maedoc points out. The no-adapt cases will just use whatever the initial metric is (defaults to unit, but can be specified as diagonal or dense). The only thing this doesn't allow is no-adapt with warmup (throwing away an initial segment of chains). That could perhaps be handled separately in post-processing rather than as part of the nuts function.
Just to be more clear, what I'm proposing is

a = vector of 1 values
b = unit matrix
a2 = user-supplied vector
b2 = user-supplied matrix

model.sample(metric = ..., ...)

where the metric could be any of these four things, with a default to the vector of 1s. Presumably there's an easy way in NumPy to create a unit matrix and a vector of 1 values so the user doesn't have to build them up manually.
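Sketched in NumPy, the single-metric-argument proposal might look like this (choose_metric is a hypothetical helper for illustration, not part of any interface):

```python
import numpy as np

def choose_metric(metric):
    # Hypothetical helper sketching the proposal above: a 1-D array
    # selects the diagonal (diag_e) metric, a square 2-D array the
    # dense (dense_e) metric; anything else is rejected.
    m = np.asarray(metric, dtype=float)
    if m.ndim == 1:
        return "diag_e", m
    if m.ndim == 2 and m.shape[0] == m.shape[1]:
        return "dense_e", m
    raise ValueError("metric must be a vector or a square matrix")

# The two defaults the user would otherwise build by hand:
N = 3
choose_metric(np.ones(N))      # selects the diagonal metric
choose_metric(np.identity(N))  # selects the dense metric
```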
Once the adaptation parameters have converged (which presupposes arrival in the typical set in order to estimate the metric), we can run more sampling iterations from where we left off (position, metric, step size) to get larger effective sample sizes.
If adaptation hasn't converged, we can take the final state of the metric and step size after adaptation and restart adaptation at that point. Any sampling attempted when adaptation hasn't converged is likely to be unreliable, so we'd throw that out.
On May 17, 2019, at 7:49 AM, Mitzi Morris notifications@github.com wrote:
continuing a chain once it's reached convergence?
— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.
finite vector for the diagonal case and positive definite matrix for the dense case. Are these the same type in Python? If not, how do we deal with what's essentially a variant type argument?
NumPy arrays don't distinguish among those except for 1d vs 2d, so some validation would be required for positive definiteness, but defaults are straightforward (np.ones(N) or np.identity(N)).
Once the adaptation parameters have converged
I was curious if the adaptation schedules in CmdStan are maintained in some resource file or hard coded in C++? It would be useful to reuse those schedules from Python e.g. for difficult problems or circumventing walltime restrictions.
the adaptation schedules are available as CmdStan parameters init_buffer, window, and term_buffer - these got overlooked when I wrote up the spec - I don't know why, as they're clearly arguments that are passed all the way through to the NUTS sampler - my bad.
I was suggesting the following factoring of the sampling config (these are not proposals for names, just to identify the pieces):
warmup_iter: # warmup iterations
sampling_iter: # sampling iterations
(init_buffer_frac, window_frac, term_buffer_frac): values between 0 and 1 summing to less than 1.
Then we set, e.g.,
init_buffer = init_buffer_frac * warmup_iter
This is so that the arguments don't depend on each other. Otherwise, we'd have to make sure that init_buffer + window + term_buffer < warmup_iter.
I think it makes sense to scale up the buffers and windows with warmup size, but I could see someone wanting to fix that initial fast buffer size or something rather than treating it as a proportion.
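A minimal sketch of that factoring (adapt_schedule is a hypothetical helper; the default fractions are chosen so that at 1000 warmup iterations they reproduce CmdStan's defaults of init_buffer=75 and term_buffer=50, with the slow phase taking the remainder - note this "slow phase" total is not the same thing as CmdStan's window argument, which is the size of the first expanding window):

```python
def adapt_schedule(warmup_iter, init_frac=0.075, term_frac=0.05):
    # Hypothetical helper: convert phase fractions into iteration counts,
    # so the arguments never have to be checked against each other.
    if init_frac < 0 or term_frac < 0 or init_frac + term_frac >= 1:
        raise ValueError("fractions must be nonnegative and sum to < 1")
    init_buffer = round(init_frac * warmup_iter)
    term_buffer = round(term_frac * warmup_iter)
    slow_phase = warmup_iter - init_buffer - term_buffer
    return init_buffer, slow_phase, term_buffer

adapt_schedule(1000)  # (75, 875, 50)
```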
how about reworking the arguments to the sample function as follows:

- model
- chains - default 4 chains (current good practice recommendation)
- cores - default 1
- random_seed - (currently seed) - allow single value or list of per-chain seeds
- chain_ids - list of positive integer ids
- data - either a Dict or str which is the name of a JSON or Rdump file
- init_params - initial param values - allows types: Dict, float, str, or function
- max_treedepth - currently treedepth
- metric - only allows values 'diag_e' or 'dense_e' - the equivalent of 'unit_e' requires a supplied vector of 1s and no metric adaptation
- metric_file - either single filename or list of filenames - shape of metric values must match metric type
- stepsize - either single stepsize or list of per-chain stepsizes
- (don't expose CmdStan arg "stepsize_jitter" - see stan-dev/stan#2183: "Random jittering is not effective at getting the sampler through regions of high-curvature")
- warmup_iter - CmdStan arg "num_warmup"
- sampling_iter - CmdStan arg "num_sample" - using term 'iter' instead of 'draw' to avoid confusion with terms "sample" and "draw". by removing the "thin" arg (i.e., thin=1), sampling iterations == num draws
- (don't expose CmdStan arg "thin") - HMC is efficient, don't need to thin (unlike Gibbs samplers)
- save_warmup (?) - when is this useful?
- adaptation arguments - flatten
- adapt_engaged - boolean - adaptation requires warmup_iters > 0 - under some circumstances may want adapt_engaged = False, warmup_iters > 0
- adapt_delta - CmdStan "delta"
- (not exposing adapt_gamma, adapt_kappa, adapt_t0 - no guidance available on non-default settings)
- adapt_init - CmdStan arg "init_buffer" - phase I of adaptation - find typical set - adjusts stepsize (fast) (default 75 iters)
- adapt_metric - CmdStan arg "window" - phase II of adaptation - iteratively tune metric and stepsize - default interval starts at 25 iters, keeps doubling (should be 75% of total warmup, i.e. at least 500 iterations. CmdStan guide diagram corresponds to 1000 warmup iters)
- adapt_stepsize - CmdStan arg "term_buffer" - phase III of adaptation - re-adjust step-size (fast) (50)
- output_file - string used to determine name of csv files as well as diagnostic files
- refresh - better name? - "-1" == "silent" mode for batch processing - ideally use progressbar, spinner or otherwise monitor csv output file

I like the suggested argument names and also have some suggestions that might be interesting: 1) 'max_treedepth' could also be 'treedepth_cap'; 2) init_params seems overloaded - what about 'init_values' and 'init_generator' as separate default parameters (the latter could be the hypercube by default)?; 3) 'metric' could be 'metric_type' and we could call the options 'diagonal' and 'dense' because that seems to be what people call them (almost nobody says '_e' afaict); 4) I always run with 'save_warmup' so I can catch failing models before they waste too much time - it's not hard to diagnose failing models in the first hundred iterations. I think both this argument and thin should be the output handler's job. Ditto for refresh. I guess that's going against the "flat" thing, but in my defense it could be a well-documented default helper function...
So basically just suggestions, this looks great.
thanks @sakrejda! going to take suggestions 2 & 3. agreed that we need better output handling. I think for now we have to make do (but I'm a pragmatic bastard).
Should these follow CmdStan keywords?
yes, as much as possible - which is an argument for metric instead of metric_type. also it's shorter.
it's a tradeoff - we want names that are terse, informative, and as much as possible consistent with what's already being used across interfaces.
the thing that always trips me up is the difference between RStan and CmdStan - CmdStan has num_warmup and num_samples (sic), while RStan has iter, which includes warmup, plus a warmup param used to override the default allocation of half of iter to warmup and half to sampling.
so CmdStanPy will go with CmdStan in that you don't specify total iterations, but with better names.
it's bad practice to be changing the API, but I got a bunch of things wrong and am trying to correct them now - your input most valued!
PyStan follows RStan API, but RStan3, PyStan3 (and CmdStan3) should follow the same API (which is very similar to CmdStan API).
so you're OK with warmup_iter and sampling_iter?
Yes, they sound fine.
Docstring should have a section listing all the possibilities (unless they are parameters, which means they should be described anyway.)
That is something I think is missing in CmdStan interface (fast docs for parameters; I usually need to google for the CmdStan doc pdf)
right - will add docstring.
regarding CmdStan and the eventual CmdStan3 - the plan is to go to a flat argument structure - the current argument parser is really confusing. there's an issue on CmdStan to convert the CmdStan guide from LaTeX to bookdown and put it online - https://github.com/stan-dev/cmdstan/issues/657
On Jun 13, 2019, at 11:16 PM, Mitzi Morris notifications@github.com wrote:
how about reworking the arguments to the sample function as follows:
I would generally recommend the shortest clear name.
• model
• chains - default 4 chains (current good practice recommendation)
• cores - default 1
• random_seed - (currently seed) - allow single value or list of per-chain seeds
I prefer just "seed."
• chain_ids - list of positive integer ids
• data - either a Dict or str which is the name of a JSON or Rdump file
• init_params - initial param values - allows types: Dict, float, str, or function
I like "inits"
• max_treedepth - currently treedepth
This is better than treedepth.
• metric - only allows values 'diag_e' or 'dense_e' - the equivalent of 'unit_e' requires supplied vector of 1s and no metric adaptation
• metric_file - either single filename or list of filenames - shape of metric values must match metric type
Can't we just have this be another variant, where the metric can be one of those two strings or a file name? I'd use "file name" or "path" over "string" to indicate the intended type.
• stepsize - either single stepsize or list of per-chain stepsizes
Shouldn't this be "step_size" to be consistent?
• (don't expose CmdStan arg "stepsize_jitter" - see stan-dev/stan#2183: "Random jittering is not effective at getting the sampler through regions of high-curvature")
Correct. It's useful in pure HMC when you want to avoid cycles.
• warmup_iter - CmdStan arg "num_warmup"
• sampling_iter - CmdStan arg "num_sample" - using term 'iter' instead of 'draw' to avoid confusion with terms "sample" and "draw". by removing "thin" arg (i.e., thin=1), sampling iterations == num draws
• (don't expose CmdStan arg "thin") - HMC is efficient, don't need to thin (unlike Gibbs samplers)
We need to thin for space in some cases, so this should stay.
• save_warmup (?) - when is this useful?
Yes, we want the option to save warmup iterations.
• adaptation arguments - flatten
• adapt_engaged - boolean - adaptation requires warmup_iters > 0 - under some circumstances may want adapt_engaged = False, warmup_iters > 0
I think this is confusing with "warmup_iters".
• adapt_delta - CmdStan "delta"
This should be something like "target_accept_rate" if we want to go with readable names.
• (not exposing adapt_gamma, adapt_kappa, adapt_t0 - no guidance available on non-default settings)
I'm OK with not exposing these.
• adapt_init - CmdStan arg "init_buffer" - phase I of adaptation - find typical set - adjusts stepsize (fast) (default 75 iters)
This is the only one of the controls you're allowing?
• adapt_metric - CmdStan arg "window" - phase II of adaptation - iteratively tune metric and stepsize - default interval starts at 25 iters, keeps doubling (should be 75% of total warmups, i.e. at least 500 iterations. CmdStan guide diagram corresponds to 1000 warmup iters)
This is very confusing, as the name sounds like a flag but it takes a numeric value.
• adapt_stepsize - CmdStan arg "term_buffer" - phase III of adaptation - re-adjust step-size (fast) (50)
Same problem with this one.
The bigger problem I was objecting to is that these all have to be consistent with each other. The sum of the three window sizes has to be less than the number of warmup iterations. That doc needs to be on every single relevant parameter, which I think is four here (the three window (initial) sizes and the total number of warmup iterations).
Raw numbers here means that with increasing iterations, we don't get larger I and III buffers or initial II buffer, just more II iterations (in the language of our existing doc). With percentages these would adjust.
In the past, we've grouped the window sizes into one group and the dual averaging adaptation parameters in another.
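For reference, the phase II doubling behavior discussed above can be sketched as follows. This is an approximation of what stan::mcmc::windowed_adaptation does, not the actual implementation: windows start at `window` iterations and double, with the final window stretched to fill the rest of the slow phase.

```python
def slow_windows(warmup_iter, init_buffer=75, term_buffer=50, window=25):
    # Approximate sketch of the expanding-window schedule: successive
    # windows double in size; a window absorbs the remainder when a
    # doubled successor would no longer fit before the term buffer.
    assert window > 0
    slow = warmup_iter - init_buffer - term_buffer
    sizes, w = [], window
    while slow > 0:
        if 2 * w > slow:   # a doubled next window wouldn't fit:
            w = slow       # stretch this window to the end of the slow phase
        sizes.append(w)
        slow -= w
        w *= 2
    return sizes

slow_windows(1000)  # [25, 50, 100, 200, 500]
```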
• output_file - string used to determine name of csv files as well as diagnostic files
How does this string determine the name of csv and diagnostic files? It needs better doc. You should be aiming at writing the user-facing doc at this point.
• refresh - better name? - "-1" == "silent" mode for batch processing - ideally use progressbar, spinner or otherwise monitor csv output file
Good luck on this one. I would strongly prefer there not be multiple ways to turn things off that interact.
Raw numbers here means that with increasing iterations, we don't get larger I and III buffers or initial II buffer, just more II iterations (in the language of our existing doc). With percentages these would adjust.
OK, in the long run this makes a lot of sense. in the short run we will need to map this onto the way that CmdStan specifies things now - not hard to do.
I went spelunking through the Stan code base - the services layer reflects the arguments used by stan::mcmc::windowed_adaptation, so the interface will need to do the translation from percentages to iterations.
Right---they're non-negative integers in the current interface.
On Jun 14, 2019, at 10:38 AM, Mitzi Morris notifications@github.com wrote:
how about combining parameters init_buffer, window, and term_buffer into a triple called either warmup_schedule or adapt_schedule which specifies the pct of warmup iterations to be spent in phase I ("init_buffer" - fast adaptation, find typical set), phase II ("window" - slow adaptation to find the metric), and phase III ("term_buffer" - fast adaptation to (re)adjust the step_size). The values would have to be between 0 and 1 and their sum must be <= 1.

these could also be specified as three individual parameters - adapt_phase_1, adapt_phase_2, and adapt_phase_3 - and only 2 out of 3 need be specified. these are also specified as percentages and must sum to <= 1.
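The "only 2 out of 3 need be specified" rule could be sketched like this (complete_schedule is a hypothetical helper; it assumes an omitted fraction takes whatever remains of warmup):

```python
def complete_schedule(init=None, window=None, term=None):
    # Hypothetical sketch: given any two phase fractions, derive the
    # third as the remainder; fractions must lie in [0, 1] and the
    # specified ones must sum to at most 1.
    given = [f for f in (init, window, term) if f is not None]
    if len(given) < 2:
        raise ValueError("at least two fractions must be specified")
    if any(not 0 <= f <= 1 for f in given) or sum(given) > 1:
        raise ValueError("fractions must be in [0, 1] and sum to <= 1")
    if len(given) == 2:
        rest = 1.0 - sum(given)
        init = rest if init is None else init
        window = rest if window is None else window
        term = rest if term is None else term
    return init, window, term

complete_schedule(init=0.1, term=0.1)  # phase II gets the remaining 0.8
```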
I like the suggestion of changing delta to target_accept_rate although this will be out of sync with CmdStan diagnostics.
param refresh doesn't make sense in the current implementation, as the console output from each chain is captured by the background process and not sent to the console.
if we implement https://github.com/stan-dev/cmdstanpy/issues/46, then we should add argument "show_progress".
CmdStan diagnostics don't know about adapt_delta in the output. You need to validate inputs on the Python side to get reasonable error messages with your parameter names. CmdStan shouldn't be throwing config errors when called.
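Interface-side validation of that kind might look roughly like this (validate_sample_args and its rules are illustrative sketches, not an actual CmdStanPy function):

```python
def validate_sample_args(**kwargs):
    # Illustrative sketch: check each argument under its CmdStanPy name,
    # so users never see a CmdStan config error for a Python-side mistake.
    checks = {
        "chains": (lambda v: isinstance(v, int) and v > 0,
                   "must be a positive integer"),
        "warmup_iter": (lambda v: isinstance(v, int) and v >= 0,
                        "must be a nonnegative integer"),
        "max_treedepth": (lambda v: isinstance(v, int) and v > 0,
                          "must be a positive integer"),
        "adapt_delta": (lambda v: 0.0 < v < 1.0,
                        "must be between 0 and 1"),
    }
    for name, value in kwargs.items():
        if name in checks:
            ok, msg = checks[name]
            if not ok(value):
                raise ValueError(f"{name} {msg}, got {value!r}")

validate_sample_args(chains=4, adapt_delta=0.8)  # passes silently
```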
On Jun 14, 2019, at 5:47 PM, Mitzi Morris notifications@github.com wrote:
CmdStan's diagnose command mentions adapt delta:
95 of 4000 (2.4%) transitions ended with a divergence. These divergent transitions indicate that HMC is not fully able to explore the posterior distribution. Try rerunning with adapt delta set to a larger value and see if the divergences vanish. If increasing adapt delta towards 1 does not remove the divergences then you will likely need to reparameterize your model.
Got it. That's unfortunate.
I don't know if it'd be better to choose a better name for cmdstanpy and then deal with the mismatch or just stick with "adapt_delta". It just seems a shame to use something so jargony when there's a clear alternative like "target_accept_rate".
On Jun 14, 2019, at 8:17 PM, Mitzi Morris notifications@github.com wrote:
Is there any real reason to change this, before Stan3 lands?
I really like target_accept_rate. it wouldn't be hard to reword output from the diagnose command to make it clear that adapt_delta is the target acceptance rate - I've already made the diagnose command output more verbose - this is just one further tweak.
good names matter.
Could we update CmdStan to output "adapt delta (target acceptance rate)"?
yes, will do. https://github.com/stan-dev/cmdstan/issues/711
Summary:
Sample function should allow HMC sampler argument metric which takes one of two values: diag_e (diagonal Euclidean) or dense_e.

Description:
- add metric attribute to SamplerArgs object - update functions validate and compose_command accordingly
- add metric argument to sample command

Note: if metric is dense_e and metric_file is non-null, then the mass matrix must be a matrix; else if metric_file is non-null, the specified mass matrix must be a vector (i.e., just the diagonal of the mass matrix).