stan-dev / cmdstanpy

CmdStanPy is a lightweight interface to Stan for Python users which provides the necessary objects and functions to compile a Stan program and fit the model to data using CmdStan.
BSD 3-Clause "New" or "Revised" License
150 stars 69 forks source link

add logic to `sample` function to handle dense HMC metric #28

Closed mitzimorris closed 5 years ago

mitzimorris commented 5 years ago

Summary:

Sample function should allow HMC sampler argument metric which takes one of two values: diag_e (diagonal Euclidian) or dense.

Description:

Note: if metric is dense and metric_file is non-null, then mass_matrix must be a matrix; else if metric_file is non-null, specified mass_matrix must be a vector (i.e., just the diagonal of the mass_matrix).

Additional Information:

Provide any additional information here.

Current Version:

maedoc commented 5 years ago

A neat usage on top of this is continuing a chain. In HPC situations, a 24 hour wall time is typical and continuation of a chain would be enormously helpful (especially on slower CPUs like Xeon Phi).

mitzimorris commented 5 years ago

continuing a chain once it's reached convergence?

bob-carpenter commented 5 years ago

This is where we need to watch out for argument dependencies. Even if no initial metric is specified, there needs to be a way to specify whether to use a diagonal or dense version. Then that specification needs to be consistent with the specification of the metric. The services pull these options into two different functions (dense_e and diag_e for dense and diagonal Euclidean).

The other issue with allowing the argument to be either is that it should naturally be a strictly positive, finite vector for the diagonal case and positive definite matrix for the dense case. Are these the same type in Python? If not, how do we deal with what's essentially a variant type argument?

One thing we could do is not have a separate argument to indicate diagonal or dense, but rather always require an initial metric and just give it a default value of a vector of all 1s. Then if you want to get our current default for the dense case, the user would have to specify a unit matrix as the argument.

Then, I think we can get away without having adapt/no-adapt cases. This is important for restarting as @maedoc points out. The no-adapt cases will just use whatever the initial metric is (defaults to unit, but can be specified as diagonal or dense). The only thing this doesn't allow, is no adapt with warmup (throwing away initial segment of chains). That could perhaps be handled separately in post-processing rather than as part of the nuts function.

bob-carpenter commented 5 years ago

Just to be more clear, what I'm proposing is

a = vector of 1 values
b = unit matrix
a2 = user-supplied vector
b2 = user-supplied matrix
model.sample(metric = ..., ...)

where the metric could be any of these four things, with a default to the vector of 1s. Presumably there's an easy way in NumPy to create a unit matrix and a vector of 1 values so the user doesn't have to build it up manually.

bob-carpenter commented 5 years ago

Once adaptation parameters has converged (which presupposes arrival in the typical set in order to estimate the metric), we can run more sampling iterations from where we left off (position, metric, step size) to get larger effective sample sizes.

If adaptation hasn't converged, we can take the final state of the metric and step size after adaptation and restart adaptation at that point. Any sampling attempted when adaptation hasn't converged is likely to be unreliable, so we'd throw that out.

On May 17, 2019, at 7:49 AM, Mitzi Morris notifications@github.com wrote:

continuing a chain once it's reached convergence?

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or mute the thread.

maedoc commented 5 years ago

finite vector for the diagonal case and positive definite matrix for the dense case. Are these the same type in Python? If not, how do we deal with what's essentially a variant type argument?

NumPy arrays don't distinguish among those exception for 1d vs 2d, so some validation would be required for pos def, but defaults are straightforward (np.ones(N) or np.identity(N)).

Once adaptation parameters has converged

I was curious if the adaptation schedules in CmdStan are maintained in some resource file or hard coded in C++? It would be useful to reuse those schedules from Python e.g. for difficult problems or circumventing walltime restrictions.

mitzimorris commented 5 years ago

the adaptation schedules are available as CmdStan parameters init_buffer, window, and term_buffer - these got overlooked when I wrote up the spec - I don't know why, as they're clearly arguments that are passed all the way through to the NUTS sampler - my bad.

bob-carpenter commented 5 years ago

I was suggesting the following factoring of the sampling config (these are not proposals for names, just to identify the pieces):

warmup_iter: # warmup iterations sampling_iter: # sampling iterations (init_buffer_frac, window_frac, term_buffer_frac): values between 0 and 1 summing to less than 1.

Then we set, e.g.,

init_buffer = init_buffer_frac * warmup_iter

This is so that the arguments don't depend on each other. Otherwise, we'd have to make sure that init_buffer + window_frac + term_buffer_frac < warmup.

I think it makes sense to scale up the buffers and windows with warmup size, but I could see someone wanting to fix that initial fast buffer size or something rather than treating it as a proportion.

mitzimorris commented 5 years ago

how about reworking the arguments to the sample function as follows:

sakrejda commented 5 years ago

I like the suggested argument names and also have some suggestions that might be interesting: 1) 'max_treedepth' could also be 'treedepth_cap'; 2) init_params seems overloaded, what about 'init_values' and 'init_generator' as separate default parameters (the latter could be the hypercube by default)?; 3) 'metric' could be 'metric_type' and we could call the options 'diagonal' and 'dense' because that seems to be what people call them (almost nobody says '_e' afaict); 4) I always run with 'save_warmup' so I can catch failing models before they waste too much time, it's not hard to diagnose failing models in the first hundred iterations, I think both this argument and thin should be the output handler's job. Ditto for refresh. I guess that's going against the "flat" thing but in my defense it could be a well documented default helper function...

So basically just suggestions, this looks great.

mitzimorris commented 5 years ago

thanks @sakrejda! going to take suggestions 2 & 3. agreed that we need better output handling. I think for now we have to make do, (but I'm a pragmatical bastard).

ahartikainen commented 5 years ago

Should these follow CmdStan keywords?

mitzimorris commented 5 years ago

yes, as much as possible - which is an argument for metric instead of metric_type. also it's shorter.

it's a tradeoff - we want names that are terse, informative, and as much as possible consistent with what's already being used across interfaces.

the thing that always trips me up is the difference between RStan and CmdStan - CmdStan has num_warmup and num_samples (sic) and RStan has iter which includes warmup and param warmup use to override default allocation of 1/2 iter as warmup, sampling iterations respectively.

so CmdStanPy will go with CmdStan in that you don't specify total iterations, but with better names.

it's bad practice to be changing the API, but I got a bunch of things wrong and am trying to correct them now - your input most valued!

ahartikainen commented 5 years ago

PyStan follows RStan API, but RStan3, PyStan3 (and CmdStan3) should follow the same API (which is very similar to CmdStan API).

mitzimorris commented 5 years ago

so you're OK with warmup_iter and sampling_iter ?

ahartikainen commented 5 years ago

Yes, they sound fine.

Docstring should have a section listing all the possibilities (unless they are parameters, which means they should be described anyway.)

That is something I think is missing in CmdStan interface (fast docs for parameters; I usually need to google for the CmdStan doc pdf)

mitzimorris commented 5 years ago

right - will add docstring.

regarding CmdStan and eventual CmdStan3 - plan is to go to flat argument structure - current argument parser is really confusing. there's an issue on CmdStan to convert the CmdStan guide from latex to bookdown and put it online - https://github.com/stan-dev/cmdstan/issues/657

bob-carpenter commented 5 years ago

On Jun 13, 2019, at 11:16 PM, Mitzi Morris notifications@github.com wrote:

how about reworking the arguments to the sample function as follows:

I would generally recommend the shortest clear name.

• model • chains - default 4 chains (current good practice recommendation) • cores - default 1 • random_seed - (currently seed) - allow single value or list of per-chain seeds

I prefer just "seed."

• chain_ids - list of positive integer ids • data - either a Dict or str which is the name of a JSON or Rdump file • init_params - initial param values - allows types: Dict, float, str, or function

I like "inits"

• max_treedepth - currently treedepth

This is better than treedepth.

• metric - only allows values 'diag_e' or 'dense_e' - the equivalent of 'unit_e' requires supplied vector of 1s and no metric adaptation • metric_file - either single filename or list of filenames - shape of metric values must match metric type

Can't we just have this be another variant, where the metric can be one of those two stringsor a file name? I'd use "file name" or "path" over "string" to indicate the intended type.

• stepsize - either single stepsize or list of per-chain stepsizes

Shouldn't this be "step_size" to be consistent?

• (don't expose CmdStan arg "stepsize_jitter" - see stan-dev/stan#2183: "Random jittering is not effective at getting the sampler through regions of high-curvature")

Correct. It's useful in pure HMC when you want to avoid cycles.

• warmup_iter - CmdStan arg "num_warmup" • sampling_iter - CmdStan arg "num_sample" - using term 'iter' instead of draw to avoid confusion with terms "sample" and "draw". by removing "thin" arg (i.e., thin=1), sampling iterations == num draws

• (don't expose CmdStan arg "thin") - HMC is efficient, don't need to thin (unlike Gibbs samplers)

We need to thin for space in some cases, so this should stay.

• save_warmup (?) - when is this useful?

Yes, we want the option to save warmup iterations.

• adaptation arguments - flatten • adapt_engaged - boolean - adaptation requires warmup_iters > 0 - under some circumstances may want adapt_engaged = False, warmup_iters > 0

I think this is confusing with "warmup_iters".

  • adapt_delta - CmdStan "delta"

This should be something like "target_accept_rate" if we want to go with readable names.

  • (not exposing adapt_gamma, adapt_kappa, adapt_t0 - no guidance available on non-default settings)

I'm OK with not exposing these.

  • adapt_init - CmdStan arg "init_buffer" - phase I of adaptation -find typical set - adjusts stepsize (fast) (default 75 iters)

This is the only one of the controls you're allowing?

  • adapt_metric  - CmdStan arg "window" - phase II of adaptation - iteratively tune metric and stepsize - default interval starts at 25 iters, keeps doubling (should be 75% of total warmups, i.e. at least 500 iterations. CmdStan guide diagram corresponds to 1000 warmup iters)

This is very confusing, as the name sounds like a flag but it takes a numeric value.

  • adapt_stepsize - CmdStan arg "term_buffer" - phase III of adaptation- re-adjust step-size (fast) (50)

Same problem with this one.

The bigger problem I was objecting to is that these all have to be consistent with each other. The sum of the three window sizes has to be less than the number of warmup iterations. That doc needs to be on every single relevant parameter, which I think is four here (the three window (initial) sizes and the total number of warmup iterations).

Raw numbers here means that with increasing iterations, we don't get larger I and III buffers or initial II buffer, just more II iterations (in the language of our existing doc). With percentages these would adjust.

In the past, we've grouped the window sizes into one group and the dual averaging adaptation parameters in another.

• output_file - string used to determine name of csv files as well as diagnostic files

How does this string determine the name of csv and diagnostic files? It needs better doc. You should be aiming at writing the user-facing doc at this point.

• refresh - better name? - "-1" == "silent" mode for batch processing - ideally use progressbar, spinner or otherwise monitor csv output file

Good luck on this one. I would strongly prefer there not be multiple ways to turn things off that interact.

mitzimorris commented 5 years ago

Raw numbers here means that with increasing iterations, we don't get larger I and III buffers or initial II buffer, just more II iterations (in the language of our existing doc). With percentages these would adjust.

OK, in the long run this makes a lot of sense. in the short run we will need to map this onto the way that CmdStan specifies things now - not hard to do.

I went spelunking through the Stan code base - the services layer reflects the arguments used by stan::mcmc::windowed_adaptation, so the interface will need to do the translation from percentages to iterations.

bob-carpenter commented 5 years ago

Right---they're non-negative integers in the current interface.

On Jun 14, 2019, at 10:38 AM, Mitzi Morris notifications@github.com wrote:

Raw numbers here means that with increasing iterations, we don't get larger I and III buffers or initial II buffer, just more II iterations (in the language of our existing doc). With percentages these would adjust.

OK, in the long run this makes a lot of sense. in the short run we will need to map this onto the way that CmdStan specifies things now - not hard to do.

I went spelunking through the Stan code base - the services layer reflects the arguments used by stan::mcmc::windowed_adaptation, so the interface will need to do the translation from percentages to iterations.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mitzimorris commented 5 years ago

how about combining parameters init_buffer, window, and term_buffer into a triple called either warmup_schedule or adapt_schedule which specifies the pct of warmup iterations to be spent in phase I ("init_buffer" - fast adaptation, find typical set), phase II ("window" - slow adaptation to find the metric), and phase III ("term_buffer" - fast adaptation to (re)adjust the step_size). The values would have to be between 0:1 and sum must be <= 1.

these could also be specified as three individual parameters - adapt_phase_1, adapt_phase_2, and adapt_phase_3 and only 2 out of 3 need be specified. these are also specified as percentages and must sum to <= 1.

mitzimorris commented 5 years ago

I like the suggestion of changing delta to target_accept_rate although this will be out of sync with CmdStan diagnostics.

mitzimorris commented 5 years ago

param refresh doesn't make sense in the current implementation as the console output from each chain is captured by the background process and not sent to the console.

if we implement https://github.com/stan-dev/cmdstanpy/issues/46, then we should add argument "show_progress".

bob-carpenter commented 5 years ago

CmdStan diagnostics don't know about adapt_delta in the output. You need to validate inputs on the Python side to get reasonable error messages with your parameter names. CmdStan shouldn't be throwing config errors when called.

On Jun 14, 2019, at 5:47 PM, Mitzi Morris notifications@github.com wrote:

I like the suggestion of changing "delta" to "target_accept_rate" although this will be out of sync with CmdStan diagnostics.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

mitzimorris commented 5 years ago

CmdStan's diagnose command mentions adapt delta:

95 of 4000 (2.4%) transitions ended with a divergence. These divergent transitions indicate that HMC is not fully able to explore the posterior distribution. Try rerunning with adapt delta set to a larger value and see if the divergences vanish. If increasing adapt delta towards 1 does not remove the divergences then you will likely need to reparameterize your model.

bob-carpenter commented 5 years ago

Got it. That's unfortunate.

I don't know if it'd be better to choose a better name for cmdstanpy and then deal with the mismatch or just stick with "adapt_delta". It just seems a shame to use something so jargony when there's a clear alternative like "target_accept_rate".

On Jun 14, 2019, at 8:17 PM, Mitzi Morris notifications@github.com wrote:

CmdStan's diagnose command mentions adapt delta:

95 of 4000 (2.4%) transitions ended with a divergence. These divergent transitions indicate that HMC is not fully able to explore the posterior distribution. Try rerunning with adapt delta set to a larger value and see if the divergences vanish. If increasing adapt delta towards 1 does not remove the divergences then you will likely need to reparameterize your model.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub, or mute the thread.

ahartikainen commented 5 years ago

Is there any real reason to change this, before Stan3 lands?

mitzimorris commented 5 years ago

I really like target_accept_rate. it wouldn't be hard to reword output from diagnose command to make it clear that accept_delta is the target acceptance rate - I've already make the diagnose command output more verbose - this is just one further tweak.

good names matter.

bob-carpenter commented 5 years ago

Could we update CmdStan to output "adapt delta (target acceptance rate)"?

mitzimorris commented 5 years ago

yes, will do. https://github.com/stan-dev/cmdstan/issues/711

mitzimorris commented 5 years ago

https://github.com/stan-dev/cmdstanpy/pull/52