Output API refactor, algorithm-side

sakrejda commented 6 years ago

This issue lists the current outputs, and then a minimal set of signatures (those are the two sections below). After each sub-section there's a list of issues with implementing this, some of which I still need to check against the code.

Currently defined outputs

Services-defined:

double values representing the time taken for $A$ algorithm stages (warmup, sampling in MCMC).

Model-file defined:

$K$ parameter names, provided once (double)
$P$ transformed parameter names, provided once (double)
$Q$ generated quantities, provided once (double and int)

Model-file defined for sampling:

$K$ model ('constrained') parameter values, provided per-iteration, so ultimately $K \times N$, with $N$ iterations.
$K$ algorithm ('unconstrained') parameter values, provided per-iteration so ultimately $K \times N$
$K$ algorithm ('unconstrained') momentum values, provided per-iteration so ultimately $K \times N$
$K$ algorithm ('unconstrained') gradient values, provided per-iteration so ultimately $K \times N$
$P$ transformed parameter values, provided per-iteration, so ultimately $P \times N$, with $N$ iterations.
$Q$ generated quantities, provided per-iteration, so ultimately $Q \times N$, with $N$ iterations.

Model-file defined, for HMC:

$K$ algorithm ('unconstrained') parameter values, provided per iteration if requested so ultimately $K \times N$
$K$ momentum parameter values, provided per iteration if requested so ultimately $K \times N$
$K \times K$ or $1\times K$ mass matrix, provided on request (at most $K\times K \times N$).

Model-file defined for optimization:

$K$ model ('constrained') parameter values, provided once as an estimate
$K$ model ('constrained') parameter values, provided per-iteration if requested
$P$ model transformed parameter values, provided once as an estimate
$P$ model transformed parameter values, provided per-iteration if requested
$Q$ model generated quantities, provided once as an estimate
$Q$ model generated quantities, provided per-iteration if requested
$K \times K$ hessian matrix, provided once at a given point.
$K$ gradient parameter values, provided per iteration if requested so ultimately $K \times N$ ... not sure that we actually have that plumbed r.n.

Model-file defined for ADVI:

$K$ model ('constrained') parameter values, provided once as an estimate
$K$ model ('constrained') parameter values, provided per iteration if requested (so $K\times N$.
$K \times K$ covariance estimate, provided once at a given point.
$K \times K$ covariance estimate, provided per-iteration if requested (no plumbed r.n.)

Model-file defined, for whatever algorithm:

$K \times M$ algorithm parameter values holding state per $N$ algorithm steps.
$K$ gradient parameter values, provided per iteration if requested so ultimately $K \times N$ (for any gradient-based algorithm).

Algorithm-defined for HMC:

iteration, per-iteration
adapted stepsize, double, per-iteration during warmup (fixed in sampling), so $N_w$
treedepth (per-iteration), $N$
acceptance rate, $N$, per-iteration
energy, $N$, per-iteration
log-density, $N$, per-itertaion
number of integrator (leapfrog) steps, $N$, per-iteration
divergence, $N$, per-iteration

Algorithm-defined for BFGS/LBFGS:

iteration
log-density, $N$, per-iteration
$||dx||$, per-iteration
$||grad||$, per-iteration
$\alpha$, per-iteration
$\alpha_0$, per-iteration
number of evals, per-iteration
"notes" <- Huh, should be broken out
termination/convergence messages

Algorithm-defined, for Newton optimization:

... uh, same informational message on initialization failure as MCMC... (?)
iteration,
log-density, $N$
improvement in the log-density, $N$

Algorithm-defined, for ADVI:

iteration, $N$
ELBO, $N$
delta_ELBO_mean, $N$
delta_ELBO_med, $N$
notes, $N$
"MEAN ELBO CONVERGED" (once)
"MEDIAN ELBO CONVERGED" (once)
"MAY BE DIVERGING... INSEPCT ELBO" (once?)
informational messages (once?)
"time in seconds" for iteration, $N$
ELBO, $N$
"eta" (adapted) , once
"drawing a sample of size X from approximate posterior", once
"completed" , once

Signatures a `relay` needs to handle

A relay is composed of writers, it's the only object passed into services to handle output.

On construction the relay needs

std::vector<std::string> parameter_names
std::vector<std::string> transformed_parameter_names
std::vector<std::string> generated_quantity_names_r
std::vector<std::string> generated_quantity_names_i

The model-file-defined required signatures are:

relay.write_parameters(std::vector<double>);
relay.write_unconstrained_parameters(std::vector<double>);
relay.write_transformed_parameters(std::vector<double>);
relay.write_generated_quantities_r(std::vector<double>);
relay.write_generated_quantities_i(std::vector<double>);
relay.write_parameter_gradients(std::vector<double>);

Issues:

Generated quantities could be split to avoid merging integer and double-valued generated quantities but currently they are collapsed early (in model.write_array) so we should evaluate whether it's feasible to undo that in the sampling code.
Currently mcmc_writer.write_sample_params collapses the algorithm-defined parameters and the model-file-defined parameters (lines 94, 95, 101). For the above signatures we would need to undo that and write the algorithm parameters to a separate relay signature.

Model-file-defined for sampling, so in a sampling-specific sub-class

relay.write_parameter_momenta(std::vector<double>);

Model-file-defined for optimization, so in a optimization-specific sub-class

relay.write_hessian(Eigen::Matrixd hessian) (CHECK: is this generated or just rstan? Q: should we generate this?)

Model-file-defined for ADVI, ADVI-specific sub-class

relay.write_covariance(Eigen::Matrixd hessian)

Algorithm-related:

relay.write_iteration(int iteration);
relay.write_stage(std::string stage); <- this should be, "warmup"/"sampling" for MCMC, could indicate different warmup stages within HMC with adaptation, could be "optimizing"/"converged" for optimization. Since we often want to know the timing information stage-wise the relay class could run a timer and record how long each stage took as well.

Algorithm-defined, HMC:

relay.write_log_density(double);
relay.write_stepsize(double)
relay.write_tree_depth(int);
relay.write_acceptance_rate(double);
relay.write_energy(double);
relay.write_n_leapfrog(int);
relay.write_divergence(bool);
relay.write_mass_matrix(Eigen::Vectord mm); <- @mitzimorris
relay.write_mass_matrix(Eigen::Matrixd mm);<- @mitzimorris
relay.write_trajectory(???); <- @martinmodrak

Issue: These are actually grouped in various ways.

Algorithm-defined, BFGS/LBFGS:

relay.write_log_density(double);
relay.write_dx_magnitude(double);
relay.write_grad_magnitude(double);
relay.write_alpha(double);
relay.write_alpha_0(double);
relay.write_n_evals(double);

Issue: These are actually grouped in various ways.

Algorithm-defined, BFGS/LBFGS, CHECK THESE:

"notes" <- Huh, should be broken out
termination/convergence messages

Algorithm-defined, Newton:

relay.write_log_density(double);
relay.write_delta_log_density(double);

Algorithm-defined ADVI:

relay.write_ELBO(double);
relay.write_delta_ELBO_mean(double);
relay.write_delta_ELBO_med(double);
relay.write_iteration_time(double); // standardize? time_type?
relay.write_eta(double); (called once)

Issue: These are actually grouped in various ways.

Issues:

ADVI could just be cleaned up as a step 1 here.

Incorporating the logger, use paramater packs to make life easier:

relay.debug(...);
relay.info(...);
relay.warn(...);
relay.error(...);
relay.fatal(...);

Grouping within the algorithms:

HMC:

stan/mcmc/sample.hpp has: const Eigen::VectorXd& q, double log_prob, double stat
Dealing with sampler types ... (?)

Optimization

ADVI

martinmodrak commented 6 years ago

Thanks for writing this down. I however have a different thinking about the way to handle the within-iteration (per step and similar) reporting. In general the design as outlined seems to focus on data sent per iteration, and data sent once (except for the string logger). I don't think that's sufficient - my original use case (more info about divergent transitions) requires sending info only for some iterations. Some extensions I can think of (e.g. "almost divergent transitions") may also need to send multiple records per single iteration. I therefore think the design has to incorporate that there are different sets of data sent at differing frequencies. At least: per-iteration, per-step, per special condition (e.g., divergent transition).

While I'd guess that in practice, storing the full trajectory will likely be used mostly with restarting and for selected iterations, I don't see any advantage in designing it only for this use case. There simply should be a (compile-time or runtime) switch to record per-step information.

I also don't like having a write_trajectory method as that implies that the sampler will need to store the per-step data in some intermediate in-memory storage. I'd rather go with multiple different "streams" of sampler state data with different frequencies, where one will store per-step info.

On the other hand I think that for all of the different writing frequencies we actually want to write a subset of: {iteration number, constrained params, unconstrained params, transformed params, gradients, momenta}, so there is a place for abstraction as write_sampler_state() that is reused per-iteration, per-step and per special conditions. I am not sure if the subset should be the same for all frequencies or if we should allow separate configurations for each but I am slightly in favor of unified configuration. Although we might need to be able to write some additional information for the special "streams" (e.g. direction and tree depth for the per-step info, the difference between the Hamiltonian at the start of the trajectory and at the end when also reporting some sort of "approaching divergence, but not divergent yet" steps for divergence diagnostics).

Also I would add that it might be sensible to also pass transformed data once at the beginning of the algorithm. The structure should IMHO follow similar signatures as the gen. quants. (both are mix of double and int).

sakrejda commented 6 years ago

@martinmodrak If you look at the code we go through contortions to send things at the same time so there definitely is a need to allow sending separately. On the writer end things can be combined but it doesn't make sense to enforce that on the C++ side.

I don't see in your write-up what you think does make sense? I can go look at where this data becomes available but I thought you'd have some specific suggestions.

re: transformed data, sure we could add that but it's not necessary while we're doing a refactor and it follows the same pattern as GQ so I'm punting for now.

martinmodrak commented 6 years ago

there definitely is a need to allow sending separately

I don't really follow your thoughts here. What "sending" are you referring to? Do you think there is an actual advantage in buffering data on the C++ side before sending to writer/interface? Or are you speaking about bundling the various types of data (sampler params, constrained + unconstrained params ,gen quants) together?

In general, I think it makes sense to stream things as they become available (e.g. spit out per-step or per-divergence info immediately in the code that detects them) and only somehow "label" where they belong (per iter, per step, per divergence,. ..) which could be done by calling a specific relay method or additional parameter to fixed relay method. This should IMHO result in the least invasive changes to the sampler code. If there is a desire for buffering, it should be handled within the relay or within the interface-specific implemenation.

sakrejda commented 6 years ago

By "sending" I mean the point in the algorithm C++ code where you write relay.write_some_stuff(stuff);, from reading your second paragraph I think we agree on "send as it becomes available". Can you point me to where you would do this for the per-trajectory info?

martinmodrak commented 6 years ago

For NUTS, I would call the relay from base_nuts::build_tree - both for per-step (trajectory) info and for per-divergence info.

bob-carpenter commented 6 years ago

Transformed data is like GQ and like parameters, but only gets output once, not once per iteration.

What is in scope here? If it's just a refactor, wouldn't that take out any trajectory information?

Even if the first version is just a refactor, we need to make sure the design will be able to accomodate these things (output that's irregular [only some iterations], output that's multiples on one iteration, and output that's once per run).

bob-carpenter commented 6 years ago

I agree. Buffering should be a filter, not built into the API.

On Jun 25, 2018, at 3:00 AM, Martin Modrák notifications@github.com wrote:

there definitely is a need to allow sending separately I don't really follow your thoughts here. What "sending" are you referring to? Do you think there is an actual advantage in buffering data on the C++ side before sending to writer/interface? Or are you speaking about bundling the various types of data (sampler params, constrained + unconstrained params ,gen quants) together?

In general, I think it makes sense to stream things as they become available (e.g. spit out per-step or per-divergence info immediately in the code that detects them) and only somehow "label" where they belong (per iter, per step, per divergence,. ..) which could be done by calling a specific relay method or additional parameter to fixed relay method. This should IMHO result in the least invasive changes to the sampler code. If there is a desire for buffering, it should be handled within the relay or within the interface-specific implemenation.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

sakrejda commented 6 years ago

From @bob-carpenter, re "what is in scope here?" and:

Even if the first version is just a refactor, we need to make sure the design will be able to accomodate these things (output that's irregular [only some iterations], output that's multiples on one iteration, and output that's once per run).

Yes this is the goal. First PR's should be primarily refactor so we can avoid too many interface-side changes, but we need the design to include:

transformed data
buffering as a filter
irregular output

I'll edit the issue to address these things.

betanalpha commented 6 years ago

A few comments, apologies if they mix discussions from this email thread and the corresponding Discourse thread.

C++ API vs Services API

The way we laid out the code the services API was designed to be the only API. In other words, there is no generic C++ API. The contract is that the client provides a model instance and the appropriate callbacks (mostly writers) and calls a given service route and the API takes care of the rest. All of the code not in the services folder is internal to the API. The current difficulty in passing a complete configuration through these service routes makes the API not particularly pleasant to use, but regardless that was the intended design.

Write Refactor/Replacement

From that perspective, the operative question here is whether to replace the writer callbacks with some other form of callback. The proliferation of writers is indeed a mess, but we have to be careful in possible replacements. For example, monolithic callbacks capable of using algorithm-specific information would either have to be aware of the entire API, and hence break the API abstraction. Or we'd have to define callbacks unique to each service route which would be its own maintenance mess.

That said, there is only so much information that the algorithm can pass and there are some relatively natural abstractions. Service routes can supply

Information, warning, or error messages meant to be directed to the user
Configuration information
Initialization information
State information, including the latent unconstrained parameter space, transformed parameters, generated quantities, and algorithmic state.
Termination information.

(1) can be handled with a logger capable of handling different states of output. (2) can be handled with a key-value abstraction. (3) and (5) can be handled with loggers or key-values depending on the specific information. (4) can be handled with a monolithic vector paired with a configuration vector of names and possibly categories for what part of the state a given element belongs (unconstrained, constrained, transformed, generated, algorithm, etc).

Ultimately we don't need to define custom callbacks for each algorithm, nor would be really want to. Instead I think that the proper abstraction here is a communication callback pattern capable of handling text messages with caution level, key-value pairs, and state information. In particular, this pattern would allow algorithms to be added and updated without having to modify the writer callbacks defined in the interfaces, and vice versa, a good indication that it's a step towards the right abstraction.

I played around with a very preliminary version of this a while back but it was never a priority. In any case, it's relatively straightforward to define a callback scheme and put together an exemplar C++ implementation.

On Tue, Jun 26, 2018 at 1:26 PM, Krzysztof Sakrejda < notifications@github.com> wrote:

From @bob-carpenter https://github.com/bob-carpenter, re "what is in scope here?" and:

Even if the first version is just a refactor, we need to make sure the design will be able to accomodate these things (output that's irregular [only some iterations], output that's multiples on one iteration, and output that's once per run).

Yes this is the goal. First PR's should be primarily refactor so we can avoid too many interface-side changes, but we need the design to include:

transformed data

buffering as a filter

irregular output

I'll edit the issue to address these things.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/stan/issues/2557#issuecomment-400273610, or mute the thread https://github.com/notifications/unsubscribe-auth/ABdNlrVtLY2FCxwHMimy8bTFtY7YfSUbks5uAhp1gaJpZM4U1IpD .

bob-carpenter commented 6 years ago

C++ API vs Services API

A few comments, apologies if they mix discussions from this email thread and the corresponding Discourse thread.

C++ API vs Services API

I'm pretty sure everyone agrees that the services API should be the single point of contact for the interfaces.

Even if the sole client of a lower-level C++ API is the services API, it'll still be worth writing the code this way so that the basic functions can be tested. It's a huge pain to test math functions through indirect I/O interfaces, callbacks, etc. That's why I suggested putting some of these functions in the math library, but I've been convinced that because they're primarily estimators whose behavior we might change, they should instead be treated like our algorithms rather than like our math lib.

Write Refactor/Replacement

For example, monolithic callbacks capable of using algorithm-specific information would either have to be aware of the entire API, and hence break the API abstraction. Or we'd have to define callbacks unique to each service route which would be its own maintenance mess.

Exactly my objection to CmdStan and why I wanted to break services into individual functions (I'm not sure if there's a one-one relation between what I'm calling services functions and what Michael's calling a "route").

Each such function will get its own custom writer made up of a few shared components that can have common implementations.

This is also the plan for simplifying the service functions, but that's a separate topic.

That said, there is only so much information that the algorithm can pass and there are some relatively natural abstractions. Service routes can supply

Information, warning, or error messages meant to be directed to the user

Configuration information

Initialization information

State information, including the latent unconstrained parameter space, transformed parameters, generated quantities, and algorithmic state.

Termination information.

(1) can be handled with a logger capable of handling different states of output.

I think everyone agrees about this. And that this is the thing where we won't need to worry about backward compatibility.

(2) can be handled with a key-value abstraction.

Yes, but it's non-trivial in C++. Are there types or is everything a string? Do we use variant types to make everything type safe or some kind of C-style dynamic cast thing like underlying the config for CmdStan?

(3) and (5) can be handled with loggers or key-values depending on the specific information.

I'm not sure what initialization information is. Does that mean things like the initial parameter values? If that is just logging and key-values, then everything is!

(4) can be handled with a monolithic vector paired with a configuration vector of names and possibly categories for what part of the state a given element belongs (unconstrained, constrained, transformed, generated, algorithm, etc).

I couldn't quite follow that. Is the output some kind of tagged vectors, i.e., more key values? Or is it like a database table with headers and streaming? I think this all needs to be handled in such a way that the regular draws can be sent to one place, unconstrained draws to a second place, and diagnostics to a third, along with things like adaptation, which go yet other places.

Ultimately we don't need to define custom callbacks for each algorithm, nor would be really want to.

I agree. There are only a handful of possible callback types after you unpack the complexity in (4). Each algorithm uses a custom combination of simpler loggers and writers so that the interfaces can have clean low-level implementations of these.

Instead I think that the proper abstraction here is a communication callback pattern capable of handling text messages with caution level, key-value pairs, and state information.

That's not very specific, but I think it's in line with Krzysztof's design goals and how Martin's already started coding.

In particular, this pattern would allow algorithms to be added and updated without having to modify the writer callbacks defined in the interfaces, and vice versa, a good indication that it's a step towards the right abstraction.

I don't think we can update without modifying callbacks. If we want to add per-iteration trajectory information, the interfaces have to figure out somewhere to put that. Or are you saying everything just gets buffered into some kind of giant object on the interface side? I'd think things would have be streamed out into different files, for example. That isn't going to happen by magic.

I also think new algorithms are a fine place for interfaces to cobble together the relevant callbacks. This isn't hard work.

Bob

bob-carpenter commented 6 years ago

C++ API vs Services API

I'm pretty sure everyone agrees that the services API should be the single point of contact for the interfaces.

Even if the sole client of a lower-level C++ API is the services API, it'll still be worth writing the code this way so that the basic functions can be tested. It's a huge pain to test math functions through indirect I/O interfaces, callbacks, etc. That's why I suggested putting some of these functions in the math library, but I've been convinced that because they're primarily estimators whose behavior we might change, they should instead be treated like our algorithms rather than like our math lib.

Write Refactor/Replacement

For example, monolithic callbacks capable of using algorithm-specific information would either have to be aware of the entire API, and hence break the API abstraction. Or we'd have to define callbacks unique to each service route which would be its own maintenance mess.

Exactly my objection to CmdStan and why I wanted to break services into individual functions (I'm not sure if there's a one-one relation between what I'm calling services functions and what Michael's calling a "route").

Each such function will get its own custom writer made up of a few shared components that can have common implementations.

This is also the plan for simplifying the service functions, but that's a separate topic.

That said, there is only so much information that the algorithm can pass and there are some relatively natural abstractions. Service routes can supply

Information, warning, or error messages meant to be directed to the user

Configuration information

Initialization information

State information, including the latent unconstrained parameter space, transformed parameters, generated quantities, and algorithmic state.

Termination information.

(1) can be handled with a logger capable of handling different states of output.

I think everyone agrees about this. And that this is the thing where we won't need to worry about backward compatibility.

(2) can be handled with a key-value abstraction.

Yes, but it's non-trivial in C++. Are there types or is everything a string? Do we use variant types to make everything type safe or some kind of C-style dynamic cast thing like underlying the config for CmdStan?

(3) and (5) can be handled with loggers or key-values depending on the specific information.

I'm not sure what initialization information is. Does that mean things like the initial parameter values? If that is just logging and key-values, then everything is!

(4) can be handled with a monolithic vector paired with a configuration vector of names and possibly categories for what part of the state a given element belongs (unconstrained, constrained, transformed, generated, algorithm, etc).

I couldn't quite follow that. Is the output some kind of tagged vectors, i.e., more key values? Or is it like a database table with headers and streaming? I think this all needs to be handled in such a way that the regular draws can be sent to one place, unconstrained draws to a second place, and diagnostics to a third, along with things like adaptation, which go yet other places.

Ultimately we don't need to define custom callbacks for each algorithm, nor would be really want to.

I agree. There are only a handful of possible callback types after you unpack the complexity in (4). Each algorithm uses a custom combination of simpler loggers and writers so that the interfaces can have clean low-level implementations of these.

Instead I think that the proper abstraction here is a communication callback pattern capable of handling text messages with caution level, key-value pairs, and state information.

That's not very specific, but I think it's in line with Krzysztof's design goals and how Martin's already started coding.

In particular, this pattern would allow algorithms to be added and updated without having to modify the writer callbacks defined in the interfaces, and vice versa, a good indication that it's a step towards the right abstraction.

I don't think we can update without modifying callbacks. If we want to add per-iteration trajectory information, the interfaces have to figure out somewhere to put that. Or are you saying everything just gets buffered into some kind of giant object on the interface side? I'd think things would have be streamed out into different files, for example. That isn't going to happen by magic.

I also think new algorithms are a fine place for interfaces to cobble together the relevant callbacks. This isn't hard work.

Bob

sakrejda commented 6 years ago

There actually are multiple layers of internal C++ API in Stan. Even though we haven't made any promises about these it's not like they're going to change anytime soon.

Instead I think that the proper abstraction here is a communication callback pattern capable of handling text messages with caution level, key-value pairs, and state information. In particular, this pattern would allow algorithms to be added and updated without having to modify the writer callbacks defined in the interfaces, and vice versa, a good indication that it's a step towards the right abstraction.

We've discussed this along the way, including Allen's (slightly extreme) suggestion of just making all the I/O JSON (which also allows you to add algorithms without having to modify the writer callbacks in the interfaces). I think it's a good set of abstractions to have internally.

What this approach doesn't communicate is how a user of the API should use the output of the algorithms. This leads to incompatible and rigid dependencies on odd stuff like the exact message formats in the interfaces. While the approach here is a work in progress, the relay classes should be relatively simple and straightforward to modify for new algorithms, internally they also will definitely be constructed from the abstractions you mention (key/value + logger + table, basically).

sakrejda commented 6 years ago

Going through the algorithm code now since I sort of assumed it would make sense to let the algorithm code individually write things (like it does now). It might make more sense to have a relay method that's a friend of algorithm classes for handling the stan::mcmc::sample and stan::mcmc::base_nuts. Not sure what's the right way to go., obv. easier to keep current usage in the algorithms.

sakrejda commented 6 years ago

The central logger thing is being done instead so this is dated now. Closing.

stan-dev / stan