convert CmdStan CSV output to R dump format input

bob-carpenter commented 7 years ago

Moved from https://github.com/stan-dev/stan/issues/544

In order to perform fake data simulation or posterior predictive checking, it would be nice to be able to convert the output of a Stan model from CSV format to the input for a Stan model in R dump format.

This should be structured as a command parallel to bin/print that does the conversion of an output CSV file. An alternative would be to have a model call argument that would produce R dump output.

The manual for CmdStan needs to be updated to show how to use this function. This will enable us to write a chapter in the manual on fake data and posterior predictive checks.

Be careful about the types of the columns: if there are integer generated quantities, the output can be integers.
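A minimal sketch of the type handling, assuming the converter works on the text tokens read from the CSV. The function names `r_dump_value` and `r_dump` are hypothetical, not part of CmdStan. Tokens that parse as integers are written without a decimal point, so they remain valid for `int` declarations while still being readable as reals. Array values are assumed to arrive in column-major order, which is what the `.Dim` form of R dump expects.

```python
def r_dump_value(tokens, dims=()):
    """Format CSV text tokens as the right-hand side of an R dump assignment.

    tokens: values as strings, in column-major order.
    dims:   () for a scalar, (n,) for a vector, (n, m, ...) for an array.
    """
    def fmt(tok):
        try:
            return str(int(tok))      # integer-looking token stays an integer
        except ValueError:
            return repr(float(tok))   # anything else is written as a real

    vals = [fmt(t) for t in tokens]
    if not dims:
        return vals[0]
    body = "c(" + ", ".join(vals) + ")"
    if len(dims) == 1:
        return body
    dim = "c(" + ", ".join(str(d) for d in dims) + ")"
    return f"structure({body}, .Dim = {dim})"


def r_dump(variables):
    """variables maps a name to a (tokens, dims) pair; returns R dump text."""
    lines = [
        f"{name} <- {r_dump_value(tokens, dims)}"
        for name, (tokens, dims) in variables.items()
    ]
    return "\n".join(lines) + "\n"
```

For example, `r_dump({"N": (["10"], ()), "y": (["1", "0"], (2,))})` yields the two lines `N <- 10` and `y <- c(1, 0)`.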

For example, for the Bernoulli model in the introduction, a fake-data generator should look like:

data {
  int<lower=0> N;
  real<lower=0, upper=1> theta;
}
generated quantities {
  // Simulate N Bernoulli outcomes with success probability theta.
  array[N] int<lower=0, upper=1> y;

  for (n in 1:N)
    y[n] = bernoulli_rng(theta);
}

Related issues:

To run this, we need both the output of running the Bernoulli model and a value for N in order to provide input for this model.
Doing proper posterior model generation will require empty parameters and model blocks, so the parser needs to be updated so that this works (or link to a different issue). @betanalpha is working on a feature for this with a dummy sampler that can handle empty parameter vectors.

maedoc commented 7 years ago

I wrote some Python (from scratch, it's not PyStan) to do the conversion, mainly to automate simulation + sampling for the same model when using CmdStan. It could be a useful starting point, and though you wouldn't want me to port it to C++ myself, it only uses NumPy, so it should be quite portable.

bob-carpenter commented 7 years ago

Thanks, @maedoc.

Now that you mention it, RStan must have all the pieces of this implemented because of the way extract() and stan_rdump() work.

sakrejda commented 7 years ago

Sure, but it's all in R code.

rok-cesnovar commented 5 years ago

Is this still relevant? Is a converter from the cmdstan CSV output to R dump still needed? My guess would be no.

bob-carpenter commented 5 years ago

Something like this is needed for restarts, but I think that'd require a new command.

maedoc commented 4 years ago

Is this still relevant? Is a converter from the cmdstan CSV output to R dump still needed? My guess would be no.

A few use cases I've wanted this for:

restarts, e.g. when a model takes longer than the walltime limit on a cluster,
simulating data and then fitting a model to that data,
a multiple-model workflow,
initializing HMC from an optimization.

I usually end up with a mess of grep, cut, tr, nl in Bash for what is a pretty simple job. Two main modes would be:

take 1 line of sampling CSV, convert to R/JSON format
take summary CSV, convert to R/JSON format
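The first mode could be sketched as follows, using only the Python standard library. The function name `draw_to_json` is hypothetical. The sketch assumes the usual CSV layout: comment lines start with `#`, sampler diagnostics have names ending in `__`, and container elements are named like `y.1` or `x.1.2` with 1-based indices. It does not handle newer name forms such as tuples.

```python
import csv
import io
import json


def draw_to_json(csv_text, draw=0):
    """Convert one draw of Stan CSV text to a JSON string of model variables."""
    # Drop comment lines (starting with '#') and blank lines.
    rows = [ln for ln in csv_text.splitlines()
            if ln.strip() and not ln.startswith("#")]
    reader = csv.reader(io.StringIO("\n".join(rows)))
    header = next(reader)
    values = list(reader)[draw]

    out = {}
    for name, tok in zip(header, values):
        if name.endswith("__"):
            continue  # lp__, accept_stat__, etc. are not model variables
        base, *idx = name.split(".")
        # Keep integer-looking tokens as integers, everything else as floats.
        num = int(tok) if tok.lstrip("-").isdigit() else float(tok)
        if not idx:
            out[base] = num
            continue
        # Grow nested lists so that x.i.j lands at out["x"][i-1][j-1].
        node = out.setdefault(base, [])
        last = len(idx) - 1
        for depth, i in enumerate(int(s) - 1 for s in idx):
            while len(node) <= i:
                node.append(None if depth == last else [])
            if depth == last:
                node[i] = num
            else:
                node = node[i]
    return json.dumps(out)
```

The resulting string could be written to a file and passed to CmdStan as `data file=...` or `init=...`, after dropping any variables the target model does not declare.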

It'd also be useful to massage CSV to convert matrices from x.1.2 style columns to 2D ascii matrices for use with GnuPlot or similar, but that's fairly outside scope.
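A rough sketch of that reshaping, assuming the header and one data row have already been split into lists of strings. The function name `matrix_text` is hypothetical. It collects the columns named `name.i.j` and lays them out as whitespace-separated rows, the kind of layout gnuplot can read with its `matrix` keyword.

```python
def matrix_text(header, row, name):
    """Lay out columns name.i.j of one CSV row as whitespace-separated text."""
    cells = {}
    for col, tok in zip(header, row):
        parts = col.split(".")
        if parts[0] == name and len(parts) == 3:
            cells[(int(parts[1]), int(parts[2]))] = tok
    # Indices in the column names are 1-based.
    n_rows = max(i for i, _ in cells)
    n_cols = max(j for _, j in cells)
    return "\n".join(
        " ".join(cells[(i, j)] for j in range(1, n_cols + 1))
        for i in range(1, n_rows + 1)
    ) + "\n"
```

Looking cells up by index rather than by position means the result does not depend on the order in which the CSV lists the matrix elements.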

Is input/output in JSON now part of CmdStan? That seems like the easiest way to go. I could give it a go, since it'd be miles better than the bash equivalent.

rok-cesnovar commented 4 years ago

Once https://github.com/stan-dev/cmdstanr/pull/95 is merged to cmdstanr, you will be able to read the samples and all sampler parameters (divergent, leapfrog, etc.) with read_sample_csv(filenames) in R. It outputs the following list:

list(
    sampling_info 
    inverse_mass_matrix 
    warmup 
    post_warmup 
    warmup_sampler 
    post_warmup_sampler
  )

I think this is close to what you are looking for. You can read existing CmdStan CSV files; there is no need to run the model through cmdstanr if you don't want to.

If you feel more at home with Python then try check_sampler_csv from cmdstanpy. I think it does something similar.

rok-cesnovar commented 4 years ago

Input in JSON has been a part of CmdStan for quite some time; we just made the input a bit faster for the last release. The output is still CSV only, however.

maedoc commented 4 years ago

I'm aware of the R/Py interfaces to CmdStan as well. I was hoping to stick with a plain Bash/Makefile setup, but I think for complex workflows that's just not realistic. Munging data formats on the command line is precarious, especially for matrix/array datatypes.

bob-carpenter commented 4 years ago

On Dec 13, 2019, at 4:39 AM, marmaduke woodman notifications@github.com wrote:

I'm aware of the R/Py interfaces to CmdStan as well,

In case it wasn't clear to our devs not involved in CmdStanPy, the original version was derived from Marmaduke's PyCmdStan package.

rok-cesnovar commented 4 years ago

Oh haha :) Now I feel like a fool :blush:

bob-carpenter commented 4 years ago

Don't feel bad---it's a big project with too much going on for any one person to follow. I'm just trying to close the loops where I see an opportunity.

On Dec 13, 2019, at 10:32 AM, Rok Češnovar notifications@github.com wrote:

Oh haha :) Now I feel like a fool.


stan-dev / cmdstan

convert CmdStan CSV output to R dump format input #511