stan-dev / cmdstanr

CmdStanR: the R interface to CmdStan
https://mc-stan.org/cmdstanr/

Variational inference in cmdstanr missing importance_resampling that is in rstan #362

Open beyondpie opened 3 years ago

beyondpie commented 3 years ago

Under this link: https://mc-stan.org/cmdstanr/reference/model-method-variational.html , there are no parameters corresponding to importance_resampling and keep_every from rstan (https://mc-stan.org/rstan/reference/stanmodel-method-vb.html)?

rok-cesnovar commented 3 years ago

importance_resampling and keep_every are part of post-processing that rstan does with the loo package. See https://github.com/stan-dev/rstan/blob/c88a667015b987440668958d0dcddecdf8fd346c/rstan/rstan/R/stanmodel-class.R#L293

So this is not part of "core" Stan.
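
A hedged sketch of where those arguments live on the rstan side (argument names per the rstan vb() documentation linked in the issue; the model file and data list are placeholders):

```r
# Sketch only: rstan's vb() exposes the post-processing options discussed here.
library(rstan)

sm  <- stan_model("model.stan")        # placeholder model file
fit <- vb(
  sm,
  data = stan_data,                    # placeholder data list
  algorithm = "meanfield",
  importance_resampling = TRUE,        # PSIS-based post-processing of the VI draws
  keep_every = 1                       # thinning applied to the resampled draws
)
```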

beyondpie commented 3 years ago

@rok-cesnovar Thank you so much for showing the original code in rstan!

jgabry commented 3 years ago

If possible maybe what we should do is move most of the code for these features into the loo package itself. Then rstan and cmdstanr could both access it instead of only rstan. @avehtari @rok-cesnovar What do you think?

beyondpie commented 3 years ago

@jgabry That would be great! I'm thinking about using rstan to get the importance sampling feature ... From my current experiments, I notice that sometimes, even when VI converges, the results might not be good (maybe a local minimum), but I don't know how to check or revise the results when the model underfits. I think importance sampling might be a good way to do that?

BTW, do you have any suggestions on using "meanfield" vs. "fullrank" for VI? Currently I just use "meanfield". I'm not sure if "fullrank" is better, or whether it may have issues such as being unstable during optimization?

beyondpie commented 3 years ago

@rok-cesnovar

In order to do the importance sampling, I need the prior probabilities, the likelihoods, and the approximated probabilities (instead of the ELBO). Do you have some idea about how I can get the corresponding values? I'm not sure about the meaning of lp__ and lp_approx__; do they correspond to the likelihood and the approximated probabilities? If so, then I guess I only need to get the prior values by writing them in the generated quantities block in Stan. In this way, I can get the importance sampling purely within cmdstanr, right?

avehtari commented 3 years ago

If possible maybe what we should do is move most of the code for these features into the loo package itself.

cmdstanr is using the posterior package, which has support for weighted draws. See weight_draws() and resample_draws(). The diagnostics (khat/ESS for IS) could also be included in posterior.

It seems that cmdstanr with the variational method is returning the needed lp__ and lp_approx__. So it would be possible to call weight_draws after reading the draws from the CSV, and then also update the summary output to show IS-based khat and ESS instead of MCMC-based Rhat and ESS. Whether resampling is done should be an option, so that it is possible to investigate the non-resampled draws as well.
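
A minimal sketch of that workflow, assuming a cmdstanr variational fit object `fit` and the posterior package (object names are illustrative):

```r
library(posterior)

draws <- as_draws_df(fit$draws())            # draws include lp__ and lp_approx__

# log importance weights: log target density minus log approximation density
log_w <- draws$lp__ - draws$lp_approx__

weighted  <- weight_draws(draws, weights = log_w, log = TRUE)  # attach weights
resampled <- resample_draws(weighted)        # optional: resample using the stored weights
```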

For additional information on why the current ADVI implementation in Stan has variability in performance, see https://arxiv.org/abs/2009.00666

beyondpie commented 3 years ago

@avehtari

Thank you very much for your response! I'm reading the paper you recommended.

It seems that cmdstanr with the variational method is returning the needed lp__ and lp_approx__.

  • In my Stan code, I use the y ~ distr(param) form. From the Stan manual, Stan will only use the likelihood up to an additive constant. So lp__ would be the log of the total probability (including the prior) up to an additive constant, right?
  • What's the meaning of lp_approx__ in variational inference? I didn't find the definition of lp_approx__. Is this just the log of the probabilities from the approximating distribution?
  • If I use the generated quantities block in variational inference, are the values evaluated after the optimization stage (I mean after running the stochastic optimization of the ELBO), using the corresponding sample values I get from VI?

Whether resampling is done should be an option, so that it is possible to investigate the non-resampled draws as well.

I don't understand this sentence. Do you mean I should also investigate the results from VI using MCMC-based Rhat and ESS when I choose not to use the IS-based khat and ESS for the resampled samples?

Thank you very much!

avehtari commented 3 years ago

So lp__ would be the log of the total probability (including the prior) up to an additive constant, right?

Yes, and that is sufficient for the importance sampling used.
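
A small sketch of why the additive constant is harmless: the weights are self-normalized, so any constant added to lp__ cancels (lp__ and lp_approx__ are assumed here to be numeric vectors extracted from the VI draws):

```r
log_w <- lp__ - lp_approx__          # defined only up to an additive constant
w <- exp(log_w - max(log_w))         # subtracting the max also removes the constant
w <- w / sum(w)                      # self-normalization: the constant cancels exactly
```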

What's the meaning of lp_approx__ in variational inference? I didn't find the definition of lp_approx__. Is this just the log of the probabilities from the approximating distribution?

Log densities (of the approximating distribution). See http://proceedings.mlr.press/v80/yao18a.html

If I use the generated quantities block in variational inference, are the values evaluated after the optimization stage (I mean after running the stochastic optimization of the ELBO), using the corresponding sample values I get from VI?

The generated quantities are computed using the draws from the VI approximation.

I don't understand this sentence. Do you mean I should also investigate the results from VI using MCMC-based Rhat and ESS when I choose not to use the IS-based khat and ESS for the resampled samples?

That sentence was directed to the developers of cmdstanr; I hope they do understand it. You can ignore that sentence.

beyondpie commented 3 years ago

@jgabry A quick question: which parameter in VI controls the mini-batch size? From the documentation, VI in Stan is based on stochastic gradient ascent. Do you know the default mini-batch size during optimization?

Thanks!

avehtari commented 3 years ago

A quick question: which parameter in VI controls the mini-batch size? From the documentation, VI in Stan is based on stochastic gradient ascent. Do you know the default mini-batch size during optimization?

No mini-batches. It is unfortunate that the stochasticity in stochastic gradient descent is so strongly associated with mini-batching, while the stochasticity can be due to other reasons, too. In Stan ADVI, the stochasticity is due to Monte Carlo estimation of the gradient (with all the data). There are non-Stan ADVI implementations which have stochasticity coming both from mini-batching and from Monte Carlo estimation of the gradient. Mini-batching assumes a factorizing likelihood, but Stan programs can have non-factorizing likelihoods, and thus it's non-trivial to implement mini-batching.
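
A toy sketch of the distinction (an illustration of reparameterization-based Monte Carlo gradient estimation, not Stan's actual ADVI code; the model and all names are made up):

```r
# Mean-field normal approximation q(theta) = N(mu, exp(omega)^2), fitted by
# stochastic gradient ascent on the ELBO. The full data set is used in every
# step; the stochasticity comes only from the Monte Carlo draws.
set.seed(1)
y <- rnorm(100, mean = 2)                      # all data, used at every step
log_joint <- function(theta) {
  sum(dnorm(y, theta, 1, log = TRUE)) + dnorm(theta, 0, 10, log = TRUE)
}

mu <- 0; omega <- 0
for (step in 1:300) {
  eps   <- rnorm(5)                            # MC draws: the source of noise
  theta <- mu + exp(omega) * eps               # reparameterization trick
  # d log_joint / d theta by central differences (illustration only)
  g <- vapply(theta, function(t)
    (log_joint(t + 1e-4) - log_joint(t - 1e-4)) / 2e-4, numeric(1))
  mu    <- mu    + 0.01 * mean(g)                            # d ELBO / d mu
  omega <- omega + 0.01 * (mean(g * eps * exp(omega)) + 1)   # + entropy term
}
c(mu = mu, sd = exp(omega))   # roughly the posterior mean and sd of theta
```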

rok-cesnovar commented 3 years ago

If I am understanding this correctly, everything required for this is already available in cmdstanr and the only remaining issue is that posterior would show IS based khat and ESS?

avehtari commented 3 years ago

If I am understanding this correctly, everything required for this is already available in cmdstanr and the only remaining issue is that posterior would show IS based khat and ESS?

cmdstanr has what is needed, but it would be useful if cmdstanr would add the metacolumn .log_weight to the posterior object (computed from lp__ and lp_approx__).

Right now posterior has support for weighted draws and resampling, but it doesn't yet have IS diagnostics. We have discussed adding them to posterior so that it could support an appropriate summarize_draws. What to display can be copied from rstan. Ping @paul-buerkner @jgabry.
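
Until such diagnostics exist in posterior, a hedged sketch of computing them with the loo package (the draws_df `draws` with lp__ and lp_approx__ columns is assumed, as in the earlier sketch):

```r
library(loo)

log_ratios <- draws$lp__ - draws$lp_approx__
psis_fit   <- psis(log_ratios, r_eff = 1)    # r_eff = 1: VI draws are independent

psis_fit$diagnostics$pareto_k                # Pareto khat diagnostic
psis_fit$diagnostics$n_eff                   # IS effective sample size
```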

rok-cesnovar commented 3 years ago

Thanks for the clarification!

beyondpie commented 3 years ago

@avehtari @rok-cesnovar Thanks for your follow-up comments. I've not implemented this part yet, but I think for me it seems straightforward:

Hope my understanding is right.

I set up a hierarchical Bayesian model, which has thousands of parameters. The khat evaluation is based on the joint distribution, so all the parameters will share the same khat value. That is helpful for detecting the overall model fit from variational inference, but not that helpful for evaluating each parameter (due to the difficulty of getting the marginal posterior distribution for each parameter). But I'm still glad to implement it to evaluate my model.

PSIS might further help to correct the bias. I plan to try it even if khat shows a big value (> 0.7). From the Yao et al., 2018 paper, that seems OK.

I will use a much smaller tolerance (like 0.0001) and a relatively small learning rate (eta around 0.2) as the stopping rule for Stan VI in my model, to approximate the better stopping rule defined in Dhaka et al., 2020.
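
For reference, a hedged sketch of a cmdstanr call with those settings (argument names follow the variational() documentation; the CmdStanModel `mod` and data list `stan_data` are placeholders):

```r
fit <- mod$variational(
  data = stan_data,
  algorithm = "meanfield",
  eta = 0.2,               # fixed, relatively small step size
  adapt_engaged = FALSE,   # use the supplied eta instead of adapting it
  tol_rel_obj = 1e-4,      # tighter ELBO-based stopping threshold
  seed = 123
)
```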

@avehtari do you have a plan to update the variational inference in Stan based on the Dhaka et al., 2020 paper? It's really great!