Vignette for large/complex model workflow

martinmodrak commented 2 years ago

Main ideas:

Can start with ADVI/optimization
Can (sometimes) use reduced data/model (but then reduced validity, still better than nothing)
- I.e. just one random effect, just three states in a HMM, ...
How to learn something even from a single fit? (will need some additional code/support)
- Merge all the ranks together
- z-scores (or equivalent)

hyunjimoon commented 2 years ago

I would be excited to push this forward if you think one of the following (preferably their intersection) can be considered as complex model from your point of view. Links are related issues for each module.

hyunjimoon commented 2 years ago

@Dashadower will first attack hierarchical and @hyunjimoon will attack high dimension.

hyunjimoon commented 1 year ago

Building on @martinmodrak's small workflow, @tomfid and I will make a document for Bayesian workflow on ODE model for urban dynamics (mdl file).

This satisfies at least two categories above (ODE, high dimension). Moreover, Vicky's research on urban scaling, regressing the log of city index such as congestion or income inequality with log of city population opens up the path toward hierarchical Bayesian modeling. In short, hierarchical Bayesian start from viewing $\beta$ as random variable (hence the name Bayesian regression or Random effect model) and giving a prior distribution. Then, we replace the scalar value of prior parameter to another random variable. This allows multi-level effect of learning from data, placing itself as partial pooling between no-pooling and complete pooling model. This document is a good introduction as it has both Stan code and similar model structure (log-log).

Considering $\beta$ is (averaged) elasticity at fixed time from eq.2.2. in the first paper above, Tom's writing on challenges for informative parameter setting in dynamic models is relevant. Following the excerpt from Bayesian workflow ("a pragmatic idea is to keep the priors and compute reasonable parameter values using the real data. This can be done either through rough estimates or by computing the actual posterior. We then suggest widening out the estimates slightly and using these as a prior for the SBC.") I recommend starting from a tight prior. Absence of prior knowledge or data is an elephant in the room which I aim to address with priordb project where realistic values of parameter that could be assumed are collectively learned.

We will include:

Must

Classifying parameters into three: assumed parameter, assumed parameter time-series, estimated parameter.
Justification of prior specification and its automation
Three checks: prior predictive, posterior predictive, simulation-based calibration

For 2, sections "Multiplicative error and the lognormal distribution, Weakly informative priors, Priors for system parameters and noise scale" from this case study on population dynamics is a good place to start for setting distribution and parameter for prior. This corresponds to "Specify_implicit" (H5.abc) from this Human-Machine collaboration table (HMC table).

We are consider including:

Option

Demand prior elicitation to policy function and its optimization for policy prescription
Comparing different posterior approximator modules (MCMC, variational inference, optimization)

For 4, translating Vensim's .vpd to Stan model block is the key as then we can use its optimization engine like this restaurant revenue optimization example. For 5, the aim is to find the cheaper(-est) computation that reaches conditioned precision (step 9 from HMC table).

Tool

stan_builder I am developing with @Dashadower on @JamesPHoughton's Pysd (currently on stan-backend branch pull-requested here)
One example on prey-predator here which includes 1,2,3 for above.
More complicated example is on stock management here with at least ten assumed parameters (will be updated by 9/5)

Ref

Workflow sequence conditional on data and model by Mike Betancourt here

hyunjimoon commented 1 year ago

@jandraor and I am trying this with three example models in https://github.com/Data4DM/BayesSD/discussions/76. @tomfid's help, especially regarding inferencedata is helpful as vensim supporting this format would be crucial in connecting Vensim subscript with hierarchical Bayesian.

Also, @OriolAbril and @ahartikainen are helping connecting this to arviz. Thanks!

hyunjimoon commented 1 year ago

@martinmodrak @Dashadower, could current SBC R library's output be easily transformed to inferencedata by any chance? Or would there be any reference codes we can refer to e.g. previous attempts of our community to connect posterior and arviz? @Jandraor and I are using different language (R, Python) and wondered whether we can pool our efforts in plots by having a modularized data structure.

ahartikainen commented 1 year ago

cc @mike-lawrence

OriolAbril commented 1 year ago

this issue https://github.com/stan-dev/posterior/issues/85 sounds relevant to interoperability

hyunjimoon commented 1 year ago

Below is rough plan which I felt needed for large model workflow. Enjoyable milestone is Bayesian workflow dynamic model casestudy on prey-predator, SEIR, inventory management by around March, 2023. Thank you very much, all!

Goal: bridging Vensim ecosystem with Stan ecosystem

provide efficient (gradient-based) and effective (diagnostics) HMC-estimator to dynamic model (generator)
consistency among data(.nc), model (.stan), plots (.png)
template for simulation-based calibration checks e.g. this python file

For this, I am trying to

connect stanify with Dynamic simulation scenario 1,2 (with @tomfid, @enekomartinmartinez, @tseyanglim, @JamesPHoughton's support)
1's result by putting many .nc files into one sbc.nc (with @Dashadower, @OriolAbril, @ahartikainen's support)
connect .nc output with SBC package via rvar concept (with @paul-buerkner, @martinmodrak, @jandraor's support)

Dynamic simulation scenario (Vensim)

outputs netcdf format (.nc). Scenarios to reach .nc.

scenario 1) Vensim/Stella user on Python

pysd translates .mdl, .xml to python objects (support Stella, which stanify lacks now)
pysd can output synthetic data in generator.nc
@enekomartinmartinez, @JamesPHoughton maintains pysd

scenario 2) Pure Vensim user

Vensim .mdl is considering supporting .nc format, if this happens, it can output generator.nc and estimator.nc
Vensim has internal MCMC, but wish to connect to HMC
@tomfid is Vensim CTO and @tseyanglim has expertise in connecting Vensim hierarchical model with python

stanify(scenario 1 or 2)

stanify translates .mdl to .stan and outputs one generator.nc and estimator.nc for baseline case (no hierarchy, no prior_draw's')
stanify outputs one generator.nc and three (n_prior_draws) number of estimator.nc for SBC
stanify outputs one generator.nc and two (n_subgroups) number of estimator.nc for hierarchical model

@Dashadower is SBC developer, @OriolAbril, @ahartikainen are arviz developer
arviz's wrapper or multi-index for posterior group can streamline 2,3 which can be added on @OriolAbril's sbc-cmdstanpy

Computational statistician (Stan)

transform .nc to rvars which SBC package supports. Three verifications needed:

aria code translates .nc to rvars here (@mike-lawrence, could you confirm this?)
SBC supports rvars here (@martinmodrak, could you confirm this?)
rvar-based Bayes visualization + empirical coverage plot is explainable enough for policy specification on dynamic model (@jandraor, could you confirm this?). Matthew's rvars explains how rvar can make visualization easy by grouping random variable which may be relevant to inferencedata's data variable concept.

mike-lawrence commented 1 year ago

The nc-to-rvars conversion is done by the class_composition R6 class here, specifically the nc_to_rvar method here

hyunjimoon / SBC