stan-dev / posteriordb

Database with posteriors of interest for Bayesian inference

Proposal: Add `data-used` to posterior `.json` files where relevant #260

Open JasonPekos opened 3 weeks ago

JasonPekos commented 3 weeks ago

Proposal:

Modify the posterior .json files to specify which data from the dataframe is actually used as input to the model.

Rationale:

Some models only use a subset of their data. For example, earn-height uses the earnings data:

  N      => 1192
  earn   => [50000, 60000, 30000,...
  height => [74, 66, 64, 63, 63, 64,...
  male   => [1, 0, 0, 0, 0, 0, 0,...

Of this data, earn-height only uses a subset: N, earn, height. This is fine for Stan, which will automatically discard data that doesn't match variables defined in the data block.

Unfortunately, this is frustrating when trying to port PosteriorDB models to other PPLs. Many PPLs — notably Turing, but I think also PyMC, NumPyro, Gen, and so on — use some sort of overloaded function definition to define a probabilistic program, e.g.:

# generic-ppl-pseudocode:

@make_model function model_name(data_1, data_2, data_3){
     prior ~ dist()
     data_1 ~ dist(prior, smth ...)
}

In this setup, the data arguments need to exactly match the columns of the dataframe, so the dataframe must be filtered beforehand to extract the relevant columns. To make this easier, it would be helpful to have a field in the posterior .json file, data_used, listing the variables the model actually uses.
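
A minimal Python sketch of the mismatch described above, using a plain function as a stand-in for a PPL model. The function `earn_height_model` is hypothetical and does no inference, and the data is a toy version of the earnings example (the first three values of each column, with N shortened to 3 to match):

```python
def earn_height_model(N, earn, height):
    """Hypothetical stand-in for a PPL model whose arguments are its data inputs."""
    return {"N": N, "earn": earn, "height": height}


# Toy version of the earnings data: first three values per column, N shortened to match.
earnings = {
    "N": 3,
    "earn": [50000, 60000, 30000],
    "height": [74, 66, 64],
    "male": [1, 0, 0],
}

# Passing the full data fails because `male` is not an argument of the model.
try:
    earn_height_model(**earnings)
    full_call_failed = False
except TypeError:
    full_call_failed = True

# Manual filtering: whoever ports the model has to work out this list by hand.
used = ["N", "earn", "height"]
filtered = {name: earnings[name] for name in used}
model_inputs = earn_height_model(**filtered)
```

The hard-coded `used` list is the piece of information that currently has to be recovered by reading each Stan data block.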


Example addition:

{
  "name": "earnings-earn_height",
  "keywords": ["arm book", "stan examples"],
  "urls": "https://github.com/stan-dev/example-models/tree/master/ARM/Ch.4",
  "model_name": "earn_height",
  "data_name": "earnings",
  "reference_posterior_name": "earnings-earn_height",
  "references": "gelman2006data",
  "dimensions": {
    "beta": 2,
    "sigma": 1
  },
  "added_date": "2020-01-17",
  "added_by": "Oliver Järnefelt"
}

would become:

{
  "name": "earnings-earn_height",
  "keywords": ["arm book", "stan examples"],
  "urls": "https://github.com/stan-dev/example-models/tree/master/ARM/Ch.4",
  "model_name": "earn_height",
  "data_name": "earnings",
  "data_used": ["N", "earn", "height"],            # <--------------- the change is here
  "reference_posterior_name": "earnings-earn_height",
  "references": "gelman2006data",
  "dimensions": {
    "beta": 2,
    "sigma": 1
  },
  "added_date": "2020-01-17",
  "added_by": "Oliver Järnefelt"
}
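
As a rough illustration of how a port could consume the proposed field, here is a Python sketch. The helper `select_model_data` is hypothetical and not part of any posteriordb API, the JSON is a trimmed copy of the entry above, and the data is a toy stand-in for the earnings dataset. It assumes that a posterior without `data_used` uses all of its data:

```python
import json

# Trimmed copy of the proposed posterior entry; `data_used` is the proposed field.
posterior_json = """
{
  "name": "earnings-earn_height",
  "model_name": "earn_height",
  "data_name": "earnings",
  "data_used": ["N", "earn", "height"]
}
"""


def select_model_data(posterior_info, data):
    """Return only the data entries that the model uses.

    Falls back to the full data when `data_used` is absent, on the assumption
    that the field is only added where some of the data is unused.
    """
    used = posterior_info.get("data_used")
    if used is None:
        return dict(data)
    missing = [name for name in used if name not in data]
    if missing:
        raise KeyError(f"data_used names not found in data: {missing}")
    return {name: data[name] for name in used}


posterior_info = json.loads(posterior_json)

# Toy stand-in for the earnings data (first three values per column).
earnings = {
    "N": 3,
    "earn": [50000, 60000, 30000],
    "height": [74, 66, 64],
    "male": [1, 0, 0],
}

model_data = select_model_data(posterior_info, earnings)
```

The check for missing names is an extra safeguard in this sketch: it catches a `data_used` entry that does not correspond to any variable in the data.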

This change would only be needed for posteriors where the provided data is a strict superset of the data the model actually uses.