greta-dev / greta

simple and scalable statistical modelling in R
https://greta-stats.org
Other
530 stars 63 forks source link

Could calculate return a custom class, "greta_calculate" or something? #481

Open njtierney opened 2 years ago

njtierney commented 2 years ago

This might make it easier to do tidying on the output, so we can avoid steps like this:

penguins_prediction <- sims$probability_female_pred[, , 1] %>%
  t() %>%
  as_tibble() %>%
  set_names(paste0("sim_", seq_len(n_sims))) %>%
  bind_cols(
    penguins_for_prediction,
    .
  ) %>%
  pivot_longer(
    cols = starts_with("sim"),
    names_to = "sim",
    values_to = "probability_female",
    names_prefix = "sim_"
  )

Perhaps integrating with either mmcc or tidybayes

hrlai commented 2 years ago

Hi @njtierney , on a related note is that calculate returns different object class depending on whether we're doing prior or posterior predictive checks.

library(greta)
#> 
#> Attaching package: 'greta'
#> The following objects are masked from 'package:stats':
#> 
#>     binomial, cov2cor, poisson
#> The following objects are masked from 'package:base':
#> 
#>     %*%, apply, backsolve, beta, chol2inv, colMeans, colSums, diag,
#>     eigen, forwardsolve, gamma, identity, rowMeans, rowSums, sweep,
#>     tapply

# a simple Bayesian regression model for the iris data
# priors
int <- normal(0, 5)
#> ℹ Initialising python and checking dependencies, this may take a moment.
#> ✓ Initialising python and checking dependencies ... done!
#> 
coef <- normal(0, 3)
sd <- lognormal(0, 3)

# likelihood
mean <- int + coef * iris$Petal.Length
distribution(iris$Sepal.Length) <- normal(mean, sd)

# build and sample
m <- model(int, coef, sd)
draws <- mcmc(m)

# calculate values

coef_posterior <- calculate(coef, values = draws)
class(coef_posterior)
#> [1] “greta_mcmc_list” “mcmc.list”

coef_posterior_sim <- calculate(coef, values = draws, nsim = 1000)
class(coef_posterior_sim)
#> [1] “list”

coef_prior_sim <- calculate(coef, nsim = 1000)
class(coef_prior_sim)
#> [1] “list”

Created on 2022-03-28 by the reprex package (v2.0.1)

Since the different object also arrange the MCMC samples in different dimensions, it is not easy to apply the same code (e.g., involving tidybayes) to them downstream. Specifically, I like that when nsim is specified, calculate returns a list of arrays with the same dimensions as the original greta_array. I.e.,

> dim(coef_posterior_sim$coef)   # first dim is number of MCMC samples
[1] 1000    1    1
> dim(coef)
[1] 1 1

But it isn't easy to recover the dimension with the posterior stored in mcmc.list:

> dim(as.matrix(coef_posterior))
[1] 4000    1

This becomes more different when coef has more than one row or more than one column, because they are vectorised before being saved as a mcmc.list (right?) That said, the format of mcmc.list makes it readily integrated into the tidybayes syntax/grammar, e.g, the latter extract parameters as coef[row, col]. If we decide to favour a greta_calculate object having the same dimensions as the input greta_array, then we need some auxillary function that turns the former into a tidybayes and mcmc compatible thing...

njtierney commented 2 years ago

Hi @hrlai !

Thanks so much for this - great points, I think ideally want to be able to use the same/similar workflow for prior and posterior predictive checks, right? RE the list of arrays having the same dimension as the original greta array, I feel like I would prefer to have it return mcmc.list in both cases and then have ways to transform that output into other common formats? What do you think?

If you've got some examples of workflow where you run into these problems I'm happy to help test out these new features with you, at the moment the priority for the next release is getting everything to work with M1 and/or TF2.0, so we might not come back to this issue until we solve that first, just so you've got an idea of timelines/order or operations.

Standardising mcmc output is something I thought about a bit with https://github.com/njtierney/mmcc - which might still get some more development, but I see that tidybayes is a lot more mature now. Just thought I'd plug mmcc in case the tidying functions there are useful, it deliberately does less than tidybayes, the idea being that one you've got data in a standard format you can use data.table or dplyr to do the munging.

hrlai commented 2 years ago

have it return mcmc.list in both cases and then have ways to transform that output into other common formats? What do you think?

Ah, I just remembered that there are plot and other methods for mcmc.list, so yeah keeping it in this format will be better than returning greta_array-like outputs. People who have been using coda should be more familiar, making greta a more welcoming change from JAGS/BUGS and the likes... I also just realised the bayesplot functions take in mcmc.list too, D'oh!

By the way, I like how lightweight mmcc is and might give it a try as an alternative to tidybayes :)