hyunjimoon / SBC

https://hyunjimoon.github.io/SBC
Other
53 stars 4 forks source link

Error in `dplyr::select()` #84

Open spinkney opened 1 year ago

spinkney commented 1 year ago

I'm not sure how to proceed after getting this error

Simulation 1 resulted in error when post-processing the fit.
Calling `recompute_SBC_statistics` after you've found and fixed the problem could let you move further without refitting
Error in `dplyr::select()`:
! Can't subset columns that don't exist.
✖ Column `variable` doesn't exist.

With this code

library(SBC) # remotes::install_github("hyunjimoon/SBC")
library(cmdstanr)
library(MCMCpack)
library(ggplot2)

library(future)
plan(multisession)
options(SBC.min_chunk_size = 5)

m <- cmdstan_model("softmax.stan")
backend <- SBC_backend_cmdstan_sample(m, chains = 2)

generate_one_dataset <- function(N, K, prior_alpha = 1) {
    x <- rdirichlet(1, alpha = rep(prior_alpha, K))
    observed <- as.integer(rmultinom(1, size = N, prob = x))

    list(
        variables = list(x = x),
        generated = list(K = K, observed = observed, prior_alpha = prior_alpha)
    )
}

y <- generate_one_dataset(N = 100, K = 20, prior_alpha = 4)

y_out <- m$sample(
    data = list(
        K = y$generated$K,
        observed = y$generated$observed,
        prior_alpha = y$generated$prior_alpha
    ),
    parallel_chains = 4
)

datasets_flat <- generate_datasets(
    SBC_generator_function(generate_one_dataset, N = 100, K = 20),
    n_sims = 10)

res_flat <- compute_SBC(datasets_flat, backend)

and Stan model

data {
  int K;
  array[K] int<lower=0> observed;
  real<lower=0> prior_alpha;
}

parameters {
  vector[K - 1] y;
}

transformed parameters {
  simplex[K] x = softmax(append_row(y, 0.));
}

model {
  x ~ dirichlet(rep_vector(prior_alpha, K));
  observed ~ multinomial(x);
}
martinmodrak commented 1 year ago

Thanks for reporting! The underlying problem is that if you look at datasets_flat$variables, you'll notice that it treats x as 2D array, so the variables are called x[1,1], x[1,2] ... , while the variables in Stan fit are called x[1], x[2]. SBC then finds no shared variables and this leads to problems. This deserves a better error message.

However, you can get the code to work by changing to:

generate_one_dataset <- function(N, K, prior_alpha = 1) {
  x <- as.numeric(rdirichlet(1, alpha = rep(prior_alpha, K)))
  observed <- as.integer(rmultinom(1, size = N, prob = x))

  list(
    variables = list(x = x),
    generated = list(K = K, observed = observed, prior_alpha = prior_alpha)
  )
}

Also note that the $generated element of datasets is already in format to be pushed to Stan, so you can simplify your test with:

y_out <- m$sample(
  data = y$generated,
  parallel_chains = 4
)

I'll try to inform messages around this issue...

spinkney commented 1 year ago

Thanks for the quick response @martinmodrak!