stan-dev / cmdstanr

CmdStanR: the R interface to CmdStan
https://mc-stan.org/cmdstanr/
Other
144 stars 63 forks source link

non-numeric unused data objects give an error if passed to `sample()` #330

Closed michael-chong closed 4 years ago

michael-chong commented 4 years ago

Describe the bug The sample() method gives an error when objects in the data list are non-numeric, even if the object is not used in the model.

To Reproduce

library(cmdstanr)

data_list <- list(
  N = 10,
  y = rnorm(10),
  extra_stuff = "a"
)

mod <- cmdstan_model(write_stan_file("

  data {
    int<lower=0> N;
    vector[N] y;
  }

  parameters {
    real mu;
    real<lower=0> sigma;
  }

  model {
    y ~ normal(mu, sigma);
  }

"))

fit <- mod$sample(
  data = data_list,
  seed = 1,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 1000
)

The above code gives: Error: Variable 'extra_stuff' is of invalid type.

Replacing extra_stuff with 1 for example allows the model to run with no issues.

Expected behavior As in rstan (I believe), only objects used in the model are required to be a certain type.

Operating system macOS 10.15.7

CmdStanR version number cmdstanr version 0.1.3 cmdstan version 2.25.0

jgabry commented 4 years ago

Thanks for reporting this. I have mixed feelings about this. It's weird for CmdStanR and RStan to behave differently here but if there's an object in the list that is invalid (e.g., an R character vector could never be valid input to Stan) then what's the use case for including it in the list passed to Stan? I can definitely see a use case for including valid objects in the list that aren't used for a particular model (they could be used for other models), but I'm having trouble imagining why I would ever need to include an object in the list that can never be used for any model. So can you share a bit more about your use case for this? I'm happy to be persuaded!

michael-chong commented 4 years ago

For me it can be convenient if I want to fit the model on different data sets, and I want to embed some meta-information about the data. There are two specific cases I have in mind:

It's usually a bit redundant, since this information exists elsewhere in the workflow, and if not I could always tag this information onto an object I save later, but I'd like it if the data lists could be more self-contained and portable.

jgabry commented 4 years ago

Thanks, that makes sense. Keeping the data lists more self-contained and portable seems like a good use case.

mike-lawrence commented 4 years ago

Do we strip off any of the attributes of the data object supplied by the user? I've been using attributes as a metadata scheme in my projects elsewhere, so maybe we can recommend that here too.

mike-lawrence commented 4 years ago

Oh, one idea is that you could use the attributes of the fit object to store your metadata about the fit

fit = mod$sample(...)
attr(fit,'metadata') = my_metadata

That would seem to be more straightforward in that it keeps data and metadata as clearly separate things, no?

jgabry commented 4 years ago

Do we strip off any of the attributes of the data object supplied by the user? I've been using attributes as a metadata scheme in my projects elsewhere, so maybe we can recommend that here too.

I think this could work even if we strip off the attributes because:

data <- list(...) # assume the list or individual elements have attributes
fit <- model$sample(data)  # suppose we strip attributes internally 
data # still has attributes

The only reason the attributes would get lost is if we removed them, then saved the data object inside the returned CmdStanMCMC object, and then it was that data object that was subsequently used. But we don't actually save the data in the object, so the original data list is what the user still has.

Does that make sense? Or did I misunderstand your idea?

mike-lawrence commented 4 years ago

I was actually thinking there was a way to retrieve the data from the fitted object, but now I see that's not the case, so setting an attribute on the data object supplied to sample or the resulting fit would both work.

jgabry commented 4 years ago

Yeah, with rstanarm and brms you can get the data from the fitted model object, but we don't do that for rstan and cmdstanr. Occasionally I wish we did, but usually I think it's better to avoid storing another copy of the data.

michael-chong commented 4 years ago

Thanks @mike-lawrence, I hadn't thought of setting attributes on the data list; I think that's a good idea and alternative. Setting attributes on the fitted object might not achieve the same goal if, for instance, the data needs to be used somewhere else.

I'll certainly make use of attributes on the data list, though, and I also like that it distinguishes metadata.

mike-lawrence commented 4 years ago

You can even add the data as an attribute to the fit object if you want to be succinct!

jgabry commented 4 years ago

Coming back to this to take stock of where we're at. I think @michael-chong's use case for passing invalid objects in is a good one, but if @mike-lawrence's suggestion of using attributes instead is sufficient then maybe we don't need to change anything here? What does everyone think? Thanks Michael and Mike for bringing this up and suggesting an alternative, respectively.

michael-chong commented 4 years ago

I think using attributes is sufficient for my needs. Thanks @jgabry and @mike-lawrence for the help!

rok-cesnovar commented 4 years ago

Thanks!