stan-dev / rstanarm

rstanarm R package for Bayesian applied regression modeling
https://mc-stan.org/rstanarm
GNU General Public License v3.0
388 stars 133 forks source link

kfold seems to work incorrectly (without a warning) when the data is of certain form #162

Closed paasim closed 7 years ago

paasim commented 7 years ago

Summary:

The kfold-function in rstanarm seems to work incorrecly when data is not provided as a data.frame.

Description:

See the example code below. When data is provided as in fit2 and fit3, for some reason kfold only uses the first column of input data (x) instead of all the columns. I first noticed this when experimenting with the function I provided in #161, but below is an example that demonstrates the problem without the function in #161.

This is kind of indirect, but running the code below you can see, that performing kfold for fit1 is about twice as fast as for fit2 or fit3, even though they should do the same thing. In addition, the elpds in kf2 and kf3 agree while the elpd in kf1 is quite different. If in the example y is set to depend on the first column of x, the resulting elpds will be pretty similar (although the difference in the fitting time remains), because then the only column that is included happens to be the only one that matters for elpd.

I believe this is related to the update.stanreg-function called inside kfold (although calling the update-function directly does not seem to cause problems). At least for now I do not how to fix this (other than calling the function only when the data is of 'correct' form).

Reproducible Steps:

set.seed(1)
n <- 100
d <- 10
x <- MASS::mvrnorm(n, rep(0, d), diag(rep(1, d)))

# y is determined by the second column of x
y <- rnorm(n, 3*x[, 2])

# this will work with kfold
data1 <- data.frame(x = I(x), y = y)
# this will not work with kfold
data2 <- list(x = x, y = y)

fit1 <- rstanarm::stan_glm(y~x, data = data1, seed = 1)
fit2 <- rstanarm::stan_glm(y~x, data = data2, seed = 1)
# this will not work with kfold
fit3 <- rstanarm::stan_glm(y~x, seed = 1)

kf1 <- rstanarm::kfold(fit1, K = 2)
kf2 <- rstanarm::kfold(fit2, K = 2) # 2x faster than kfold(fit1, K = 2).
kf3 <- rstanarm::kfold(fit3, K = 2) # 2x faster than kfold(fit1, K = 2).

# kf1 differs from kf2 and kf3 a lot.
kf1$elpd_kfold
kf2$elpd_kfold 
kf3$elpd_kfold

RStanARM Version:

2.14.1

R Version:

3.3.2

Operating System:

macOS 10.12.3

jgabry commented 7 years ago

Thanks Markus. I think this is related to some other issues we've been discussing recently related to data specification. In order for update and anything that calls update to be guaranteed to work properly we might need to restrict the ways data can be specified. I'll look into your particular examples and make sure it's the same problem that's going on here.

On Thu, Feb 9, 2017 at 9:34 AM Markus Paasiniemi notifications@github.com wrote:

Summary:

The kfold-function in rstanarm seems to work incorrecly when data is not provided as a data.frame. Description:

See the example code below. When data is provided as in fit2 and fit3, for some reason kfold only uses the first column of input data (x) instead of all the columns. I first noticed this when experimenting with the function I provided in #161 https://github.com/stan-dev/rstanarm/issues/161, but below is an example that demonstrates the problem without the function in #161 https://github.com/stan-dev/rstanarm/issues/161.

This is kind of indirect, but running the code below you can see, that performing kfold for fit1 is about twice as fast as for fit2 or fit3, even though they should do the same thing. In addition, the elpds in kf2 and kf3 agree while the elpd in kf1 is quite different. If in the example y is set to depend on the first column of x, the resulting elpds will be pretty similar (although the difference in the fitting time remains), because then the only column that is included happens to be the only one that matters for elpd.

I believe this is related to the update.stanreg-function called inside kfold (although calling the update-function directly does not seem to cause problems). At least for now I do not how to fix this (other than calling the function only when the data is of certain form). Reproducible Steps:

set.seed(1) n <- 100 d <- 10 x <- MASS::mvrnorm(n, rep(0, d), diag(rep(1, d)))

y is determined by the second column of x

y <- rnorm(n, 3*x[, 2])

this will work with kfold

data1 <- data.frame(x = I(x), y = y)

this will not work with kfold

data2 <- list(x = x, y = y)

fit1 <- rstanarm::stan_glm(y~x, data = data1, seed = 1) fit2 <- rstanarm::stan_glm(y~x, data = data2, seed = 1)

this will not work with kfold

fit3 <- rstanarm::stan_glm(y~x, seed = 1)

kf1 <- rstanarm::kfold(fit1, K = 2) kf2 <- rstanarm::kfold(fit2, K = 2) # 2x faster than kfold(fit1, K = 2). kf3 <- rstanarm::kfold(fit3, K = 2) # 2x faster than kfold(fit1, K = 2).

kf1 differs from kf2 and kf3 a lot.

kf1$elpd_kfold kf2$elpd_kfold kf3$elpd_kfold

RStanARM Version:

2.14.1 R Version:

3.3.2 Operating System:

macOS 10.12.3

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/stan-dev/rstanarm/issues/162, or mute the thread https://github.com/notifications/unsubscribe-auth/AHb4QyyPrOgLP6B0zsWLV9GleTK6JqHfks5rayQPgaJpZM4L8M7a .

jgabry commented 7 years ago

@paasim I think it's going to be hard to guarantee that kfold will work correctly unless the data argument to the initial modeling function is a data frame. The problem here is that all sorts of small things are different when the data is specified as a data frame vs as a list vs as an environment (i.e. no data argument). I think in your examples what's happening is that x is being treated properly in the first case but in the other cases only the first column of x is being used in kfold. That's why the answers are different and why it runs faster.

If you want to see exactly what's happening, try this:

d1 <- rstanarm:::kfold_and_reloo_data(fit1)
d2 <- rstanarm:::kfold_and_reloo_data(fit2)
d3 <- rstanarm:::kfold_and_reloo_data(fit3)

and then check out how the structure of d1 and the variable names are very different from d2 and d3.

It isn't hard to fix kfold_and_reloo_data on a case by case basis but in general it's hard to always get this right when the user can specify data in so many ways (this is an issue for lots of modeling packages in R, e.g. check out the warning in the doc for the data argument for the lmer function in lme4 at help("lmer", package = "lme4")). That's part of the reason I want to either require using a data frame always or not allow using functions like kfold if a data frame wasn't used.

paasim commented 7 years ago

Right, this is kind of what I anticipated. And I agree that it doesn't make much sense to make adhoc fixes for cases like the ones I described.

But anyway thanks for a quick reply, this is good to know in the future.