paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/

Predict gives near perfect fits in models with high auto-correlation #596

Closed. Geomorph2 closed this issue 5 years ago.

Geomorph2 commented 5 years ago

Thanks for this great package! I'm relatively new to it and am running into an unexpected issue with an ARMA model in brms. The model I'm running is for a paired watershed BACI study. We've used GLS in the past, but I've been having trouble getting a decent fit on a couple of sites, so I tried a GAM with autocorrelation in brms.

The model looks good but something is up with the predictions. My guess is that the predict function is applying the AR to the data rather than the residuals. I surmise this because the fits get close to what I expect as the AR coefficient decreases, but in any model where I have significant autocorrelation, I get an almost perfect fit to my post-treatment data even if I add in a fake treatment effect.

I apologize for the long example. It took me a while to get something that was both reproducible and highlighted the issue.

### 300 days of real discharge data from the reference site
ref <- c(5.28, 4.13, 3.05, 2.64, 3.35, 3.35, 3.49, 3.11, 2.73, 2.32, 2.27, 2.38, 2.44, 2.05, 1.85, 1.68, 1.54, 1.48, 1.39, 1.39, 1.98, 10.9, 11.11, 8.35, 7.12, 5.35, 4.47, 10.38, 13.04, 10.3, 7.57, 4.93, 4, 3.61, 4.64, 3.95, 3.38, 2.95, 2.67, 2.39, 2.12, 2.2, 2.4, 2.4, 2.39, 2.32, 3.3, 10.22, 9.74, 8.23, 6.42, 4.95, 3.91, 3.12, 2.56, 2.13, 1.78, 1.52, 1.3, 1.13, 1.01, 0.9, 1, 3.93, 4.84, 19.41, 13.28, 9.79, 11.38, 9.33, 7.72, 6.81, 5.92, 5.04, 4.38, 3.81, 3.36, 6.39, 5.39, 4.91, 4.31, 3.76, 3.13, 2.63, 2.24, 2, 1.88, 2.16, 1.66, 1.49, 1.34, 1.22, 1.1, 0.97, 0.91, 0.88, 0.83, 0.77, 0.72, 0.68, 0.67, 0.75, 0.64, 0.55, 0.41, 0.36, 0.36, 0.32, 0.29, 0.3, 0.4, 3.98, 3.5, 2.48, 2.16, 7.66, 7.89, 8.43, 6.97, 5.47, 4.12, 3.1, 2.33, 1.8, 1.46, 1.18, 0.99, 0.86, 0.74, 0.62, 0.54, 0.52, 0.49, 0.4, 0.36, 0.33, 0.3, 0.27, 0.27, 0.25, 0.28, 0.24, 0.27, 0.4, 0.5, 0.94, 0.7, 0.51, 0.36, 0.33, 0.33, 0.32, 0.33, 0.29, 0.28, 0.25, 0.23, 0.22, 0.2, 0.18, 0.17, 0.16, 0.14, 0.14, 0.14, 0.13, 0.13, 0.11, 0.11, 0.1, 0.1, 0.1, 0.09, 0.09, 0.08, 0.07, 0.08, 0.1, 0.09, 0.08, 0.09, 0.07, 0.07, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.07, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.06, 0.07, 0.06, 0.07, 0.39, 0.11, 0.07, 0.06, 0.06, 0.08, 0.1, 0.08, 0.12, 0.15, 0.18, 0.25, 0.33, 0.33, 0.39, 0.45, 0.58, 0.52, 0.62, 0.46, 0.58, 0.63, 0.82, 0.85, 1.2, 0.26, 0.55, 0.43, 0.25, 3.06, 23.27, 26.16, 25.34, 18.07, 14.53, 9.79, 6.35, 3.76, 3.23, 3.31, 2.88, 2.53, 2.25, 1.98, 1.79, 1.65, 1.59, 1.5, 1.44, 1.4, 1.38, 1.38, 1.36, 1.41, 1.44, 1.45, 1.47, 1.47, 1.5, 1.49, 1.53, 1.61, 1.62, 1.66, 1.64, 2.28, 2.02, 2.03, 2.7, 2.16, 5.93, 5.22, 4.46, 3.97, 3.59, 4.85, 4.3, 4.06, 5.29, 6.75, 8.91, 17.03, 24.63, 16.68, 10.87, 7.71, 5.63, 4.36, 3.66, 3.18, 2.83)
log.ref <- log10(ref)

### Create a treatment discharge that is AR(1) with an autocorrelation of 0.9
set.seed(10)
resid <- as.numeric(arima.sim(n = 300, list(ar = c(0.9), ma = c(0)), innov = rnorm(n = 300, mean = 0, sd = 0.1)))
log.trt <-  log.ref + exp(log.ref*0.001) + resid

### Make dataframe
day  <- 1:300
period <- rep(c("Before", "After"), each=(150))
data <-  data.frame(day,period,log.ref,log.trt)

### Plot data
plot(log.ref ~ day, data, cex=0.5, ylim=c(-1.2,4), ylab="Log Q")
lines(log.ref ~ day, data, col="grey")
points(log.trt ~ day, data, cex=0.5, col="blue") #Treatment
lines(log.trt ~ day, data, col="blue") #Treatment 

### XY Plot
plot(log.ref ~ log.trt, data)

### Create the pre-treatment dataset
pre <- data[1:150,]

### Add large treatment effect to the post-treatment data at day 151
fake.trt <- data
fake.trt$log.trt[151:300] <- data$log.trt[151:300] + 1
plot(log.ref ~ day, data, cex=0.5, ylim=c(-1.2,4), ylab="Log Q")
lines(log.ref ~ day, data, col="grey")
points(log.trt ~ day, data, cex=0.5, col="blue") #Treatment
lines(log.trt ~ day, data, col="blue") #Treatment 
lines(log.trt ~ day, fake.trt, col="magenta") #Fake treatment effect

### Example in MGCV with autocorrelation
library(mgcv)
m.gamm <- gamm(log.trt ~ s(log.ref), data=pre, correlation = corARMA(form = ~ day, p=1, q=0))
#m.gamm <- gamm(log.trt ~ s(log.ref), data=pre, correlation=corAR1(form = ~ day))
summary(m.gamm)
summary(m.gamm[[1]])
summary(m.gamm[[2]])
data$pred.gam <- predict(m.gamm[[2]], fake.trt)
plot(log.ref ~ day, data, cex=0.5, ylim=c(-1.2,4), ylab="Log Q")
lines(log.ref ~ day, data, col="grey")
points(log.trt ~ day, data, cex=0.5, col="blue") #Treatment
lines(log.trt ~ day, data, col="blue") #Treatment 
lines(pred.gam ~ day, data, col="red", lwd=2)

### MGCV without autocorrelation
m.gam.nc <- gam(log.trt ~ s(log.ref), data=pre)
summary(m.gam.nc)
data$pred.gam.nc <- predict(m.gam.nc, fake.trt)
lines(pred.gam.nc ~ day, data, col="orange", lwd=2)

### BRMS with autocorrelation
library(brms)
m.brm <- brm(log.trt ~ s(log.ref), data=pre, autocor = cor_arma( ~ day, p=1), cores = 4, seed = 30,
             iter = 4000, warmup = 1000, thin = 10, refresh = 0,
             control = list(adapt_delta = 0.9999, max_treedepth = 15))
summary(m.brm)

predicted.brm <- predict(m.brm, fake.trt)
data$pred.brm <- predicted.brm[,1]

plot(log.ref ~ day, data, cex=0.5, ylim=c(-1.2,4), ylab="Log Q")
lines(log.ref ~ day, data, col="grey")
points(log.trt ~ day, data, cex=0.5, col="blue")  #Treatment
lines(log.trt ~ day, data, col="blue")  #Treatment 
lines(log.trt ~ day, fake.trt, col="magenta")
lines(pred.brm ~ day, data, col="red", lwd=2)  #This is the issue
**This is the issue. I shouldn't be getting perfect fits to my post-treatment data where I've added an artificial treatment effect. The predictions should still be following the blue line like they do in mgcv. It does this with or without the smoothing function. The behavior seems entirely dependent on the amount of autocorrelation.**

### BRMS without autocorrelation
m.brm.nc <- brm(log.trt ~ s(log.ref), data=pre, cores = 4, seed = 30,
                iter = 4000, warmup = 1000, thin = 10, refresh = 0,
                control = list(adapt_delta = 0.9999, max_treedepth = 15))
summary(m.brm.nc)

predicted.brm.nc <- predict(m.brm.nc, fake.trt)
data$pred.brm.nc <- predicted.brm.nc[,1]

lines(pred.brm.nc ~ day, data, col="orange", lwd=2)

### BRMS without smooth but with autocor
m.brm.lin <- brm(log.trt ~ log.ref, data=pre, autocor = cor_arma( ~ day, p=1), cores = 4, seed = 30,
             iter = 4000, warmup = 1000, thin = 10, refresh = 0,
             control = list(adapt_delta = 0.9999, max_treedepth = 15))
summary(m.brm.lin)
predicted.brm.lin <- predict(m.brm.lin, fake.trt)
data$pred.brm.lin <- predicted.brm.lin[,1]

plot(log.ref ~ day, data, cex=0.5, ylim=c(-1.2,4), ylab="Log Q")
lines(log.ref ~ day, data, col="grey")
points(log.trt ~ day, data, cex=0.5, col="blue") #Treatment
lines(log.trt ~ day, data, col="blue") #Treatment 
lines(log.trt ~ day, fake.trt, col="magenta")
lines(pred.brm.lin  ~ day, data, col="red", lwd=2)
paul-buerkner commented 5 years ago

Do you want to do out-of-sample predictions on your new data? By default, brms performs in-sample predictions, which leads to the excellent fits you pointed out. To specify which observations should be treated as out-of-sample, please use the oos argument of predict. You need brms 2.7.0 or higher for this to work.
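For example, something along these lines (a minimal sketch, assuming the new observations occupy rows 151:300 of your prediction data):

### treat rows 151:300 as out-of-sample so the AR term does not condition on their observed residuals
predicted.oos <- predict(m.brm, newdata = fake.trt, oos = 151:300)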

If you are not sure whether your issue is a bug or rather a question about the usage of brms, I would recommend asking on https://discourse.mc-stan.org first before posting an issue.

Geomorph2 commented 5 years ago

Thanks Paul. I'm using brms_2.7.1 now. I looked through the help and could only find oos in extract_draws.brmsfit; nothing for predict.brmsfit. I think my issue is much simpler than you imagine.

I fit the same simple model with lm, gls, gam, and brms. The lm, gls, and gam predictions (with and without AR where available) follow the trend in the independent variable for newdata. But for some reason, when I include high AR, predict.brmsfit follows the trend in the dependent variable of newdata almost perfectly, regardless of what the model coefficients are.

If I drop the AR term from the model, predict.brmsfit goes back to following the trend in the independent variable and I get predictions similar to those from lm, gls, and gam.

paul-buerkner commented 5 years ago

predict.brmsfit passes arguments to extract_draws as documented in ?predict.brmsfit. What happens if you specify all new observations as oos?

paul-buerkner commented 5 years ago

How does predicted.brm <- predict(m.brm, fake.trt, oos = 151:300) work for you?

Geomorph2 commented 5 years ago

Thanks Paul. I was struggling to figure out how to specify the optional indices of observations for oos. The good news is that I now get the same result regardless of whether I use the original data or fake.trt as the newdata. Thanks! The bad news is that the fits are really bad in the post-treatment period (151:300) compared with gls or gamm. I don't know why the post-treatment fits are so bad, but maybe it's a Stan or priors issue. The model coefficients are off in the brms models with AR, especially the AR coefficient (Stan estimates AR[1] = 0.98 when it's actually 0.9 in the simulated data). For what it's worth, my real data was much easier to fit than this toy example.

I'm all ears if you have any suggestions for making the model work; otherwise I can stick to gls and gamm for the time being. Thanks again, this is an amazing package, and sorry for my rather ignorant questions.

paul-buerkner commented 5 years ago

I will take a closer look to find out what is going wrong here.

Just one quick comment unrelated to your particular issue: don't use thin in Stan in 99.9% of cases. Most likely you will just waste a lot of information.
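For instance, a sketch of your earlier call with thin simply dropped (all post-warmup draws are kept):

m.brm <- brm(log.trt ~ s(log.ref), data = pre, autocor = cor_arma(~ day, p = 1),
             cores = 4, seed = 30, iter = 4000, warmup = 1000, refresh = 0,
             control = list(adapt_delta = 0.9999, max_treedepth = 15))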

paul-buerkner commented 5 years ago

I investigated the issue more closely, and it really seems to occur only for autocorrelations that are very close to 1 or -1. For instance, when I used ar = 0.8 (or lower), the predictions of brms closely resemble those of mgcv. I am closing this issue as I think the original problem has been resolved.
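For reference, a sketch of the kind of check I ran, changing only the AR coefficient in your simulation:

set.seed(10)
resid08 <- as.numeric(arima.sim(n = 300, list(ar = 0.8), innov = rnorm(n = 300, mean = 0, sd = 0.1)))
pre08 <- data.frame(day = 1:300, log.ref, log.trt = log.ref + exp(log.ref * 0.001) + resid08)[1:150, ]
m.brm08 <- brm(log.trt ~ s(log.ref), data = pre08, autocor = cor_arma(~ day, p = 1))
# with ar = 0.8, predictions from m.brm08 closely resemble those of mgcv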

paul-buerkner commented 5 years ago

After staring at the examples for quite some time, I think the problem is that the time series of the residuals is initially non-stationary when sigma is small and AR is large, which induces a positive bias in the AR estimate in the brms / Stan implementation. Once you remove the initial non-stationary part of the time series, the estimates are much more reasonable. Here is an example:

set.seed(123)
y <- 5 + as.numeric(arima.sim(
    n = 500, list(ar = c(0.9)),
    innov = rnorm(n = 500, mean = 0, sd = 0.1)
))
df <- data.frame(y, t = seq_along(y))
# we see an initial trend in the data
plot(df$t, df$y)

# arima recovers the parameters nicely
fit_arima1 <- arima(y, order = c(1, 0, 0))
fit_arima1

# still looks good
fit_arima2 <- arima(y[101:500], order = c(1, 0, 0))
fit_arima2

# let's try it with Stan
stan_code <- "
data {
  int<lower=1> N;  // total number of observations
  vector[N] Y;  // response variable
}
parameters {
  real intercept;  // intercept
  real<lower=0> sigma;  // residual SD
  real<lower=-1,upper=1> ar;  // autoregressive effect
}
model {
  vector[N] mu = intercept + rep_vector(0, N);
  vector[N] e;
  e[1] = Y[1] - mu[1];
  target += normal_lpdf(Y[1] | mu[1], sigma);
  for (n in 2:N) {
    e[n] = Y[n] - mu[n];
    target += normal_lpdf(Y[n] | mu[n] + ar * e[n-1], sigma);
  }
}"

library(rstan)
sdata1 <- list(N = 500, Y = y)
fit1 <- stan(model_code = stan_code, data = sdata1)
fit1
# intercept is underestimated
# ar is overestimated (close to 1)

sdata2 <- list(N = 400, Y = y[101:500])
fit2 <- stan(model_code = stan_code, data = sdata2)
fit2
# looks good

library(brms)
fit3 <- brm(y ~ 1, data = data.frame(y), autocor = cor_ar(~1))
fit3
# intercept is underestimated
# ar is overestimated (close to 1)

fit4 <- brm(y ~ 1, data = data.frame(y = y[101:500]), 
            autocor = cor_ar(~1))
fit4
# looks good
paul-buerkner commented 5 years ago

You may also set cov = TRUE in cor_arma(), which is much slower when fitting the model but seems to yield more robust results if AR is close to 1.
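A minimal sketch of that variant of your model above:

### model the AR structure via the full residual covariance matrix (slower, but more robust near |ar| = 1)
m.brm.cov <- brm(log.trt ~ s(log.ref), data = pre,
                 autocor = cor_arma(~ day, p = 1, cov = TRUE),
                 cores = 4, seed = 30)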

Geomorph2 commented 5 years ago

Thanks Paul. Most of our daily discharge and water temperature time series have fairly high residual autocorrelation. We typically end up with ARMA models where at least one of the AR coefficients is in the 0.82-0.93 range. We'll stick with gls for those time-series models for now, but I'll keep running everything in brms as well and will try setting cov = TRUE. Your help and insight are much appreciated!