paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0

MRF smooth error #128

Closed wfolta closed 7 years ago

wfolta commented 7 years ago

I'm trying to use the Markov Random Field smoother to do something geostatistical. gamm understands this smoother, while gamm4 does not. It appears that brm does, or at least wants to:

> str (xt, max.level=1)
List of 1
 $ polys:List of 270
  .. [list output truncated]
> str (head (xt$polys))
List of 6
 $ S02000260: num [1:178, 1:2] 260718 260685 260704 260855 260928 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "X" "Y"
 $ S02000261: num [1:105, 1:2] 262047 262043 262059 262087 262119 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "X" "Y"
 $ S02000262: num [1:170, 1:2] 261159 261179 261246 261259 261501 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "X" "Y"
 $ S02000263: num [1:90, 1:2] 253771 253766 253754 253873 253884 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "X" "Y"
 $ S02000264: num [1:285, 1:2] 260183 260171 260171 260184 260183 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "X" "Y"
 $ S02000265: num [1:96, 1:2] 253252 253305 253353 253358 253362 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : NULL
  .. ..$ : chr [1:2] "X" "Y"
> g3 <- brm (logprice ~ s(IG, bs="mrf", k=20, xt=xt) + s(crime) + rooms + sales + type + logdriveshop, data=propertydata.spatial@data)
Error in model.frame.default(effects$all, data = data, na.action = na.pass,  : 
  invalid type (list) for variable 'xt'
> g3 <- brm (logprice ~ s(IG, bs="mrf", k=20) + s(crime) + rooms + sales + type + logdriveshop, data=propertydata.spatial@data)
Error in smooth.construct.mrf.smooth.spec(object, dk$data, dk$knots) : 
  penalty matrix, boundary polygons and/or neighbours list must be supplied in xt

Should this work? I've been doing some work with geospatial models, and the gold standard is supposed to be CAR models, followed by GAMM models -- but of course the GAMM needs to support MRF smoothing for that, at least if we intend to do areal calculations rather than point calculations.

[Love the improvements, again, in brms! Excited that areal geospatial models might be possible.]

paul-buerkner commented 7 years ago

I am not sure if this should work. Since brms currently uses a gamm4-like implementation, I assume it should not, but I have to go into the details to understand this thoroughly.

The error we see, however, is a bug in the parsing of smooth terms. I will try to fix it today and then we can try again to get the MRF smoothing working.

wfolta commented 7 years ago

Correct, gamm4 doesn't support the "mrf" smooth, so I assume you also would not. In gamm4's case, it has to do with its dependence on lmer, which it uses under the hood via some kind of conversion trick. The ultimate solution for brms is -- as you mention in the documentation -- via the autocor= argument and a cor_brms object. But it may be that brm could sense the bs="mrf" and say, "brms doesn't yet support spatial autocorrelation, but when it does it will use the autocor= mechanism".

Or maybe you actually could use s's "mrf" smooth... It automatically handles things like turning polygons into an adjacency matrix, so it has some useful magic built-in -- if it fits into your schema and provides similar power to a cor_brms solution, it could be worth supporting directly.

paul-buerkner commented 7 years ago

You are right. First, I have to get rid of the current parsing bug when specifying argument xt and then I will dig deeper into the MRF stuff. I think it is definitely worth implementing this sooner or later.

paul-buerkner commented 7 years ago

I think the bug has been fixed and the MRF smooth seems to work now in brms. At least for the sample data from the mgcv documentation, I get estimates very similar to those from gamm, but of course we don't get such nice looking maps. ;-)
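
For reference, the comparison was along the lines of the example from the mgcv help page for the mrf smooth (a sketch, not necessarily the exact script I ran):

library(mgcv)
library(brms)

data(columb)        # Columbus OH crime data shipped with mgcv
data(columb.polys)  # the corresponding district polygons
xt <- list(polys = columb.polys)

# reference fit with mgcv
b1 <- gam(crime ~ s(district, bs = "mrf", xt = xt),
          data = columb, method = "REML")

# the same model with the dev version of brms
b2 <- brm(crime ~ s(district, bs = "mrf", xt = xt), data = columb)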

I would be grateful if you tried it out using the dev version from github.

I noticed that this example also works with gamm4 and I thus wonder if I missed something. Didn't you say gamm4 does not support MRF?

wfolta commented 7 years ago

Odd, I just ran gam (from mgcv) and it works fine with the xt I'm using. When I use gamm4 (from gamm4) with the same arguments, I get an error:

Error in if (as.numeric(vr[[i]] > 0)) sp[k] <- scale/as.numeric(vr[[i]]) else sp[k] <- 1e+10 : 
  argument is not interpretable as logical
In addition: Warning message:
In sqrt(D) : NaNs produced

I thought I'd traced that down to gamm4 using lme4 under the hood and only supporting a limited subset of smoothers -- univariate only? -- but maybe it's just the particular xt I tested with. (I'm using data from CARBayesdata that I had to convert to something "mrf" liked -- in gam, anyhow.)

When I ran the same thing in brm, I got an error:

Error in FUN(X[[i]], ...) : Stan does not support NA (in Zs_1_1) in data
In addition: Warning messages:
1: In sqrt(D) : NaNs produced
2: In sqrt(D) : NaNs produced
3: In sqrt(D) : NaNs produced
4: In sqrt(D) : NaNs produced
5: In sqrt(D) : NaNs produced
failed to preprocess the data; sampling not done

which is similar to what I get in gamm4. Hmmm... my conversion of the data seemingly satisfies the s(..., bs="mrf", ...) as used by gam, but not as used by gamm4 and brm. Perhaps they use different arguments in the underlying workhorse functions. So I'll have to try other spatial datasets.

P.S. I'm doing an ordinal logistic regression at work, and brm was way faster and more reliable than stan_polr. I didn't have time to figure everything out, but I never got stan_polr to run without divergence warnings, while brm did so on the first try. I only started getting issues in brm after I switched from cumulative to sratio (or something like that). (cumulative makes sense for my use case, so the switch to sratio was more of an experiment, to see if any of the families would cause problems.)
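
The calls were of roughly this form -- a sketch with hypothetical data and variable names, not my actual work model:

library(brms)
# 'ratingdata' with an ordered factor 'rating' is made up for illustration
fit1 <- brm(rating ~ treatment, data = ratingdata, family = cumulative("logit"))
fit2 <- brm(rating ~ treatment, data = ratingdata, family = sratio("logit"))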

paul-buerkner commented 7 years ago

Somewhere in the conversion, NAs are produced, and gam(m) seems to handle them correctly, while gamm4 and brm don't. This is somewhat strange, as all of these functions use mgcv on the backend for the preparation of smoothing terms. Would you mind sending me a reproducible example? Maybe I can fix this in brms and point Simon Wood to this problem in gamm4.

The difference between gam(m) on the one hand and gamm4 and brm on the other is just that the latter two don't support the te and ti smoothers, while s and t2 work fine. For gamm4 this is because it uses lme4 internally, but for brm it is just a matter of the time I can spend on this feature, as the implementation of te and ti is less straightforward. The t2 smoother seems to be a good alternative to te and ti, though.

wfolta commented 7 years ago

I started with the CARBayes example:

library (CARBayesdata)   # provides the Greater Glasgow property price data
library (sp)             # GGHB.IG is a SpatialPolygonsDataFrame

data (GGHB.IG)
data (pricedata)
# drop the areas of GGHB.IG that have no rows in pricedata
missing.IG <- setdiff (rownames(GGHB.IG@data), pricedata$IG)
missing.IG.row <- which (missing.IG==rownames(GGHB.IG@data))
propertydata.spatial <- GGHB.IG[-missing.IG.row, ]
propertydata.spatial@data <- data.frame (propertydata.spatial@data, pricedata)

propertydata.spatial@data$logprice     <- log (propertydata.spatial@data$price)
propertydata.spatial@data$logdriveshop <- log (propertydata.spatial@data$driveshop)

Then I did a quick-n-dirty conversion of that data structure to something that works with an mrf smooth:

# take the coordinate matrix of the first polygon of each area
xt <- list (polys=lapply (propertydata.spatial@polygons, function (x) x@Polygons[[1]]@coords))

with gam:

g1 <- gam (logprice ~ s(IG, bs="mrf", k=20, xt=xt) + s(crime) + rooms + sales + type + logdriveshop, data=propertydata.spatial@data)

and with brm:

g3 <- brm (logprice ~ s(IG, bs="mrf", k=20, xt=xt) + s(crime) + rooms + sales + type + logdriveshop, data=propertydata.spatial@data)

which yields the NaN errors I posted above.

I'm glad I was wrong about the mrf smoother. The paper I'd been reading did a comparison and found that Bayesian CAR usually had the best results, followed by GAMMs, with GAMs usually not doing as well as either. A GAMM is more flexible -- and I get to use brm -- so I'm willing to use the nearly-as-good-but-second-best option instead of CARBayes, if I can.
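
In case my quick-n-dirty polygon conversion is part of the problem, an alternative -- an untested sketch -- would be to hand the smooth a neighbour list built with spdep instead of raw polygons:

library (spdep)
nb <- poly2nb (propertydata.spatial)   # neighbour indices derived from the polygons
nb <- lapply (nb, as.integer)          # plain named list, as the mrf smooth expects
names(nb) <- as.character (propertydata.spatial@data$IG)
g4 <- brm (logprice ~ s(IG, bs="mrf", k=20, xt=list(nb=nb)) + s(crime) + rooms + sales + type + logdriveshop, data=propertydata.spatial@data)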

paul-buerkner commented 7 years ago

The problem occurs only if k is set manually to a value smaller than the number of sites. I will investigate this further.

paul-buerkner commented 7 years ago

I realized that gamm also stops with an error; the only function that works is gam. From what I see, I believe this is a problem in the construction of the "random effects" parameterization of smoothing terms being used in gamm, gamm4, as well as brm. In short, I think that setting k to a value lower than the number of sites (270 in this case) causes trouble in the MRF smoother, but I am not sure whether this is a bug or not. I will ask Simon Wood about it.

wfolta commented 7 years ago

If I try the gam without k=20, it gives an error:

Error in gam(logprice ~ s(IG, bs = "mrf", xt = xt) + s(crime) + rooms +  : 
  Model has more coefficients than data

which I think makes sense: if the MRF smooth has as many EDFs as there are rows of data to begin with, where is the slack to fit all of my other coefficients (including multiple knots for crime)?

So I set k=20 and gam is happy. I didn't think to not set it for gamm4 and brm.

paul-buerkner commented 7 years ago

You can call brm without setting k, but I am not sure how long it takes to fit the model / how well it converges.

Edit: Fitting the model with brms without setting k worked completely fine and took less than 2 minutes on my machine.

wfolta commented 7 years ago

I was trying that as I was writing. I just finished running the brm call with just the k=20 removed; I had to crank adapt_delta up to 0.99 to get it down to 4 divergent transitions, and it took about 8 minutes to run on my laptop. Initial plots look okay, Rhats are okay, but effective samples for sigma and the MRF are only about 250. I'll try 0.999 and see if it works.

paul-buerkner commented 7 years ago

You could also try leaving adapt_delta at .99 and increasing iter instead. 4 divergent transitions are ok; likely we just need more iterations / a longer warmup.
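
That is, something along these lines (a sketch of the call, not tested on your data):

g3 <- brm (logprice ~ s(IG, bs="mrf", xt=xt) + s(crime) + rooms + sales + type + logdriveshop, data=propertydata.spatial@data, iter=4000, warmup=2000, control=list(adapt_delta=0.99))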


wfolta commented 7 years ago

Thanks, will do. I used to figure that if I got it down to a handful of divergent transitions I was okay. Then I watched Betancourt's (?) YouTube video describing how Stan samples and walked away thinking that even a single divergent transition is a sign that it's having trouble sampling part of the space and all bets may be off. (He didn't emphasize it that way; that's just my conclusion.)

Rstan recommends a pairs plot to diagnose divergences. Unfortunately, I don't have access to a machine that's fast enough to complete a pairs plot. In practice, it's a huge lattice plot of dozens (or hundreds) of scatter plots and it's sloooooooow. (Some of this may be ggplot, but the one thing about R that I find frustrating is how it bogs down on plots with lots of points.) So, someday I'll succeed in printing a pairs plot and then I'll learn how to interpret it.

Any alternative diagnostic you can think of would be a super-nice addition.

UPDATE: I ran 4000 iterations and got 16 divergent transitions. I thought the results might be interesting for you. (Seeing 5 coefficients makes me think that the ultimate MRF has 5 knots (or maybe intermediate knots, for a total of 6 or 7), which is obviously a lot fewer than 270. It does take a few extra seconds to print the Compiling Model message, which I assume is the smooth calculating furiously.)

 Family: gaussian (identity) 
Formula: logprice ~ s(IG, bs = "mrf", xt = xt) + s(crime) + rooms + sales + type + logdriveshop 
   Data: propertydata.spatial@data (Number of observations: 270) 
Samples: 4 chains, each with iter = 4000; warmup = 2000; thin = 1; 
         total post-warmup samples = 8000
   WAIC: Not computed

Spline Effects: 
                          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sds(sIGbsEQ"mrf"xtEQxt_1)     0.04      0.01     0.03     0.05        491 1.01
sds(scrime_1)                 0.38      0.36     0.02     1.35       1161 1.01

Population-Level Effects: 
             Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
Intercept        4.08      0.13     3.84     4.33       4071 1.00
rooms            0.23      0.03     0.17     0.28       5078 1.00
sales            0.00      0.00     0.00     0.00       7007 1.00
typeflat        -0.28      0.06    -0.39    -0.16       2790 1.00
typesemi        -0.18      0.05    -0.28    -0.07       3860 1.00
typeterrace     -0.31      0.06    -0.44    -0.19       3172 1.00
logdriveshop     0.00      0.03    -0.06     0.06       2862 1.00
sIG_1           -0.16      0.17    -0.50     0.18       8000 1.00
sIG_2           -0.33      0.17    -0.67     0.01       8000 1.00
sIG_3            0.07      0.17    -0.26     0.39       8000 1.00
sIG_4           -0.05      0.17    -0.37     0.29       8000 1.00
sIG_5            1.17      0.17     0.84     1.49       8000 1.00
scrime_1        -0.08      0.08    -0.28     0.03        960 1.01

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma     0.16      0.02     0.12     0.18        516 1.01

Also, when I plot the marginal effects of the MRF (log price vs. IG), it works, but it also prints a warning message about the T_SHOW_BACKTRACE environment variable.

paul-buerkner commented 7 years ago

The number of "fixed effects" used for a smoothing term does not indicate the number of knots. I am not sure on what basis mgcv decides how many of them are required.

You could use the argument pars in pairs to select the parameters you want displayed. When selecting only a few, making the pairs plot should take just a few seconds.
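
For example (a sketch; fit stands for your brmsfit object, and the exact parameter names can be looked up with parnames(fit)):

# fit$fit is the underlying stanfit object; rstan's pairs method accepts pars
pairs (fit$fit, pars=c("b_Intercept", "sigma", "lp__"))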

In my experience, a few divergent transitions don't do too much harm, but the Stan team knows much better than I do, of course. You could look at the diagnostic plots of shinystan by calling launch_shiny on a brmsfit object. Maybe @jgabry has some more suggestions.

jgabry commented 7 years ago

On Saturday, October 1, 2016, Paul-Christian Bürkner <notifications@github.com> wrote:

You could use the argument pars in pairs to select the parameters you want displayed. When selecting only a few, making the pairs plot should take just a few seconds.

Yeah, you essentially always have to select only a few parameters. I would include lp__ and then strategically select a few other parameters depending on what you want to look at. The efficiency of the pairs plot isn't related to ggplot, though (I think that may have been suggested in a previous post somewhere); the rstan pairs plot is still done with base R plotting.

In my experience, a few divergent transitions don't do too much harm, but the Stan team knows much better than I do, of course. You could look at the diagnostic plots of shinystan by calling launch_shiny on a brmsfit object. Maybe @jgabry https://github.com/jgabry has some more suggestions.

In theory, a single divergence is enough to threaten the validity of the estimates. But a single divergence, or a single-digit number of divergences, can usually be dealt with by lowering the initial step size and increasing adapt_delta (sometimes as high as 0.999 or even higher for nasty ones). So if you have 4 after increasing adapt_delta, I would think you could probably get rid of them by increasing it further and maybe also setting the initial stepsize to something like 1/10 of the stepsize it ends up with now. If this model is for an important project, then I wouldn't ignore the divergences. (In other cases, when there are a lot of divergences, those tricks aren't going to work and the model needs to be reparameterized.)
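
In brm that would look something like this (a sketch; formula and mydata are placeholders):

fit <- brm (formula, data=mydata, control=list(adapt_delta=0.999, stepsize=0.01))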

Like Paul says, it's true that often you won't see much of a difference in the estimates when you get rid of a small number of divergences. But there's absolutely no guarantee.

Jonah

jgabry commented 7 years ago

I haven't looked at the model, but MRF and the like can often have difficult posteriors. Are these Gaussian MRFs? The parameterization of the covariance/precision matrices can also matter a lot.


paul-buerkner commented 7 years ago

I am not really sure what exactly happens inside mgcv for this smooth term, but it appears to be working ok in Stan.

jgabry commented 7 years ago

I don't know what mgcv is doing either, but yeah Stan can definitely handle these models. Glad it seems to be working.


paul-buerkner commented 7 years ago

@wfolta Since the initial parsing bug has been fixed and the problem when specifying k occurs inside mgcv (I already informed Simon Wood about it), are you ok if we close this issue?

wfolta commented 7 years ago

I think you can close it, though there are really three moving parts here:

  1. Parsing bug: fixed. This part should be closed.
  2. Being able to specify k: being investigated, but not fixed.
  3. When it does work (either by not specifying k, or someday when that part's fixed), it's not clear that models using the MRF will successfully converge. The hope is that it's mostly converging now with unconstrained k (with only a small number of divergences) and won't have problems once we can turn k down. But that's not guaranteed.

The last two aren't closed, so maybe the best strategy is to open a different kind of issue for those two and close this one. (2) is sort of a bug, though it affects other non-gam functions, so it's not a coding error on your part -- probably something in Simon Wood's code. (3) has to await (2) being clarified, and it may be that gam's MRF creates things that are tough for Stan to handle. (Ultimately, you may need to use brm's correlation methodology to create your own spatial options.)

paul-buerkner commented 7 years ago

I agree. The thing with (2) is that I can't do anything to fix it (other than informing Simon Wood, which I have already done), and once it is fixed, it will work in brms without any required changes.

Point (3) is tricky, because even if it doesn't work well in the end, it's hard to change from my side. After all, brms just takes the smooth design matrices returned by mgcv and multiplies them by the smoothing parameters (adding a hierarchical normal prior), using best practices for coding in Stan.
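
As a conceptual sketch of that parameterization (this mirrors mgcv's smooth2random machinery, which gamm and gamm4 build on; it is not literally the brms code path):

library (mgcv)
dat <- data.frame (x=runif(100), y=rnorm(100))
sm <- smoothCon (s(x), data=dat, knots=NULL, absorb.cons=TRUE)[[1]]
re <- smooth2random (sm, vnames="", type=2)
# re$Xf holds the unpenalized ('fixed') columns of the smooth design matrix;
# re$rand contains the penalized ('random') design matrix Z. The coefficients
# of Z get the hierarchical prior, roughly:
#   y ~ ... + Z %*% (sds * zs),  zs ~ normal(0, 1)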

There is already an issue for implementing spatial correlation structures (#6), and I just added a reminder there to implement the CAR structure.

Many thanks again for reporting this issue and for the detailed and interesting discussion about the topic.

paul-buerkner commented 7 years ago

Simon Wood responded saying that the error will be fixed in the next release of mgcv. I could already test it with the pre-release version, and it seems to work correctly now. Using a lower k indeed decreases the sampling time drastically, and I get rid of all divergent transitions when setting adapt_delta = 0.95 for the example discussed in this issue.