paul-buerkner / brms

brms R package for Bayesian generalized multivariate non-linear multilevel models using Stan
https://paul-buerkner.github.io/brms/
GNU General Public License v2.0

piecewiseSEM #303

Open mdainese opened 6 years ago

mdainese commented 6 years ago

Hi Paul, would you be willing to implement piecewise structural equation modeling (SEM) in brms? Here is the R package developed by Jon Lefcheck using a frequentist approach: https://github.com/jslefche/piecewiseSEM

paul-buerkner commented 6 years ago

Can you give me some details about what kind of models piecewiseSEMs are? That is, what does "piecewise" mean in this context?

mdainese commented 6 years ago

I have multiple predictor and response variables and I would like to unite them in a single causal network. Currently, I am fitting separate multilevel models for each response variable in brms. Would it be possible to implement a confirmatory path analysis, as described in this paper http://onlinelibrary.wiley.com/doi/10.1111/2041-210X.12512/full and implemented in the piecewiseSEM package, by using brms?

paul-buerkner commented 6 years ago

Thanks! I will look at the reference tomorrow, but maybe I can already help you. Suppose you have a multilevel mediation model with response y, predictor x, and mediator z, as well as multiple observations per person. Then you could specify a multivariate multilevel model via

bfy <- bf(y ~ x + z + (1|ID|person))
bfz <- bf(z ~ x + (1|ID|person))
bform <- bfy + bfz + set_rescor(FALSE)
fit <- brm(bform, <your data>, ...)

The |ID| part just makes sure the random effects are correlated, and set_rescor(FALSE) ensures that no residual correlation between y and z is estimated, since we already have z as a predictor for y. Is this similar to what you had in mind?

jebyrnes commented 6 years ago

BTW: I'm a co-developer on the in-progress new iteration of piecewiseSEM and am happy to answer questions. I was curious about the multivariate response framework you just implemented - can that be used here? The example you just gave seems to work, no?

There are two big problems that I see with piecewise in a Bayesian approach that I'm pondering. The first is model evaluation via d-separation. Not really sure what to do there, as that's a p-value based approach. The second is model comparison, but after some discussion with Florian Hartig, he seemed to agree that given that we can sum joint likelihoods, a summed WAIC should work. Is that what your model above would return?

The third tricky part - which Bayesian methods might be great at - is prediction: one might want no propagation of coefficient error, propagation of coefficient uncertainty only, or, finally, full propagation of error all the way through. Can that be achieved here?

paul-buerkner commented 6 years ago

I would assume the new framework can be used for such a purpose, but I really have to take a closer look at the models you are actually fitting via piecewiseSEM to tell for sure.

I have no idea what d-separation is, to be honest, but indeed, I would use WAIC or (even better) LOO via LOO(fit1, fit2) for model comparison, which works for multivariate models as well.

You can make predictions with residual error (and with or without random-effects error) via predict, and predictions without residual error (and with or without random-effects error) via fitted.
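Concretely, the four combinations might look like this (a sketch; fit and newdata stand in for your fitted model and new data; re_formula = NA is how brms excludes group-level effects):

```r
# predictions including residual error (observation-level noise):
predict(fit, newdata = newdata)                    # with random-effects error
predict(fit, newdata = newdata, re_formula = NA)   # without random-effects error

# expected values without residual error:
fitted(fit, newdata = newdata)                     # with random-effects error
fitted(fit, newdata = newdata, re_formula = NA)    # without random-effects error
```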

jebyrnes commented 6 years ago

You fit each response variable separately, but evaluate the model as a whole. Using WAIC and LOO on multivariate models, as long as it's an aggregate score, should be great.

D-sep tests are as follows: For any model, there are a number of missing links. Each one is a conditional statement of independence - of directional separation. In a frequentist framework, we can get a probability for each of those claims and then calculate an aggregate score to summarize whether the model fit well or was likely missing something. One could do this in a Bayesian framework with posterior p-values, but, once you have them, combining them into a score seems...odd. Can send you some key refs if you are interested.

On predict, would error propagate? For example, assume the following model.

x -> y1 -> y2

And y1 ~ N(y1_fit, sigma_1^2), y2 ~ N(y2_fit, sigma_2^2)

Now, if I input only a predicted value of x, could I get

  1. a prediction of y1_fit and y2_fit?
  2. a credible interval of y1_fit and y2_fit, assuming that the coefficient error in how x predicts y1_fit passes through to how y2_fit's CI is estimated?
  3. The same, but now allowing sigma_1^2 to propagate through to calculating y2 and its variability? Do you see what I'm getting at?

mdainese commented 6 years ago

I've just tried running a multivariate multilevel model using a simplified dataset:

bfy <- bf(N ~ S + (1|ID|groupID))
bfz <- bf(FS ~ N + (1|ID|groupID))
fit <- brm(bfy + bfz + set_rescor(FALSE), data = set6, cores = 4, refresh = 0, chains = 2)

And these are the results:

Group-Level Effects: 
~groupID (Number of levels: 15) 
                                Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sd(N_Intercept)                     0.11      0.03     0.07     0.18        735 1.00
sd(FS_Intercept)                    0.04      0.02     0.00     0.09        853 1.00
cor(N_Intercept,FS_Intercept)      -0.51      0.37    -0.98     0.45       1135 1.00

Population-Level Effects: 
              Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
N_Intercept       0.45      0.05     0.36     0.54        874 1.00
FS_Intercept      0.41      0.02     0.36     0.46       1159 1.00
N_S              -0.13      0.05    -0.22    -0.04       1953 1.00
FS_N              0.19      0.04     0.10     0.27       2000 1.00

Family Specific Parameters: 
          Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
sigma_N       0.30      0.01     0.28     0.32       2000 1.00
sigma_FS      0.28      0.01     0.27     0.30       2000 1.00

I've also run the model using the piecewiseSEM in a frequentist framework. Here, the results:

  response predictor   estimate  std.error p.value    
1       N        S   -0.1327276 0.04742636  0.0058  **
2      FS        N    0.1723488 0.04162190  0.0001 ***

In principle, the multivariate multilevel model seems to work well.

paul-buerkner commented 6 years ago

@mdainese That's good news!

@jebyrnes If you like, please send me some references for the D-separation. It is unlikely to ever be implemented in brms, but it might still be interesting.

With regard to predictions, the situation is as follows: If you define new data

newdata = data.frame(x = 1, y1 = NA)

and then try to run predict(fit, newdata = newdata) or fitted(fit, newdata = newdata), brms will complain, since y1 appears as a predictor for y2.

Hence, we have to propagate uncertainty via a somewhat more manual approach. Let's start with predictions that do not include sigma_y1:

fy1 <- fitted(fit, newdata = newdata, resp = "y1", summary = FALSE)
newdata2 <- expand.grid(x = newdata$x, y1 = as.vector(fy1))
fy2 <- fitted(fit, newdata = newdata2, resp = "y2", summary = FALSE)
fy2 <- as.matrix(diag(fy2))
colMeans(fy2)
posterior_interval(fy2)

The exact same approach can be used with predict instead of fitted to incorporate residual error (i.e., sigma_y1) in the predictions as well.
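Spelled out, that predict variant would be (a sketch with the same placeholder fit, newdata, and response names as in the fitted version above):

```r
# same two-step procedure, but with predict() so that residual error
# (sigma_y1, and sigma_y2 in the second step) is included in the draws
py1 <- predict(fit, newdata = newdata, resp = "y1", summary = FALSE)
newdata2 <- expand.grid(x = newdata$x, y1 = as.vector(py1))
py2 <- predict(fit, newdata = newdata2, resp = "y2", summary = FALSE)
py2 <- as.matrix(diag(py2))  # match posterior draws pairwise
colMeans(py2)
posterior_interval(py2)
```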

mdainese commented 6 years ago

Sorry, I've just realized that the brms-users Google group would have been the right place for this. Anyway, thank you for the suggestion. You're awesome! Great also to see that @jebyrnes is participating in this issue!

paul-buerkner commented 6 years ago

All good! If I hadn't just implemented multivariate models in this generality, piecewiseSEMs would have been a feature request, for which GitHub would have been the perfect place to ask. I am closing this issue now.

jebyrnes commented 6 years ago

Two nice intros to tests of D-Separation can be found at https://www.sciencedirect.com/science/article/pii/S1433831904700803 and https://www.jstor.org/stable/27650991?seq=1#page_scan_tab_contents

Bill Shipley's book on cause and correlation in biology has a great deal more detail.

In essence, if we have

x -> y1 -> y2

Our one conditional independence claim here is that of x,y2 given y1. So, we'd fit, say, y2 ~ y1 + x and then evaluate the p-value (in a frequentist framework - so, requires some translation) of x on y2. If we have a large number of independence claims (and @jslefche has some good code for building up what those claims are), then the p-values for each are traditionally combined into a test statistic that is approximately chi-squared. This is taken as an omnibus test of mis-fit.
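For reference, the traditional combination is Shipley's test: with k independence claims, Fisher's C = -2 * sum(log(p_i)) is approximately chi-squared with 2k degrees of freedom under the null that the model is consistent with the data. A minimal base-R sketch (the p-values here are made up for illustration):

```r
# p-values for each conditional independence claim,
# e.g. the coefficient of x in y2 ~ y1 + x (illustrative values)
p_values <- c(0.42, 0.17, 0.88)

# Fisher's C statistic: C = -2 * sum(log(p_i)),
# approximately chi-squared with 2k degrees of freedom
C_stat <- -2 * sum(log(p_values))
k <- length(p_values)
p_overall <- pchisq(C_stat, df = 2 * k, lower.tail = FALSE)

C_stat     # omnibus test statistic
p_overall  # a small value suggests the model is missing a link
```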

jebyrnes commented 6 years ago

Related - and I'm happy to start a new issue/feature request for this - can brms currently handle latent variables? That's another tricky topic, but I thought I'd ask, as it's at the core of a lot of more traditional covariance-based SEM. We haven't yet worked out how to shoehorn it into piecewise, although I have some ideas - I just run into weird instability with them.

jebyrnes commented 6 years ago

On prediction, two quick notes/questions

1) OK, so you have to propagate through the network manually. Hrm. I am working on something to handle this in piecewiseSEM - I'll share it when it's done, as it might be somewhat straightforward.

2) On these data frames with NAs, let's say I have

x -> y1 -> y2 -> y3

Do I need newdata = data.frame(x = 1, y1 = NA, y2 = NA) for predict and fitted? This might get unwieldy with multiple endogenous variables in even not terribly complex models.

cbrown5 commented 6 years ago

@jebyrnes Here is a good ecological example of Bayesian SEM, that also talks about d-sep testing: https://www.researchgate.net/profile/Yann_Clough/publication/230754566_A_generalized_approach_to_modeling_and_estimating_indirect_effects_in_ecology/links/55fc041f08aec948c4afd1e7.pdf

I don't think the d-sep test is necessary in a Bayes framework though, as you can just fit one complete model and then do your evaluation on that (ideally with a cross-validated measure).

paul-buerkner commented 6 years ago

Thanks for all the details! As I said, it is unlikely that some sort of Bayesian d-separation will eventually make it into brms, since we already have cross-validation via LOO and related methods, but I will take a closer look at what you sent me.

Latent variables in a non-random-effects sense are not yet supported. At least not in the way SEMs can handle them. This is definitely worth a new issue.

Propagating through the network manually is currently more of a workaround, and should eventually be replaced by something that feels more natural and is less error prone.

With regard to the NAs, yes, currently you have to specify it that way, because otherwise brms will complain that these variables are missing. It is just something I need to code properly, so that only variables in the currently evaluated sub-models are required instead of all variables.

jebyrnes commented 6 years ago

So.... http://rpubs.com/jebyrnes/343408/ - thoughts?

paul-buerkner commented 6 years ago

Your example does not work because you tried to predict rich directly, with cover being NA, which of course causes problems.

To make it work, you have to first predict cover and, in a second step, rich using the predicted values of cover.

On 2017-12-20 at 19:43, Jarrett Byrnes (@jebyrnes) wrote:

Hrm. No, prediction...doesn't work? This is a classic teaching example I use (and am writing up a .Rmd about)

library(brms)

keeley <- read.csv("http://byrneslab.net/classes/lavaan_materials/Keeley_rawdata_select4.csv")

rich_mod <- bf(rich ~ firesev + cover)
cover_mod <- bf(cover ~ firesev)

k_fit_brms <- brm(rich_mod + cover_mod + set_rescor(FALSE), data = keeley, cores = 4, chains = 2)

# prediction
newdata = data.frame(firesev = 5, cover = NA)

newpred <- predict(k_fit_brms, newdata = newdata, resp = "rich", nsamples = 1000, summary = FALSE)

This produces

Error in tcrossprod(b, X) : non-conformable arguments Error: Something went wrong. Did you transform numeric variables to factors or vice versa within the model formula? If yes, please convert your variables beforehand. If no, this might be a bug. Please tell me about it.

So, I'm telling you about it!

paul-buerkner commented 6 years ago

@jebyrnes Your link to rpubs doesn't work for me.

paul-buerkner commented 6 years ago

Update: in the latest dev version of brms, you no longer have to specify NA in newdata for unused variables. Still, the manual "2-step" procedure as described above remains necessary.

jebyrnes commented 6 years ago

Awesome - also, check rpubs again - it pulls up for me... Or try http://rpubs.com/jebyrnes/brms_bayes_sem

jebyrnes commented 6 years ago

FYI, see sortDag in https://github.com/jslefche/piecewiseSEM/blob/2.0/R/Dag.R for putting equations in order in an SEM from exogenous to terminal endogenous.
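The ordering sortDag produces is a topological sort of the DAG. For readers unfamiliar with the idea, here is a minimal base-R sketch of Kahn's algorithm (the edge list is made up for illustration; this is not the piecewiseSEM implementation):

```r
# edges as (from, to) pairs, encoding x -> y1 -> y2
edges <- data.frame(from = c("x", "y1"), to = c("y1", "y2"),
                    stringsAsFactors = FALSE)
nodes <- union(edges$from, edges$to)

# Kahn's algorithm: repeatedly emit nodes with no remaining parents
topo_sort <- function(nodes, edges) {
  sorted <- character(0)
  while (length(nodes) > 0) {
    roots <- setdiff(nodes, edges$to)  # nodes with no incoming edge left
    if (length(roots) == 0) stop("graph contains a cycle")
    sorted <- c(sorted, roots)
    nodes <- setdiff(nodes, roots)
    edges <- edges[!(edges$from %in% roots), , drop = FALSE]
  }
  sorted
}

topo_sort(nodes, edges)  # "x" "y1" "y2"
```

Fitting (or predicting from) the sub-models in this order guarantees that every predictor is available before it is needed.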

paul-buerkner commented 6 years ago

Nice! Please note that I made slight changes to the way we propagate uncertainty in brms. For simplicity, I am posting it again here:

newdata = data.frame(x = 1)
fy1 <- fitted(fit, newdata = newdata, resp = "y1", summary = FALSE)
newdata2 <- expand.grid(x = newdata$x, y1 = as.vector(fy1))
fy2 <- fitted(fit, newdata = newdata2, resp = "y2", summary = FALSE)
fy2 <- as.matrix(diag(fy2))
colMeans(fy2)
posterior_interval(fy2)

The new thing is fy2 <- as.matrix(diag(fy2)). This makes sure that the posterior draws of the regression coefficients and the draws of y1 obtained via fy1 really match. Otherwise, we will likely overestimate uncertainty.
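To see why the diagonal is the right thing, note that fitted (with summary = FALSE) returns a draws-by-observations matrix, so fy2 is S x S when newdata2 contains one row per posterior draw of y1. Entry [s, s] applies coefficient draw s to y1 draw s, which is the pairing we want; off-diagonal entries mix draws from different iterations and thus inflate uncertainty. A toy illustration of just the matrix logic (no brms involved):

```r
set.seed(1)
S  <- 4                      # number of posterior draws (tiny, for illustration)
b  <- rnorm(S, mean = 2)     # draws of the y1 -> y2 coefficient
y1 <- rnorm(S, mean = 10)    # draws of the predicted y1

# fitted() analogue: every coefficient draw applied to every y1 draw
fy2 <- outer(b, y1)          # S x S: rows = draws, cols = y1 values

# diag keeps only the matched pairs (draw s with draw s)
matched <- diag(fy2)
all.equal(matched, b * y1)   # TRUE
```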

derekpowell commented 6 years ago

Just a question about propagating uncertainty in this way:

For variables with multiple endogenous predictors (or nodes with multiple parents, in graphical terms) it seems like the number of samples being generated is going to multiply exponentially and things will get pretty computationally hairy pretty quick. E.g., for a node with just 3 endogenous parents and 1000 samples for each, doing expand.grid() will yield 1 billion rows, and the resulting predict call will give 1 trillion values.

Maybe that's just the nature of the beast? Or is there a smarter workaround I'm not seeing?

paul-buerkner commented 6 years ago

Not necessarily, but I haven't thought about it enough, to be honest. Could you give me an example (in terms of a model) of your situation to help me think about it?

derekpowell commented 6 years ago

It's possible I'm just getting myself confused. I guess I'm asking if this approach can be generalized to propagate uncertainty throughout a larger network with more "layers". Rereading this thread, it sounds like it can. Here's an example to illustrate my question, since I'm having a bit of trouble putting it into words this morning.

Imagine we have a network with the following edges:

A -> C
B -> C
B -> D
C -> D
C -> E
D -> E

This network has exogenous nodes A and B, with all other nodes endogenous. Node E has two endogenous parent nodes, C and D. So predicting E from known A and B values means propagating uncertainty through C and D. I guess I'm wondering if the fy2 <- as.matrix(diag(fy2)) bit works properly in this case? Without it, you get the combinatorial explosion I was worried about.

Here's some code to simulate data and make these predictions following your example earlier:

library(brms)
library(dplyr)

# # simulate some data for a network ...

sim_df <- data.frame(A = rnorm(1000, 0, 3)) %>%
  mutate(B = rnorm(1000, 0, 2)) %>%
  mutate(C = .5 * A + .25 * B + rnorm(1000, 0, 1)) %>%
  mutate(D = .8 * B + -.35 * C + rnorm(1000, 0, 2)) %>%
  mutate(E = .15 * C + .33 * D + rnorm(1000, 0, 3))

# # fit the model ...

mod_C <- bf(C ~ A + B)
mod_D <- bf(D ~ C + B)
mod_E <- bf(E ~ C + D)

fit_brm_sim <- brm(mod_C + mod_D + mod_E + set_rescor(FALSE),
  data = sim_df,
  chains = 2,
  cores = 2
)

# # do predictions ...

# predict C from A and B
newdata <- data.frame(A = 2, B = 1)

pred_C <- predict(fit_brm_sim,
  newdata = newdata,
  resp = c("C"), nsamples = 1000,
  summary = FALSE
)

# predict D from B and (predicted) C
newdata2 <- expand.grid(A = newdata$A, B = newdata$B, C = as.vector(pred_C))

pred_D <- predict(fit_brm_sim,
  newdata = newdata2,
  resp = c("D"), nsamples = 1000,
  summary = FALSE
)
pred_D <- as.matrix(diag(pred_D))

# predict E from predicted C and predicted D
newdata3 <- newdata2 %>% mutate(D = as.vector(pred_D))

pred_E <- predict(fit_brm_sim,
  newdata = newdata3,
  resp = c("E"), nsamples = 1000,
  summary = FALSE
)

pred_E <- as.matrix(diag(pred_E))

# # result is draws from posterior?
posterior <- newdata3 %>% mutate(E = as.vector(pred_E))

Am I doing this right? And at the end of that, is posterior really a proper sample of the posterior conditioned on A = 2 & B = 1? Originally I was thinking you would need to use expand.grid() at each step, e.g., newdata3 <- expand.grid(data.frame(A=2, B=1, C=as.vector(pred_C), D=as.vector(pred_D)))--which leads to too many samples and crashes my computer.

(Thanks, and apologies if this isn't the right place for this, happy to move this to the new board if it's not).

paul-buerkner commented 6 years ago

This looks reasonable to me and I think you are doing this right. I will basically need to implement something similar when automating this process and so your thoughts and ideas are very helpful to me!

strengejacke commented 6 years ago

May I "hijack" this thread? I wanted to ask whether there is a tutorial or similar that "translates" lavaan/SEM formula syntax into regression formula syntax. I understand @jebyrnes' tutorial comparing piecewiseSEM with brms. But I have, for instance, formulas for growth models in lavaan that I would love to "translate" into brms syntax, like:

model2b <- '
  # intercept and slope with fixed coefficients
  i =~ 1*Pig1 + 1*Pig2 + 1*Pig3
  s =~ 0*Pig1 + 1*Pig2 + 2*Pig3
  # regressions
  i ~ BDI1 + sex
  s ~ i + BDI1
  '

or

model3a <- '
  # intercept and slope with fixed coefficients
  i =~ 1*Pig1 + 1*Pig2 + 1*Pig3
  s =~ 0*Pig1 + 1*Pig2 + 2*Pig3
  # regressions
  i ~ sex
  s ~ i
  # time-varying covariates
  Pig1 ~ BDI1
  Pig2 ~ BDI2
  Pig3 ~ BDI3
  '

I don't want to (mis-)use this thread for tutorial-stuff, but a pointer where I may find more on this topic would be very nice! :-)

paul-buerkner commented 6 years ago

See #304 for the issue about SEM-style latent variables. The models you are presenting don't need that and can be fitted with ordinary multilevel syntax in brms. See vignette("brms_overview") and vignette("brms_multilevel") for a start.
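For illustration, here is a hedged sketch of how such a growth model might be translated (my reading, not a verified equivalence): reshape the data to long format, one row per person and time point, with time coded 0, 1, 2 to match the slope loadings. The random intercept and slope then play the roles of the latent i and s, and i ~ BDI1 + sex and s ~ BDI1 become main effects and a time interaction. The s ~ i regression has no direct analogue; the correlated random effects instead estimate the intercept-slope correlation.

```r
library(brms)

# hypothetical long-format data with columns:
#   person, time (0, 1, 2), Pig (outcome), BDI1, sex
bform <- bf(Pig ~ time + BDI1 + sex + time:BDI1 + (1 + time | person))
# fit <- brm(bform, data = long_data, ...)
```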

Going forward, please ask questions on http://discourse.mc-stan.org/ using the "Interfaces - brms" tag.

strengejacke commented 6 years ago

Thanks! Wasn't sure if this question was too "generic" for brms and the Stan forums, but I'll go over there for questions which are not "GitHub"-related.

jgabry commented 6 years ago

Yeah it’s not always clear where the right place to ask is. In general we recommend the following:

zacksteel commented 4 years ago

I've built a pretty complex multivariate model using brms (thank you for the fantastic package, by the way!) analogous to the examples above. I'm now trying to make predictions through the model network but am finding it difficult. Specifically, rather than having a single set of exogenous variables, I have dozens. This extra dimension in the data makes the expand.grid() and as.matrix(diag()) steps trickier and potentially more error prone (I think). Are there any examples similar to Derek Powell's above that walk through "manual" predictions with many sets of exogenous variables? Or better yet, will predictions through a model network be automated in the near future?

Thanks

paul-buerkner commented 4 years ago

In the future (i.e., brms 3.0), I am planning to automatically propagate uncertainty through the network, but this will require a refactor of how the data/parameter interaction works, since predicted data become parameter draws but need to be used in place of data. I am not sure how this is going to look exactly, though.