simsem / simsem

Package for simulating data for structural equation modeling
http://simsem.org/
56 stars 24 forks source link

Help: Convert lavaan syntax style to simsem matrix style #57

Closed b1azk0 closed 4 years ago

b1azk0 commented 5 years ago

I was trying to run a monte carlo simulation for estimating a sample size for a hypothesized moderated mediation (https://github.com/simsem/simsem/issues/56).

The goal here is sample size number for a study where: cb = X is experimental condition coded as -0.5 for one group and 0.5 for the other group gnt = M continuous mediator mean centered cn = V continuous moderator mean centered cngn = MxV product of cn * gnt INT = Y uncentered

Based on previous studies (SPSS PROCESS macro outputs) I took parameters for my population model:

popModel <- "
# Regression paths
gnt ~ 1.202*cb
INT ~ 0.576*gnt
INT ~ -0.006*cn
INT ~ 0.069*cngn
INT ~ -0.03*cb

# (co)variances
gnt ~~ 0.616*gnt
INT ~~ 0.362*INT
cb ~~ 0.250*cb
cb ~~ 0.029*cn
cb ~~ -0.165*cngn
cn ~~ 2.528*cn
cn ~~ -0.102*cngn
cngn ~~ 2.919*cngn
"

Then I've setup analysis model testing for indirect effects and conditional indirect effects the following way:

analyzeModel <- "
# Model 14 PROCESS Hayes - moderated mediation
gnt ~ a*cb
INT ~ b1*gnt + b2*cn + b3*cngn + c*cb

indirect := a*b1
direct := c

ab3 := a * b3
loCN := a * b1 + ab3 * -0.5
hiCN := a * b1 + ab3 * 0.5
"

If I assume that above models are written properly (are they?) the simulation code I wrote is this:

Output <- sim(NULL, n=200:500, model = analyzeModel, generate = popModel, lavaanfun = "sem")
# summary(Output)
# plotCutoff(Output, 0.05)

Cpow <- getPower(Output)
findPower(Cpow, "N", 0.80)
plotPower(Output, powerParam = c("ab3", "indirect"))

Besides not being sure wheter or not I coded both population and analysis model the right way, I would also like to have the possibility to randomize population model parameters (eg. drawn from a uniform distribution of effects I observed in earlier studies). As I belive this is not possible via lavaan model syntax but I can't work my way to translate it into simsem model syntax.

Could you please help me out?

TDJorgensen commented 5 years ago

The syntax looks basically correct if those were 5 different variables, but your comment says this is moderated mediation. If your model includes a product term, you cannot generate it as a separate variable because it would not be the product of the two lower-order terms. Moderation falls outside the purview of covariance structure analysis (which only models linear effects), so simsem cannot be used to generate your data. You would need to write your own function to:

You could then pass that data-generating function to sim() because the generate= argument accepts a function with only a sample-size argument. You can find some guidance online, for example:

https://stats.stackexchange.com/questions/59062/multiple-linear-regression-simulation

Good luck, Terrence

b1azk0 commented 5 years ago

Oh my... That sounds complicated to say the least and far from my comfort zone.

If I may, I'd like to learn something, despite the fact that maybe I won't be able to compute the sample size in question.

This is how the hypothesized model look like (as in Hayes PROCES templates):

image

From what covariance matrix I should start (keeping in mind I have some real data I want to with) ? For the outcome:

Product terms key: 
 Int_1    :        GNT      x        CN 

Covariance matrix of regression parameter estimates: 
           constant   Cyberbal        GNT         CN      Int_1 
constant     0.0037    -0.0047     0.0015     0.0002    -0.0001 
Cyberbal    -0.0047     0.0089    -0.0027    -0.0003     0.0002 
GNT          0.0015    -0.0027     0.0019     0.0001     0.0000 
CN           0.0002    -0.0003     0.0001     0.0005     0.0000 
Int_1       -0.0001     0.0002     0.0000     0.0000     0.0004

Fot the mediator

Covariance matrix of regression parameter estimates: 
           constant   Cyberbal 
constant     0.0041    -0.0041 
Cyberbal    -0.0041     0.0077
TDJorgensen commented 5 years ago

Well, X and V are the exogenous predictors, so start by generating those from their covariance matrix (e.g., using the mvrnorm() function in the MASS package, which returns a matrix that you can convert to a data.frame). Then you add the product term to the data set (e.g., dat$xv <- dat$x * dat$v). Then you can generate the endogenous M and Y variables using their regression model parameters (and generating random errors), as shown in the page I linked to above.

b1azk0 commented 5 years ago

Hi Terrence, thank for the hint. Unfortunately I got stuck at the beginning - what would be the sample size for mvrnom() to begin with? Would semTools::indProd() work here to make the generation process easier?

TDJorgensen commented 5 years ago

You choose the sample size.

indProd() is for multiple indicators of a latent interaction between common factors. You only have observed variables.

b1azk0 commented 5 years ago

I know that I should choose the sample size, but if I'm after estimating the sample size, this should be somehow connected to sim(n=100:500) right?

TDJorgensen commented 5 years ago

Yes, as stated above, you would need to write a function that has sample size as its argument, then pass that to sim(..., generate=). So you would still be able to use a variable n for estimating power and choosing N.

b1azk0 commented 5 years ago

Terrence, thanks for your help. I wasn't able to move forward and got stuck at the beginning.

I have the 1st part sorted (I think)


> n=100
> 
> (Sigma<-df %>% select(M, V) %>% var(na.rm = T))
           M          V
M 1.45604513 0.09586241
V 0.09586241 2.10041305
> 
> dat <- mvrnorm(n, rep(0, 2), Sigma) %>% as.data.frame() %>%
+   mutate(MxV = M*V)
> 

I select M and V from pilot study database and store they covariance matrix as Sigma. Then I compute their product and store the 3 variables in a data.frame.

So the next step would be generating X and Y? But how do I account for what is already in the data frame given orginal model parameters?

b1azk0 commented 5 years ago

Good evening Terrence. I was doing my best and eventually couldn't solve it. Is there any chance I can help me with this or should I close this as off topic? Thanks

b1azk0 commented 5 years ago

Well, X and V are the exogenous predictors, so start by generating those from their covariance matrix (e.g., using the mvrnorm() function in the MASS package, which returns a matrix that you can convert to a data.frame). Then you add the product term to the data set (e.g., dat$xv <- dat$x * dat$v). Then you can generate the endogenous M and Y variables using their regression model parameters (and generating random errors), as shown in the page I linked to above.

I think I know where my problem is. You wrote X and V are the exogenous predictors. BUT: For the moderator part I need to have a product of M and V (not X and V). In terms of PROCESS templates - I think you wrote about model 7, while I'm thinking about model 14.

Does this make any sense?

I already generated M and V from their covariance matrix: https://github.com/simsem/simsem/issues/57#issuecomment-466420873 using they (co)variance calculated from real dataset. Adding the product of them is easy and I did it how you wrote. At this point I have a data frame with M, V and MxV and that's where I stopped.

TDJorgensen commented 5 years ago

But M is not exogenous. It sounds like you need to simulate X and V as exogenous variables, then simulate M as an endogenous variable predicted by X (and whose residuals are correlated with V). So use MASS::mvrnorm() to simulate a correlated set of 3 variables: X, V, and the residuals of M.

b1azk0 commented 5 years ago

Ahh... This makes perfect sense! I will report back, once the data generating function is working.

Thanks