greta-dev / greta

simple and scalable statistical modelling in R
https://greta-stats.org
Other
527 stars 63 forks source link

Modeling a correlated system of equations #166

Closed sadatnfs closed 6 years ago

sadatnfs commented 6 years ago

Hi Nick

I'd like to model a system of equations for 3 outputs (call them Y1, Y2, Y3), such that they are each influenced by some fixed effects (call them X1, X2, X3), and there exists some correlation between the 3 outputs in some 3x3 matrix structure. Here's my attempt at simulating and setting up the Greta model, but I feel like the correlation structure and multivariate distributions aren't correct. I looked at the issue in #131 but I think that's more for an intertwined network rather than correlated equations eh?

Here's my code. It'd be awesome I could get some feedback! Thanks!

Nafis


  ############################################################
  ## Our model is defined as:
  #### Y_1 ~ X_1 + X_2
  #### Y_2 ~ X_1
  #### Y_3 ~ X_2
  ## where there exists correlation between Y_1, Y_2, Y_3
  ############################################################

  ############################################################
  #### Simulate data
  ############################################################

  pacman::p_load(greta, data.table)
  N <- 2500
  X_1 <- rnorm(N, 2,3)
  X_2 <- rnorm(N, 1,1)
  Y_1 <- 0.1*X_1 + 0.3*X_2 + rnorm(N)
  Y_2 <- 0.2*X_1 + rnorm(N)
  Y_3 <- 0.2*X_2 + rnorm(N)

  ## Check correlation of Y_*
  cor(cbind(Y_1, Y_2, Y_3))

  ### Transform all the data into greta arrays
  X_1 <- as_data(X_1)
  X_2 <- as_data(X_2)
  Y_1g <- as_data(Y_1)
  Y_2g <- as_data(Y_2)
  Y_3g <- as_data(Y_3)

  ### Fixed effects on the Xs from the 3 equations

  Y1_X1 <- normal(0, 10)
  Y1_X2 <- normal(0, 10)
  Y2_X1 <- normal(0, 10)
  Y3_X2 <- normal(0, 10)

  ### Noise in Ys
  Y1_sigma <- lognormal(0, 1)
  Y2_sigma <- lognormal(0, 1)
  Y3_sigma <- lognormal(0, 1)

  ### Create the likelihoods of Ys
  Y1_LL <- Y1_X1*X_1 + Y1_X2*X_2
  distribution(Y_1g) = normal(Y1_LL, Y1_sigma)

  Y2_LL <- Y2_X1*X_1 
  distribution(Y_2g) = normal(Y2_LL, Y2_sigma)

  Y3_LL <- Y3_X2*X_2
  distribution(Y_3g) = normal(Y3_LL, Y3_sigma)

  #### Wrap the 3 Ys in a multivariate normal?

  ## Define correlation matrix of the Ys
  Y_Omega <- greta::lkj_correlation(1,dim = 3)
  Y_sigma <- greta::lognormal(0,1,dim=3)
  sigma_mat <- greta_array(dim = c(3, 3))
  diag(sigma_mat) <- Y_sigma
  Sigma <- sigma_mat %*% Y_Omega %*% sigma_mat

  Y_all <- cbind(Y_1, Y_2, Y_3)

  mod <- model(Sigma, Y_Omega, Y_sigma, Y1_sigma, Y2_sigma, Y3_sigma, Y1_X1, Y1_X2, Y2_X1, Y3_X2)

  draws <- mcmc(mod, n_samples = 500)
  colMeans(draws[[1]])
goldingn commented 6 years ago

Hey Nafis!

Sorry I'm so late replying, hectic days with travel and all.

I'm not sure I quite understand the model. From the way you've generated the data, there's no correlation in the Ys except that due to their shared relationships with the Xs.

So defining the independent distributions over Y_xg should be all you need to do. As it's currently written, the last section of code is an unconnected model, that should just be sampling from its prior.

goldingn commented 6 years ago

If the model you want is the one you're sampling from, then you're already there. If you wanted additional correlations in the Ys on top of what you are simulating, you would want to do something like:

mu <- cbind(Y1_LL, Y2_LL, Y2_LL)
distribution(Y_all) <- multivariate_normal(mu, Sigma) 
sadatnfs commented 6 years ago

Thanks Nick! Ultimately, when I fit my model on real data, I'd want to run the model with some latent correlation/covariance in the Ys (I think I just didn't do a good job of simulating data with correlated Ys !).

I just realized that you replied to me as I was writing this reply, that's pretty much what I'd want!

Question on that: it looks like mu's dimension is N x 3 for N = nrow(Y), but I'm getting this error:

distribution(Y_all) <- multivariate_normal(mu, Sigma)
Error: mean must be a 2D greta array with one column, but has dimensions 2500 x 3

where I evaluated Sigma as in the code (3x3). Any thoughts?

goldingn commented 6 years ago

Aah that's right, because I haven't yet resolved #123, the current implementation expects all the rows to have the same mean vector. You could loop through the rows defining the distribution for each row as a multivariate normal with mu = mu[i,] as a stopgap.

I'll try to prioritise that fix though.

sadatnfs commented 6 years ago

Awesome that's what I was thinking as well, but just didn't know if it was kosher. Thanks Nick!

sadatnfs commented 6 years ago

Alas, it's giving this weird issue:


Y_all <- as_data(t(cbind(Y_1, Y_2, Y_3)))
mu <- t(cbind(Y1_LL, Y2_LL, Y2_LL))

for(i in 1:nrow(mu)) {
  distribution(Y_all[i,]) <- multivariate_normal(mu[,i], Sigma)
}

Error: distributions can only be assigned to data greta arrays

I had to do the double transpose to make multivariate_normal play well with the dimensionality.

I think it might be best if I hold off until #123 ends up getting resolved eh?

goldingn commented 6 years ago

Yeah, the reason you're getting that is that extracting with [ is technically an operation, and greta won't let you define distribution on operation greta arrays, because it's very rare that that is a sensible thing to do.

You can do this:

Y_all <- cbind(Y_1, Y_2, Y_3)
mu <- cbind(Y1_LL, Y2_LL, Y3_LL)
t_mu <- t(mu)
for (i in seq_len(N)) {
  Y_all_i <- t(Y_all[i, ])
  distribution(Y_all_i) <- multivariate_normal(t_mu[, i], Sigma)
}

Though actually it will take a very long time to define the model, because of the for loop. So yes, it would probably be worth waiting if you can!