tbates / umx

Making Structural Equation Modeling (SEM) in R quick & powerful
https://tbates.github.io/
44 stars 17 forks source link

basic ordinal ACE misspecified #150

Closed jpritikin closed 3 years ago

jpritikin commented 3 years ago

Something is wrong with the model parameterization

   true.A true.C true.E   ml.A   ml.C  ml.E
1   0.900  0.000  0.100  0.443  0.017 0.540
2   0.000  0.900  0.100 -0.021  0.502 0.519
3   0.000  0.000  1.000 -0.291  0.174 1.117

I'm attaching some code to reproduce the problem. See umxbug.R.txt

tbates commented 3 years ago

Can you give me any more background here in the report?

What umx function? Anything unusual about the data? Is it working with simpler data (convergence problem) or with different data? What are the rows of the table you show?

Quick look at your script: it's the umxACEv function with an n= 1000 simulated 9-level ordinalized variable? Anything odd about the cut-points? Does this problem happen in umxACE? Or in the umxACEv ordinal example? Have you looked inside the model at the threshold matrix? Anything look wrong?

I'm just not sure where to start looking.

tbates commented 3 years ago

It would help to strip the research script down to something that just demonstrates what you think is the issue and no more.

I notice that the discrepancy is not with another twin model on the data or with correlations observed in the generate data, but with a vector ace you create...

Have you verified that the correlations in the simulated datasets conform to the expected values?

mcneale commented 3 years ago

I have been poking around in the model and so far don’t see anything amiss. I’ll keep looking though. It is a bit weird that the estimate of E is always more than half the variance. Need to check the sim code as well.

tbates commented 3 years ago

Hi both,

results from umxACEv applied to the continuous version of your simulated data look good, so assume that part of data generation is OK.

In umx test code, the ordinalized bmi example is recovering the right values (given data degradation from ordinalizing), so it's a bit odd if it doesn't work here. Here's an example:

require(umx)
data(twinData)
mzData = twinData[twinData$zygosity %in% "MZFF", ]
dzData = twinData[twinData$zygosity %in% "DZFF", ]
m1 = umxACEv(selDVs = "bmi", dzData = dzData, mzData = mzData, sep = '')

# Cut bmi column to form ordinal obesity variables
cutPoints       = quantile(twinData[, "bmi1"], probs = c(.5, .2), na.rm = TRUE)
obesityLevels   = c('normal', 'overweight', 'obese')
twinData$obese1 = cut(twinData$bmi1, breaks = c(-Inf, cutPoints, Inf), labels = obesityLevels) 
twinData$obese2 = cut(twinData$bmi2, breaks = c(-Inf, cutPoints, Inf), labels = obesityLevels) 

# Make the ordinal variables into umxFactors (ensure ordered is TRUE, and require levels)
twinData[, c("obese1", "obese2")] = umxFactor(twinData[, c("obese1", "obese2")])
mzData = twinData[twinData$zygosity %in% "MZFF", ]
dzData = twinData[twinData$zygosity %in% "DZFF", ]
m2 = umxACEv(selDVs = "obese", dzData = dzData, mzData = mzData, sep = '')
A1 C1 E1
cont bmi 0.807 -0.063 0.256
ord obese 0.681 0.065 0.254

I can verify that the model parameters returned in umxACEv after the data in your script are processed through ordinalize2 aren't what was expected. However the data after ordinalize2 are not looking as I expected: MZ twins with similar continuous scores are ending up with quite different threshold assignments (note, the thresholds have numeric names...)

original continuous values

head(mzScore)
        bmi1        bmi2 rel
1 -0.6255840  0.08986287   1
2  0.2395135  0.48310280   1
3  2.0892120  2.30667109   1
4 -1.7674594 -2.02523947   1
5  1.6257317  1.90136905   1
6 -0.3669740 -0.77757352   1

After going through `ordinalize2`:

```R
head(mzScore)
                bmi1               bmi2 rel
1 -0.359631189939848  0.684166953988151   1
2  -1.40342933386785  -1.40342933386785   1
3  0.684166953988151   1.72796509791615   1
4  -3.49102562172384  -2.44722747779584   1
5 -0.359631189939848  0.684166953988151   1
6 -0.359631189939848 -0.359631189939848   1

This function is doing something other than simply cut(bmi1, breaks = c(-Inf, cutPoints, Inf), labels = obesityLevels), but at a glance I'm not sure what it's doing or what the call to outcome_prob is doing.

ordinalize2 <- function(score, br) {
    prob = sapply(1:(length(br)+1), function(x) outcome_prob(score, br, x))
    ord = apply(prob, 1, function(x) sample.int(length(br)+1, 1, prob=x)) 
    levels(ord) = c("-inf",br)
    class(ord) = c('ordered', 'factor')
    ord
}
mcneale commented 3 years ago

Correct, the bug appears to be with the simulation, not with umxACEv. Phew, on both sides of the pond :).