rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/

joint_tests omits covariate transformed with asin(sqrt()) via make.tran() #438

Closed · hnguyen19 closed this issue 10 months ago

hnguyen19 commented 11 months ago

Describe the bug

joint_tests omits covariate transformed with asin(sqrt()) via make.tran()

To reproduce

set.seed(83)
library(emmeans)

dummy <- expand.grid(
  covar  = runif(72, min = 0, max = 1),
  resp = runif(72, min = 0, max = 2),
  site = c("a", "b", "c", "d", "e", "g"),
  C = 1:4,
  R = c("i", "j", "k")
)

as.tran <- make.tran("asin.sqrt")
mod <- with(as.tran, lm(sqrt(resp) ~ site + C * R * linkfun(covar), data = dummy))
joint_tests(mod)

model term df1    df2 F.ratio p.value
 site         5 373231   0.000  1.0000
 C            1 373231   0.000  1.0000
 R            2 373231   0.000  1.0000
 C:R          2 373231   0.000  1.0000

Switching resp and covar, joint_tests(mod2) below gave the full ANCOVA table:

mod2 <- with(as.tran, lm(linkfun(covar) ~ site + C * R * sqrt(resp), data = dummy))
joint_tests(mod2)
 model term df1    df2 F.ratio p.value
 site         5 373231   0.000  1.0000
 C            1 373231   0.000  1.0000
 R            2 373231   0.000  1.0000
 resp         1 373231   0.000  1.0000
 C:R          2 373231   0.000  1.0000
 C:resp       1 373231   0.000  1.0000
 R:resp       2 373231   0.000  1.0000
 C:R:resp     2 373231   0.000  1.0000

Expected behavior

A full ANCOVA table similar to the table obtained from joint_tests(mod2).

rvlenth commented 11 months ago

The covariate is actually taken into account, but then the results are suppressed because those terms have zero d.f.:

> joint_tests(mod, show0df = TRUE)
 model term   df1    df2 F.ratio p.value note
 site           5 373231   0.000  1.0000     
 C              1 373231   0.000  1.0000     
 R              2 373231   0.000  1.0000     
 covar          0     NA      NA      NA  d e
 C:R            2 373231   0.000  1.0000     
 C:covar        0     NA      NA      NA  d e
 R:covar        0     NA      NA      NA  d e
 C:R:covar      0     NA      NA      NA  d e
 (confounded)   0 373231     NaN     NaN     

d: df1 reduced due to linear dependence 
e: df1 reduced due to non-estimability 

This in turn happens because, by default, joint_tests() reduces the covariate using the meanint() function, which returns mean(x) + c(-1, 1):

> meanint(dummy$covar)
[1] -0.5176167  1.4823833
> as.tran$linkfun(meanint(dummy$covar))
[1] NaN NaN
Warning messages:
1: In sqrt(mu/alpha) : NaNs produced
2: In asin(sqrt(mu/alpha)) : NaNs produced

That is, the transformation is applied to values outside its valid range. If you use a function that yields values within the interval (0,1), you get less puzzling results:

> joint_tests(mod, cov.reduce = function(x) mean(x) + c(-0.1, 0.1))
 model term df1    df2 F.ratio p.value
 site         5 373231   0.000  1.0000
 C            1 373231   0.000  1.0000
 R            2 373231   0.000  1.0000
 covar        1 373231   0.000  1.0000
 C:R          2 373231   0.000  1.0000
 C:covar      1 373231   0.000  1.0000
 R:covar      2 373231   0.000  1.0000
 C:R:covar    2 373231   0.000  1.0000
rvlenth commented 11 months ago

BTW, the example is kind of flaky because you use response and covariate values as dimensions in the grid you created. I think you intended to do:

set.seed(1234)  # so random numbers are reproducible

dummy <- expand.grid(
  site = c("a", "b", "c", "d", "e", "g"),
  C = 1:4,
  R = c("i", "j", "k")
)
dummy$covar <- runif(72, min = 0, max = 1)
dummy$resp <- runif(72, min = 0, max = 2)

Now if we re-fit the same model to this corrected data set, we see P values other than 1:

> joint_tests(mod, cov.reduce = \(x) mean(x) + c(-0.1, 0.1))
 model term df1 df2 F.ratio p.value
 site         5  55   1.038  0.4049
 C            1  55   0.728  0.3972
 R            2  55   4.993  0.0102
 covar        1  55   0.000  0.9948
 C:R          2  55   0.623  0.5400
 C:covar      1  55   0.661  0.4197
 R:covar      2  55   0.513  0.6013
 C:R:covar    2  55   0.202  0.8173
hnguyen19 commented 11 months ago

> The covariate is actually taken into account, but then the results are suppressed because those terms have zero d.f. […]

Thank you for the explanation and for correcting my dummy data set. I thought asin.sqrt accepted [0, 1] inclusive, since mod ran error-free (I did see NAs when some values on the original scale were larger than 1). I am confused about why the 0 df occurred. Transforming covar and saving it as a column before fitting the model did not return any 0-df terms. I'm using your revised dummy data set in the next step.

dummy$covar.t <- asin(sqrt(dummy$covar))
# implied fit and call, reconstructed from the output below (model name assumed)
mod3 <- lm(sqrt(resp) ~ site + C * R * covar.t, data = dummy)
joint_tests(mod3)
 model term  df1 df2 F.ratio p.value
 site          5  55   1.038  0.4049
 C             1  55   0.692  0.4090
 R             2  55   5.055  0.0097
 covar.t       1  55   0.000  0.9948
 C:R           2  55   0.623  0.5402
 C:covar.t     1  55   0.661  0.4197
 R:covar.t     2  55   0.513  0.6013
 C:R:covar.t   2  55   0.202  0.8173

In cases where covar must be transformed before fitting, would you recommend cov.reduce = \(x) mean(x) + c(-0.1, 0.1) as the default way to get the full ANCOVA table? When exploring different options for cov.reduce and cov.keep, I saw very different pairs of F and p values for the same factor under the same model.

rvlenth commented 11 months ago

If you transform first and save it as a predictor, then you are talking about a different model, with covar.t as a predictor instead of covar. The joint tests are based, respectively, on a symmetric interval on the covar scale or a symmetric interval on the covar.t scale, and since the transformation is nonlinear, a symmetric interval on one scale is not symmetric on the other.
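
A minimal numeric sketch of that asymmetry (the center 0.2 and offset 0.1 are arbitrary illustration values, not from this thread):

> # a symmetric interval about 0.2 on the covar scale,
> # mapped through asin(sqrt()) and centered on the mapped mean
> asin(sqrt(0.2 + c(-0.1, 0.1))) - asin(sqrt(0.2))
[1] -0.1418971  0.1159930

The two offsets differ in magnitude, so an interval that is symmetric on the covar scale is asymmetric on the covar.t scale.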

I do not have a recommendation as to which is right or which is preferable. As is discussed on the help page, covariates create complications, and there are infinitely many possible results for joint_tests() depending on what intervals you choose for covar (or covar.t). Precisely because this is such a point of confusion, it is really difficult to explain what you are testing with joint_tests(), so you are probably better off not using it at all.
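
For instance, two equally defensible reductions of covar yield two different sets of joint tests from the same fitted model (a sketch only; output omitted, since the values depend on the data):

joint_tests(mod, cov.reduce = \(x) mean(x) + c(-0.1, 0.1))      # symmetric interval about the mean
joint_tests(mod, cov.reduce = \(x) quantile(x, c(0.25, 0.75)))  # interquartile endpoints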

What may be more meaningful to you might be results like this:

> joint_tests(mod, at = list(covar = c(0.25, 0.5, 0.75)), by = "covar")
covar = 0.25:
 model term df1 df2 F.ratio p.value
 site         5  55   1.038  0.4049
 C            1  55   0.041  0.8394
 R            2  55   4.804  0.0119
 C:R          2  55   0.562  0.5732

covar = 0.50:
 model term df1 df2 F.ratio p.value
 site         5  55   1.038  0.4049
 C            1  55   0.934  0.3380
 R            2  55   4.514  0.0153
 C:R          2  55   0.616  0.5439

covar = 0.75:
 model term df1 df2 F.ratio p.value
 site         5  55   1.038  0.4049
 C            1  55   1.311  0.2572
 R            2  55   1.566  0.2181
 C:R          2  55   0.365  0.6961

which gives you three different anova tables for just the factors, depending on which covariate value is used. Those tests are based on easy-to-understand hypotheses, formulated for those specific covariate values.

hnguyen19 commented 10 months ago

Thank you for the details. I was looking at the impact of covar on resp and wanted to see how the F value associated with covar compared to those of the categorical variables, and then decide whether to use covar in predicting resp.

For cases where covar is transformed with log(), which is also nonlinear, the 0 df does not occur. Is it inappropriate to call joint_tests() without specifying the levels of covar?

mod5 <- lm(sqrt(resp) ~ site + C * R * log(covar + 0.01), data = dummy)
joint_tests(mod5)
 model term df1 df2 F.ratio p.value
 site         5  55   1.152  0.3445
 C            1  55   1.579  0.2142
 R            2  55   1.797  0.1755
 covar        1  55   0.040  0.8416
 C:R          2  55   0.671  0.5155
 C:covar      1  55   1.033  0.3139
 R:covar      2  55   0.060  0.9415
 C:R:covar    2  55   0.665  0.5186

In this example covar evidently should not be used, but in my real data set, covar had the biggest impact on resp.

rvlenth commented 10 months ago

You are asking whether joint_tests() should be used for model selection, and the answer is decidedly no. That function, and indeed the entire emmeans package, is designed to perform post hoc analyses of an existing model. None of it is designed for model selection. And I would never use type-3 tests for model selection anyway. You should use type-1 tests (anova()), type-2 tests (car::Anova()), or other methods designed for selecting models.
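
For example, a minimal sketch of those alternatives applied to the mod5 fit shown earlier in this thread (the car package is assumed to be installed):

anova(mod5)                  # type-1 (sequential) tests
car::Anova(mod5, type = 2)   # type-2 tests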

This series of questions seems to be following a meandering path. We (or at least I) have established that this is not a bug, and now we're not even talking about the same transformation.

rvlenth commented 10 months ago

It looks like this thread is completed, so I am closing it.

hnguyen19 commented 10 months ago

Thanks for the corrections. I appreciate your help.