rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
341 stars 30 forks source link

Using survreg model with a strata() term gives the error "Non-conformable elements in reference grid." #429

Closed Chris-Angell closed 11 months ago

Chris-Angell commented 1 year ago

Describe the bug

Trying to extract emmeans from a survreg model with with a strata() term fails, and gives the error "Non-conformable elements in reference grid."

To reproduce

library(survival)
model1 <- survreg(Surv(futime, fustat) ~ age + strata(rx), dist= "weibull", data=ovarian)
emmeans(model1, specs = "age")

A similar model with "rx" as a regular factor works fine with emmeans:

model2 <- survreg(Surv(futime, fustat) ~ age + rx, dist= "weibull", data=ovarian)
emmeans(model2, specs = "age")

Expected behavior

I expected emmeans() to produce an output for the emmean of age for model1 as it does for model2.

Additional context

I tried to find a workaround by manually asking ref_grid() to look at just the non-strata terms in the model. I tried ref_grid(model1, terms = drop.terms(terms(model1),3)) and got the same "Non-conformable elements" error.

I tried asking emmeans() and ref_grid() to ignore the strata values with params = but the syntax for knots from the examples in the vignette and online is not the same as for strata() and I couldn't get it to work.

Thanks for your help in advance!

rvlenth commented 1 year ago

Hmmm. Well, this happens because strata(rx) is included as a predictor, but is not accounted for in coef(model1). Which seems to be what you discerned as well, though there is no terms argument available in ref_grid(), so your code in "additional context" will not make any difference.

The puzzle is that my code for emm_basis.survreg contains this mysterious comment

    # I'm gonna delete any terms involving cluster(), or frailty() -- keep strata()

which is the crux of the problem. If I also delete strata() terms, it works fine:

> emmeans(model1, "age", at = list(age=c(50,55,60,65)))

 age emmean    SE df lower.CL upper.CL
  50   7.71 0.351 22     6.98     8.43
  55   7.19 0.240 22     6.69     7.69
  60   6.67 0.174 22     6.31     7.03
  65   6.16 0.202 22     5.74     6.57

Results are given on the log (not the response) scale. 
Confidence level used: 0.95 

It seems it is correct to also delete strata() terms. But I can't remember why I wrote that comment and made that choice some eons ago. So I wonder if something from the long-ago past is going to come back and bite me. We'll see...

I'll be pushing up a corrected version soon.

rvlenth commented 1 year ago

Here's a question for you... What does it mean if I interact one of thse strata() or cluster() or frailty()) terms with other predictors? For example,

> model2 = survreg(formula = Surv(futime, fustat) ~ age * strata(rx), data = ovarian, 
+                  dist = "weibull")
> coef(model2)
       (Intercept)                age age:strata(rx)rx=2 
      11.512555818       -0.086210384        0.007642672 

This model does not play well with emmeans. Is this model sensible in practice? And if so, why do we have a fixed effect for age:strata(rx) but not a main effect for strata(rx)?

Chris-Angell commented 1 year ago

I don't have a clue what survreg is doing there, and I'm surprised that it lets you fit interactions.

I'm not familiar with cluster(). frailty() terms are random factors, which are now deprecated from coxph() in favor of coxme() and which no longer are permitted in survreg() models at all (see the survival v3.2-9 changelog).

strata() tells survreg() to fit different values for the second parameter of the survival distribution (which survreg calls "scale") according to the levels of a factor you specify.

I would have figured survreg would have treated strata(X) * Y as equivalent to strata(X:Y), but i played around with it and that's clearly not what it's doing. It appears that your model2 (with age * strata(rx)) is equivalent to age:as.factor(rx) + strata(rx). See the below model summaries:

Call:
survreg(formula = Surv(futime, fustat) ~ age * strata(rx), data = ovarian, 
    dist = "weibull")
                      Value Std. Error     z       p
(Intercept)        11.51256    1.61275  7.14 9.4e-13
age                -0.08621    0.02458 -3.51 0.00045
age:strata(rx)rx=2  0.00764    0.00630  1.21 0.22477
rx=1               -0.46142    0.33856 -1.36 0.17292
rx=2               -0.73126    0.32295 -2.26 0.02355

Scale:
 rx=1  rx=2 
0.630 0.481 

Weibull distribution
Loglik(model)= -88.8   Loglik(intercept only)= -97.1
    Chisq= 16.57 on 2 degrees of freedom, p= 0.00025 
Number of Newton-Raphson Iterations: 6 
n= 26 
Call:
survreg(formula = Surv(futime, fustat) ~ age:as.factor(rx) + 
    strata(rx), data = ovarian, dist = "weibull")
                     Value Std. Error     z       p
(Intercept)        11.5126     1.6127  7.14 9.4e-13
age:as.factor(rx)1 -0.0862     0.0246 -3.51 0.00045
age:as.factor(rx)2 -0.0786     0.0279 -2.82 0.00487
rx=1               -0.4614     0.3386 -1.36 0.17292
rx=2               -0.7313     0.3229 -2.26 0.02355

Scale:
 rx=1  rx=2 
0.630 0.481 

Weibull distribution
Loglik(model)= -88.8   Loglik(intercept only)= -97.1
    Chisq= 16.57 on 2 degrees of freedom, p= 0.00025 
Number of Newton-Raphson Iterations: 6 
n= 26 

Why it's like that escapes me.

rvlenth commented 1 year ago

Well, strata() is just a function, and I think survreg() just uses the code in stats. Those two models generate pretty similar model matrices...

> library(survival)

> head(model.matrix(~ age * strata(rx), data = ovarian))
  (Intercept)     age strata(rx)rx=2 age:strata(rx)rx=2
1           1 72.3315              0             0.0000
2           1 74.4932              0             0.0000
3           1 66.4658              0             0.0000
4           1 53.3644              1            53.3644
5           1 50.3397              0             0.0000
6           1 56.4301              0             0.0000

> head(model.matrix(~ ~ age:as.factor(rx) + strata(rx), data = ovarian))
  (Intercept) strata(rx)rx=2 age:as.factor(rx)1 age:as.factor(rx)2
1           1              0            72.3315             0.0000
2           1              0            74.4932             0.0000
3           1              0            66.4658             0.0000
4           1              1             0.0000            53.3644
5           1              0            50.3397             0.0000
6           1              0            56.4301             0.0000

And the documentation for strata() says it's basically the same as interaction() which is pretty trivial with just one factor:

> with(ovarian, strata(rx))
 [1] rx=1 rx=1 rx=1 rx=2 rx=1 rx=1 rx=2 rx=2 rx=1 rx=2 rx=1 rx=2 rx=2 rx=2 rx=1 rx=1 rx=1 rx=1 rx=2
[20] rx=2 rx=2 rx=1 rx=1 rx=2 rx=2 rx=2
Levels: rx=1 rx=2

So maybe survreg() just takes what you give it and splits off the strata() part. And just doesn't anticipate that the user would specify a nonsense model where strata() interacts with something.

Anyway, I'm not too worried that a user will specify a model like mine, so will just let it slide.

rvlenth commented 11 months ago

Am closing this issue, as I think it is resolved

rvlenth commented 3 months ago

I changed it again... Please see comment in #473