rvlenth / emmeans

Estimated marginal means
https://rvlenth.github.io/emmeans/
340 stars 30 forks source link

Emmeans no longer recognizes stratified factors in Survival #473

Closed mdjpe closed 2 months ago

mdjpe commented 3 months ago

I sent the below to Terry Therneau (maintainer of the survival package) and he gave an unhelpful response:

"Why are you asking me about a failure in another package? Ask the author of emmeans."

As an ordinary user, all I know is that script which used to work doesn't work now. Could you take a look?

Survival models with a stratified factor appear to have stopped cooperating with emmeans.

For example, the following model, which was fully functional on R for Windows last year (and was used in my published work), currently produces errors when run in R for Ubuntu (I don't currently have R on a Windows machine, so can't check that):

cphfr <- coxph(Surv(tstart, tstop, death) ~ strata(Region) * Patties + Fume, newBHPsurv, cluster=ColonyGroup)

emm <- emmeans(cphfr, ~ Patties | Region) seem <- summary(pairs(emm, reverse = T), by = NULL, type="response", adjust = "none", infer = c(T, T)) names(seem)[1] <- "Patties effect" pander(seem)

Error in emmeans(cphfr, ~Patties | Region) : No variable named Region in the reference grid In addition: Warning message: In model.matrix.default(trms, m, contrasts.arg = object$contrasts) : variable 'strata(Region)' is absent, its contrast will be ignored

The above definitely worked in:

R version 4.2.2 (2022-10-31 ucrt) Platform: x86_64-w64-mingw32/x64 (64-bit) Running under: Windows 10 x64 (build 19044)

survival_3.4-0 emmeans_1.8.2

And is not working in:

R version 4.3.3 (2024-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS [1] survival_3.5-7 emmeans_1.10.0

rvlenth commented 3 months ago

Terry's response is correct. You can't expect him to solve an issue in another package.

That said, I am not sure how easily I can fix it. But I'll look at it. I do remember some issue with survival not long ago, and I wonder if this is related. I do not have much experience with survival analysis in my own work, and I will be traveling soon; so please be patient. 

rvlenth commented 3 months ago

There was a change in emmeans version 1.8.8 in relation to issue #429. It seems to be exactly the reverse of what you are reporting. Can you review that issue please, and let me know how yours differs?

Note

I edited and deleted from your initial submission my instructions for reporting issues. The idea is for you to use the same sectioning (the prompts that start with ##) and replace the instructions with your comments relevant to those instructions. I would appreciate your following that practice in the future.

rvlenth commented 3 months ago

PS -- I cannot reproduce your issue because I do not have the dataset newBPHsurv. My issue guidelines clearly say that I need a reproducible example. So I will close this issue until you re-open it and either provide the dataset or prepare a similar example using data I can access, e.g. a dataset in the survey package. (Please re-open this issue; do not create a new issue.)

mdjpe commented 3 months ago

I don't see a re-open button, so maybe the comments section will do.

Here is a reproducible example (cox model borrowed from Therneau: "A package for survival analysis in R". My copy is dated Aug 5, 2022.):

require(survival) Loading required package: survival require(emmeans) Loading required package: emmeans

cfit2 <- coxph(Surv(time, status) ~ age + sex + wt.loss + strata(inst), data=lung)

emm <- emmeans(cfit2, ~ inst) Error in emmeans(cfit2, ~inst) : No variable named inst in the reference grid

Specifying strata(inst) instead does not help:

emm <- emmeans(cfit2, ~ strata(inst)) Error in emmeans(cfit2, ~strata(inst)) : No variable named inst in the reference grid

Other variables in the model produce the expected results:

emm <- emmeans(cfit2, ~ wt.loss) emm wt.loss emmean SE df asymp.LCL asymp.UCL 9.78 0.679 0.736 Inf -0.763 2.12

Replacing strata(inst) with inst in the model allows emmeans to be generated (but presumably violates the proportional hazards assumption)

cfit222 <- coxph(Surv(time, status) ~ age + sex + wt.loss + inst, data=lung) emm <- emmeans(cfit222, ~ inst) emm inst emmean SE df asymp.LCL asymp.UCL 11 0.41 0.679 Inf -0.92 1.74

My own case requires an interaction between a treatment and a stratified factor. Emmeans are not available for either the stratified factor or the interacting treatment. For example,

cfit222dd <- coxph(Surv(time, status) ~ strata(sex)*age, data=lung) emm <- emmeans(cfit222dd, ~ sex)

Error in emmeans(cfit222dd, ~sex) : No variable named sex in the reference grid In addition: Warning message: In model.matrix.default(trms, m, contrasts.arg = object$contrasts) : variable 'strata(sex)' is absent, its contrast will be ignored

emm <- emmeans(cfit222dd, ~ age) Warning message: In model.matrix.default(trms, m, contrasts.arg = object$contrasts) : variable 'strata(sex)' is absent, its contrast will be ignored emm Error in X[ii, ii, drop = FALSE] %*% y[ii] : non-conformable arguments

Below is the full output from sessionInfo()

sessionInfo() R version 4.3.3 (2024-02-29) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 22.04.3 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
[5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8
[7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

time zone: America/Winnipeg tzcode source: system (glibc)

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] emmeans_1.10.0 survival_3.5-7

loaded via a namespace (and not attached): [1] compiler_4.3.3 Matrix_1.6-5 estimability_1.4.1 tools_4.3.3
[5] coda_0.19-4.1 mvtnorm_1.2-4 splines_4.3.3 grid_4.3.3
[9] xtable_1.8-4 lattice_0.22-5

mdjpe commented 3 months ago

Hi Dr. Lenth. I cannot find a way to re-open the issue, and discussion on Stack Exchange seems to indicate that I won't be able to beause I am not a collaborator. I have added reproducible examples to the comments section of the closed issue, and I am copying them below.

Michael Peirson

I don't see a re-open button, so maybe the comments section will do.

Here is a reproducible example (cox model borrowed from Therneau: "A package for survival analysis in R". My copy is dated Aug 5, 2022.):

require(survival)
Loading required package: survival
require(emmeans)
Loading required package: emmeans

cfit2 <- coxph(Surv(time, status) ~ age + sex + wt.loss + strata(inst),

data=lung)

emm <- emmeans(cfit2, ~ inst)
Error in emmeans(cfit2, ~inst) :
No variable named inst in the reference grid

Specifying strata(inst) instead does not help:

emm <- emmeans(cfit2, ~ strata(inst))
Error in emmeans(cfit2, ~strata(inst)) :
No variable named inst in the reference grid

Other variables in the model produce the expected results:

emm <- emmeans(cfit2, ~ wt.loss)
emm
wt.loss emmean SE df asymp.LCL asymp.UCL
9.78 0.679 0.736 Inf -0.763 2.12

Replacing strata(inst) with inst in the model allows emmeans to be generated (but presumably violates the proportional hazards assumption)

cfit222 <- coxph(Surv(time, status) ~ age + sex + wt.loss + inst,

data=lung) emm <- emmeans(cfit222, ~ inst) emm inst emmean SE df asymp.LCL asymp.UCL 11 0.41 0.679 Inf -0.92 1.74

My own case requires an interaction between a treatment and a stratified factor. Emmeans are not available for either the stratified factor or the interacting treatment. For example,

cfit222dd <- coxph(Surv(time, status) ~ strata(sex)*age, data=lung)
emm <- emmeans(cfit222dd, ~ sex)

Error in emmeans(cfit222dd, ~sex) : No variable named sex in the reference grid In addition: Warning message: In model.matrix.default(trms, m, contrasts.arg = object$contrasts) : variable 'strata(sex)' is absent, its contrast will be ignored

emm <- emmeans(cfit222dd, ~ age)
Warning message:
In model.matrix.default(trms, m, contrasts.arg = object$contrasts) :
variable 'strata(sex)' is absent, its contrast will be ignored
emm
Error in X[ii, ii, drop = FALSE] %*% y[ii] : non-conformable arguments

Below is the full output from sessionInfo()

sessionInfo()
R version 4.3.3 (2024-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0 LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale: [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8 [5] LC_MONETARY=en_CA.UTF-8 LC_MESSAGES=en_CA.UTF-8 [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C [9] LC_ADDRESS=C LC_TELEPHONE=C [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

time zone: America/Winnipeg tzcode source: system (glibc)

attached base packages: [1] stats graphics grDevices utils datasets methods base

other attached packages: [1] emmeans_1.10.0 survival_3.5-7

loaded via a namespace (and not attached): [1] compiler_4.3.3 Matrix_1.6-5 estimability_1.4.1 tools_4.3.3 [5] coda_0.19-4.1 mvtnorm_1.2-4 splines_4.3.3 grid_4.3.3 [9] xtable_1.8-4 lattice_0.22-5

On Sun, Mar 31, 2024 at 9:18 PM Russell V. Lenth @.***> wrote:

PS -- I cannot reproduce your issue because I do not have the dataset newBPHsurv. My issue guidelines clearly say that I need a reproducible example. So I will close this issue until you re-open it and either provide the dataset or prepare a similar example using data I can access, e.g. a dataset in the survey package. (Please re-open this issue; do not create a new issue.)

— Reply to this email directly, view it on GitHub https://github.com/rvlenth/emmeans/issues/473#issuecomment-2029042966, or unsubscribe https://github.com/notifications/unsubscribe-auth/BHN62H4SCJJK6LUZCQTZPE3Y3C7YZAVCNFSM6AAAAABFQPIYEKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRZGA2DEOJWGY . You are receiving this because you authored the thread.Message ID: @.***>

mdjpe commented 3 months ago

I agree, issue 429 looks like a related problem except using survreg instead of coxph.

On July 10th, in that thread you asked about interactions with stratified factors.

Here's a question for you... What does it mean if I interact one of thse strata() or cluster() or frailty()) terms with other predictors?

I can't say I fully understand what's going on under the hood (sorry, biologist not statistician by training), but I'll take a stab at this. I think the cox model calculates a baseline survival curve for the whole dataset based on all factors being at their first level; the estimates for the other levels of all factors assume that the risk in any given period is proportional to the baseline.

When a factor is stratified, the risk of death is not assumed to be proportional among levels of that factor over time. Separate baseline survival curves are calculated for each level of the stratified factor. The effect of that factor is not estimated because hazards are not assumed to be proportional over time among the levels of the factor. However, estimates for the other factors still apply; to get the correct survival probability at any time you multiply the appropriate baseline risk curve by the effect estimate for the treatment of interest (multiply might be the wrong word; there's definitely an exponent involved somehow but I can't remember the math). The interaction of a treatment with a stratified factor simply allows the effect estimate of the treatment to be different at different levels of the stratified factor; to get the actual survival probability you have to combine the effect estimate with the correct baseline risk curve.

As I write this, it seems to me that it is correct for (using the models from my message sent earlier today)

cfit2 <- coxph(Surv(time, status) ~ age + sex + wt.loss + strata(inst),

data=lung) emm <- emmeans(cfit2, ~ inst)

not to produce a result, because there is no effect estimate for the stratified variable inst.

However, after

cfit222dd <- coxph(Surv(time, status) ~ strata(sex)*age, data=lung)

emm <- emmeans(cfit222dd, ~ age) and emm <- emmeans(cfit222dd, ~ age | sex)

should both work.

I can't remember what survreg does, so not sure how this fits together with the previous issue.

Michael Peirson

On Sun, Mar 31, 2024 at 8:58 PM Russell V. Lenth @.***> wrote:

There was a change in emmeans version 1.8.8 in relation to issue #429 https://github.com/rvlenth/emmeans/issues/429. It seems to be exactly the reverse of what you are reporting. Can youreview that issue please, and let me know how yours differs?

— Reply to this email directly, view it on GitHub https://github.com/rvlenth/emmeans/issues/473#issuecomment-2029026807, or unsubscribe https://github.com/notifications/unsubscribe-auth/BHN62H62RANRBCEOD6NUU6TY3C5MTAVCNFSM6AAAAABFQPIYEKVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDAMRZGAZDMOBQG4 . You are receiving this because you authored the thread.Message ID: @.***>

rvlenth commented 3 months ago

An issue can be reopened via the "Reopen with comment" button just next to the "Comment" button below the comment compose window. Similarly, the issue can be closed with the "Close with comment" button. I think that since you are listed as a "participant" on this issue, I think that you can reopen it; and I think I have experienced this happening with past issue submitters. But maybe I'm wrong.

I asked you to look at issue #429, but you offer no comment on that. To tell the truth, I am surprised that your example worked in the past, due to that very bug. I suppose it has to do with the interaction. And if so (see below), some coeffficients are not available, and it's possible that what seemed to "work" in fact didn't, so that the results obtained are nonsense.

This seems to be a no-win situation. I fixed #429 on July 10, 2023 by disallowing strata factors. Restoring strata factors will possibly fix the problem here, but will revive the bug in #429.

Meanwhile, I've looked at various articles on the web about strata in Cox regression, and several of them say that you cannot do inference about the stratification factors. So that, along with other considerations, suggests to me that it is correct to not allow stratification factors in the reference grid. My intuition suggests that strata are something akin to random block effects in a linear model. This seems to be borne-out by the coefficient of inst not being included in the summary:

> summary(cfit2)
Call:
coxph(formula = Surv(time, status) ~ age + sex + wt.loss + strata(inst), 
    data = lung)

  n= 213, number of events= 151 
   (15 observations deleted due to missingness)

             coef exp(coef)  se(coef)      z Pr(>|z|)
age      0.023535  1.023814  0.010415  2.260  0.02383
sex     -0.516036  0.596882  0.190617 -2.707  0.00679
wt.loss -0.001698  0.998303  0.006848 -0.248  0.80413

        exp(coef) exp(-coef) lower .95 upper .95
age        1.0238     0.9767    1.0031    1.0449
sex        0.5969     1.6754    0.4108    0.8672
wt.loss    0.9983     1.0017    0.9850    1.0118

Concordance= 0.637  (se = 0.034 )
Likelihood ratio test= 14.06  on 3 df,   p=0.003
Wald test            = 13.32  on 3 df,   p=0.004
Score (logrank) test = 13.67  on 3 df,   p=0.003

For the other model, we get

> summary(cfit222dd)
Call:
coxph(formula = Surv(time, status) ~ strata(sex) * age, data = lung)

  n= 228, number of events= 165 

                          coef exp(coef)  se(coef)      z Pr(>|z|)
age                   0.019064  1.019247  0.011353  1.679   0.0931
strata(sex)sex=2:age -0.008353  0.991682  0.019336 -0.432   0.6657

                     exp(coef) exp(-coef) lower .95 upper .95
age                     1.0192     0.9811    0.9968     1.042
strata(sex)sex=2:age    0.9917     1.0084    0.9548     1.030

Concordance= 0.546  (se = 0.026 )
Likelihood ratio test= 3.37  on 2 df,   p=0.2
Wald test            = 3.29  on 2 df,   p=0.2
Score (logrank) test = 3.29  on 2 df,   p=0.2

This summary does include coefficients for the interaction, but not a main effect of strata(sex). But I wonder if this is just an oversight, or if this model makes sense. I don't know.

Adding further confusion is that strata() is just a function, and though its documentation says that it is designed for use with coxph models, we could in fact use it with any model, e.g. one fitted with lm():

> junk <- lm(time ~ age + sex + wt.loss + strata(inst), data = lung)
> emmeans(junk, "sex")
 sex emmean   SE  df lower.CL upper.CL
   1    276 25.3 192      226      326
   2    336 26.3 192      284      388
>    . . .
> junk2 <- lm(time ~ age + sex + wt.loss + inst, data = lung)
> emmeans(junk2, "sex")
 sex emmean   SE  df lower.CL upper.CL
   1    294 19.0 208      256      331
   2    344 23.2 208      298      390

The model junk2 makes some sense, at least mechanically, whereas junk itself is allowed, but doesn't make sense and yields different results. Something like that could explain why I was at first mistaken in allowing stratification factors as factors of inferential interest.

The bottom line

I cannot allow stratification factors as factors of inferential interest in Cox models, because even if it's appropriate to do so in the thinking of some users, we don't have regression coefficients, nor variances and covariances for them. I do not know why it seemed to work in earlier versions, but I am guessing that those results are suspect.

rvlenth commented 3 months ago

OK, here's the deal. I don't know what's right -- even after going through the user manual for survival and looking at all 136 cases where the word "strata" is used (many, to my irritation, being grammatical errors where the author meant "stratum"). So I decided to play along with it. That means allowing any factors in strata in the reference grid but deleting columns of the X matrix if there are no corresponding coefficients. So now, everything runs without error, as demonstrated below. But please note that we can have no main effect of a strata() factor, so if you have strata(x) as an additive term in a model, you get the same results for each level of x -- and if you compare levels of x, you'll get zeros.

The changes I made also side-step the bug observed in #429.

Just keep in mind that I'm not keeping you from doing anything that might be nonsensical -- it's on you to do appropriate analyses.

## What the hell to do about `strata()`?

library(survival)
library(emmeans)

#### Additive model
moda <- coxph(Surv(time, status) ~ wt.loss + age + strata(sex), data = lung)
#### Interaction model
modi <- coxph(Surv(time, status) ~ wt.loss + age*strata(sex), data = lung)
### survreg additive model
sra <- survreg(Surv(time, status) ~ wt.loss + age + strata(sex), data = lung)
### survreg interactive model
sri <- survreg(Surv(time, status) ~ wt.loss + age * strata(sex), data = lung)

at = list(age = c(20,30))

emmeans(moda, ~ age*sex, at = at)
##  age sex emmean    SE  df asymp.LCL asymp.UCL
##   20   1  0.386 0.203 Inf  -0.01142     0.783
##   30   1  0.578 0.296 Inf  -0.00178     1.158
##   20   2  0.386 0.203 Inf  -0.01142     0.783
##   30   2  0.578 0.296 Inf  -0.00178     1.158
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95

emmeans(modi, ~ age*sex, at = at)
##  age sex emmean    SE  df asymp.LCL asymp.UCL
##   20   1  0.459 0.247 Inf   -0.0255     0.943
##   30   1  0.688 0.364 Inf   -0.0259     1.402
##   20   2  0.247 0.332 Inf   -0.4042     0.897
##   30   2  0.369 0.492 Inf   -0.5956     1.335
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95

emmeans(sra, ~ age*sex, at = at)
##  age sex emmean    SE  df lower.CL upper.CL
##   20   1   6.70 0.311 209     6.09     7.32
##   30   1   6.56 0.242 209     6.08     7.04
##   20   2   6.70 0.311 209     6.09     7.32
##   30   2   6.56 0.242 209     6.08     7.04
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95

emmeans(sri, ~ age*sex, at = at)
##  age sex emmean    SE  df lower.CL upper.CL
##   20   1   6.58 0.310 208     5.97     7.20
##   30   1   6.43 0.243 208     5.95     6.91
##   20   2   6.70 0.308 208     6.09     7.31
##   30   2   6.61 0.241 208     6.13     7.08
## 
## Results are given on the log (not the response) scale. 
## Confidence level used: 0.95

Created on 2024-04-04 with reprex v2.0.2

mdjpe commented 2 months ago

Hi, Dr. Lenth. Following your revisions, I have run the script from my previous work and it returns the results that it used to, which is correct according to my understanding. That is, it gives estimates for the non-stratified factor at each level of the stratified factor, and they are the same estimates as previously. Thank you for your help.