therneau / survival

Survival package for R
381 stars 104 forks source link

Reference of categorical levels variables when using Step Functions #235

Closed myriamziou closed 6 months ago

myriamziou commented 9 months ago

When defining a step function for Beta(t) for a categorical variable using survSplit() prior to coxph() (as indicated in the section 4.1 of the "Using Time Dependent Covariates and Time Dependent Coefficients in the Cox Model" vignette, the reference group does not seem to respect what was defined in the levels of the factor variable. Here is an example with a simulated small dataset:

df <- data.frame(x=rnorm(50,0,1), y=factor(sample(c(1,2,3),size=100,replace=TRUE)), t_start=rep(0,100), t_stop=round(rnorm(100,365,80)), event=c(rep(1,70),rep(0,30)))

split_df <- survSplit(Surv(t_start,t_stop,event)~.,data=df,cut=c(365),episode="tgroup")

model <- coxph(Surv(t_start,t_stop,event)~x+y:strata(tgroup),data=split_df)

The summary of the model generates that output :

Call:
coxph(formula = Surv(t_start, t_stop, event) ~ x + y:strata(tgroup), 
    data = split_df)

  n= 157, number of events= 70 

                                             coef        exp(coef) se(coef)     z Pr(>|z|)
x                                           0.06957   1.07204  0.12345  0.564    0.573
y1:strata(tgroup)tgroup=1  0.16010   1.17363  0.41782  0.383    0.702
y2:strata(tgroup)tgroup=1 -0.29659   0.74335  0.48379 -0.613    0.540
y3:strata(tgroup)tgroup=1       NA        NA  0.00000     NA       NA
y1:strata(tgroup)tgroup=2 -0.31430   0.73030  0.40524 -0.776    0.438
y2:strata(tgroup)tgroup=2  0.63261   1.88251  0.39839  1.588    0.112
y3:strata(tgroup)tgroup=2       NA        NA  0.00000     NA       NA

                          exp(coef) exp(-coef) lower .95 upper .95
x                            1.0720     0.9328    0.8416     1.366
y1:strata(tgroup)tgroup=1    1.1736     0.8521    0.5175     2.662
y2:strata(tgroup)tgroup=1    0.7433     1.3453    0.2880     1.919
y3:strata(tgroup)tgroup=1        NA         NA        NA        NA
y1:strata(tgroup)tgroup=2    0.7303     1.3693    0.3300     1.616
y2:strata(tgroup)tgroup=2    1.8825     0.5312    0.8622     4.110
y3:strata(tgroup)tgroup=2        NA         NA        NA        NA

Concordance= 0.554  (se = 0.042 )
Likelihood ratio test= 5.8  on 5 df,   p=0.3
Wald test            = 5.97  on 5 df,   p=0.3

where the reference category seems to be y=3 instead of y=1 as per the levels of y. Thank you

therneau commented 6 months ago

You are incorrect. The lm, glm, coxph, and many other routines use the core model.matrix routine to create the X matrix for regression. That routine tries it's best to predict if some columns of X will be redundant, which would create a singular X matrix, and then removes selected ones to avoid that. Which ones it chooses to remove are based on the contrasts.

It does pretty well for models with + and *, but often removes too few columns when the formula has a : (colon), which is what happened here. In that case an NA results for any column which is redundant with those to its left in the resulting X matrix.