mclements / rstpm2

An R package for generalised survival models
28 stars 11 forks source link

Unexpected coding of factor variables in aft() #32

Open alexploner opened 1 year ago

alexploner commented 1 year ago

Running the default example from the aft help page yields an estimate for the numerical binary exposure hormon as expected, with the intercept seemingly integrated into the baseline spline term nsx():

> summary(aft(Surv(rectime,censrec==1)~hormon,data=brcancer,df=4))
Maximum likelihood estimation

Call:
bbmle::mle2(minuslogl = negll, start = coef, eval.only = TRUE, 
    vecpar = TRUE, gr = gradient, control = control)

Coefficients:
                                      Estimate Std. Error  z value     Pr(z)    
hormon                                0.285474   0.094189   3.0309  0.002438 ** 
nsx(logtstar, df, intercept = TRUE)1 -0.272989   0.235853  -1.1575  0.247086    
nsx(logtstar, df, intercept = TRUE)2  2.045560   0.214611   9.5315 < 2.2e-16 ***
nsx(logtstar, df, intercept = TRUE)3 -7.387203   0.586909 -12.5866 < 2.2e-16 ***
nsx(logtstar, df, intercept = TRUE)4  4.361118   0.346428  12.5888 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

-2 log L: 5215.435 

However, if I include the hormon-variable as a factor, I get contrast coding with two df / parameters for hormonal exposure:

> brcancer$hormon_factor <- factor(brcancer$hormon)
> summary(aft(Surv(rectime,censrec==1)~hormon_factor,data=brcancer,df=4))
Maximum likelihood estimation

Call:
bbmle::mle2(minuslogl = negll, start = coef, eval.only = TRUE, 
    vecpar = TRUE, gr = gradient, control = control)

Coefficients:
                                      Estimate Std. Error z value     Pr(z)    
hormon_factor0                        0.215612   0.146446  1.4723  0.140939    
hormon_factor1                        0.471117   0.148454  3.1735  0.001506 ** 
nsx(logtstar, df, intercept = TRUE)1 -0.083068   0.212022 -0.3918  0.695212    
nsx(logtstar, df, intercept = TRUE)2  1.806920   0.228551  7.9060 2.658e-15 ***
nsx(logtstar, df, intercept = TRUE)3 -6.229016   0.903396 -6.8951 5.382e-12 ***
nsx(logtstar, df, intercept = TRUE)4  4.041162   0.339188 11.9142 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

-2 log L: 5214.279 

I don't know whether this is intended, but it is not what I would expect when using a standard formula interface in an R modelling function.

Comments:

  1. I assume that doing my own dummy coding for any factor variable included as predictor would do the trick.
  2. Not sure whether this is related, but setting up starting values for the parameters (including removing the intercept), as seen in the aft-code below, seems to suffer from copy & paste:
    lm0.obj <- if (cure) lm(logHhat ~ nsx(logtstar, df, intercept = TRUE, cure = cure) - 1, dataEvents)
              else      lm(logHhat ~ nsx(logtstar, df, intercept = TRUE) - 1, dataEvents)

System info:

> packageVersion("rstpm2")
[1] ‘1.6.2’
> R.version
               _                           
platform       x86_64-pc-linux-gnu         
arch           x86_64                      
os             linux-gnu                   
system         x86_64, linux-gnu           
status                                     
major          4                           
minor          3.1                         
year           2023                        
month          06                          
day            16                          
svn rev        84548                       
language       R                           
version.string R version 4.3.1 (2023-06-16)
nickname       Beagle Scouts    
mclements commented 2 months ago

@alexploner :

Belatedly, the reason that the two models are different is that I thought it was a good idea to include the intercept term in the baseline time function. For rstpm2::stpm2, I used intercept=FALSE for the splines, whereas I use intercept=TRUE for aft().

One consequence is that the factors do not know about the intercept and that we get two intercepts. Interestingly, the model is full rank (does this make sense?), but we get strong correlations:

library(rstpm2)
summary(fit1 <- aft(Surv(rectime,censrec==1)~hormon,data=rstpm2::brcancer,df=4))
summary(fit2 <- aft(Surv(rectime,censrec==1)~factor(hormon),data=rstpm2::brcancer,df=4))
vcov(fit1) |> cov2cor() |> "rownames<-"(NULL) |> "colnames<-"(NULL)
vcov(fit2) |> cov2cor() |> "rownames<-"(NULL) |> "colnames<-"(NULL)

> vcov(fit1) |> cov2cor() |> "rownames<-"(NULL) |> "colnames<-"(NULL)
            [,1]        [,2]       [,3]       [,4]       [,5]
[1,]  1.00000000  0.03381084 -0.2649664  0.4061007 -0.2745472
[2,]  0.03381084  1.00000000  0.6129103 -0.6198557  0.6962407
[3,] -0.26496641  0.61291030  1.0000000 -0.8911680  0.8973204
[4,]  0.40610066 -0.61985566 -0.8911680  1.0000000 -0.9320684
[5,] -0.27454721  0.69624071  0.8973204 -0.9320684  1.0000000
> vcov(fit2) |> cov2cor() |> "rownames<-"(NULL) |> "colnames<-"(NULL)
           [,1]       [,2]       [,3]        [,4]       [,5]       [,6]
[1,]  1.0000000  0.8048079 0.43799608 -0.69524806  0.8652693 -0.5323502
[2,]  0.8048079  1.0000000 0.33004594 -0.66902577  0.8314087 -0.5573719
[3,]  0.4379961  0.3300459 1.00000000  0.03054923  0.1271566  0.2735835
[4,] -0.6952481 -0.6690258 0.03054923  1.00000000 -0.8977969  0.8145789
[5,]  0.8652693  0.8314087 0.12715658 -0.89779689  1.0000000 -0.8025665
[6,] -0.5323502 -0.5573719 0.27358345  0.81457893 -0.8025665  1.0000000

@bakynkozhayev : any idea whether removing the intercept from the AFT baseline will affect your AFT paper?

Sincerely, Mark.