harrelfe / rms

Regression Modeling Strategies
https://hbiostat.org/R/rms
Other
175 stars 49 forks source link

Predicting from cph model with categorical variables interacting with strat term #104

Open colinorourke opened 3 years ago

colinorourke commented 3 years ago

There seems to be an issue with making predictions from a model that contains interactions between a strat term and a categorical factor. To demonstrate, I'm going to use the example from the cph help file.

library(rms) 

#example from cph help page
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
#> Error in label(age) <- "Age": could not find function "label<-"
sex <- factor(sample(c('Male','Female'), n, 
                     rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
#> Error in label(dt) <- "Follow-up Time": could not find function "label<-"
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
#> Error in UseMethod("units<-"): no applicable method for 'units<-' applied to an object of class "c('double', 'numeric')"
dd <- datadist(age, sex)
#> Error in datadist(age, sex): could not find function "datadist"
options(datadist='dd')
S <- Surv(dt,e)
#> Error in Surv(dt, e): could not find function "Surv"

# categorize age to demonstrate issue
age2 = cut2(age, g = 3)
#> Error in cut2(age, g = 3): could not find function "cut2"

# fit model using age categories
f = cph(S ~ age2*strat(sex), x = TRUE)
#> Error in cph(S ~ age2 * strat(sex), x = TRUE): could not find function "cph"

Created on 2021-07-22 by the reprex package (v1.0.0)

Just to demonstrate the issue I'm having, I've created quantile-based age categories instead of using continuous age, then fit a model that has age category as a predictor which is interacted with sex as a stratification factor.

Now, when I try to make a prediction from this model I get an error.

#try to make prediction from model object f
predict(f, newdata = data.frame(age2 = "[43.6,54.6)", sex="Male"))
#> Error in matxv(X, coeff, kint = kint): columns in a (5) must be <= length of b (4)

Created on 2021-07-22 by the reprex package (v1.0.0)

This seems odds, but asking instead for the X matrix for those predictor values I think we see why the issue arises.

#try to get X matrix for object f
predict(f, newdata = data.frame(age2 = "[43.6,54.6)", sex="Male"), type = "x")
#>   age2[43.6,54.6) age2[54.6,98.9] age2[11.3,43.6):strat(sex)Male
#> 1               1               0                              0
#>   age2[43.6,54.6):strat(sex)Male age2[54.6,98.9]:strat(sex)Male
#> 1                              1                              0
#> attr(,"strata")
#> [1] sex=Male
#> Levels: sex=Female sex=Male

Created on 2021-07-22 by the reprex package (v1.0.0)

It looks like what's happening is that an extra column is sneaking into X called age2[11.3,43.6):strat(sex)Male which wouldn't be represented by a beta coeficient in the Cox PH model.

harrelfe commented 3 years ago

Please provide a reproducible example that does not throw any errors before predict(), i.e., where you properly require(rms) or library(rms) at the beginning of the code R can find label<- and units<-.

colinorourke commented 3 years ago

Was so excited to see reprex working I missed the errors. Sorry! Updated now with fully working code.

library(rms) 
#> Loading required package: Hmisc
#> Loading required package: lattice
#> Loading required package: survival
#> Loading required package: Formula
#> Loading required package: ggplot2
#> 
#> Attaching package: 'Hmisc'
#> The following objects are masked from 'package:base':
#> 
#>     format.pval, units
#> Loading required package: SparseM
#> 
#> Attaching package: 'SparseM'
#> The following object is masked from 'package:base':
#> 
#>     backsolve

#example from cph help page
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, 
                     rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
dd <- datadist(age, sex)
options(datadist='dd')
S <- Surv(dt,e)

# categorize age to demonstrate issue
age2 = cut2(age, g = 3)

# fit model using age categories
f = cph(S ~ age2*strat(sex), x = TRUE)

#try to make prediction from model object f
predict(f, newdata = data.frame(age2 = "[43.6,54.6)", sex="Male"))
#> Error in matxv(X, coeff, kint = kint): columns in a (5) must be <= length of b (4)

#try to get X matrix for object f
predict(f, newdata = data.frame(age2 = "[43.6,54.6)", sex="Male"), type = "x")
#>   age2[43.6,54.6) age2[54.6,98.9] age2[11.3,43.6):strat(sex)Male
#> 1               1               0                              0
#>   age2[43.6,54.6):strat(sex)Male age2[54.6,98.9]:strat(sex)Male
#> 1                              1                              0
#> attr(,"strata")
#> [1] sex=Male
#> Levels: sex=Female sex=Male

Created on 2021-09-08 by the reprex package (v1.0.0)

bdeonovic commented 2 weeks ago

Bump

such predictions fail with base coxph as well so I'm guessing this particular use case has not been accounted for in either codebase.