harrelfe / rms

Regression Modeling Strategies
https://hbiostat.org/R/rms

fit.mult.impute() on model with boolean/logical variables results in error message #63

Open huftis opened 6 years ago

huftis commented 6 years ago

Running fit.mult.impute() on a cph model where (at least) one of the variables is a boolean/logical variable (FALSE/TRUE) results in an error message. Here’s a reprex:

library(rms)

n = 100
d = data.frame(
  time = rexp(n),
  status = rbinom(n, 1, .7),
  age = rnorm(50, 10),
  male = sample(c(FALSE, TRUE, NA), n, replace = TRUE)
)

# Fitting the model works fine
l = cph(Surv(time, status) ~ age + male, data=d)
coef(l)
#>        age       male 
#> 0.24057808 0.08193785

# Imputing the missing values works fine
imp = aregImpute(~time+status+age+male, data=d)

# But fitting the model on the *imputed* data results in an error
l_imp = fit.mult.impute(formula(l), cph, imp, data = d)
#> Error in X[, mmcolnames, drop = FALSE]: subscript out of bounds

The bug occurs in the line X <- X[, mmcolnames, drop = FALSE] in cph(). In this example, when that line runs, the column names of X are c("(Intercept)", "age", "maleTRUE"), while mmcolnames contains c("age", "male"), i.e. male instead of maleTRUE.
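The naming mismatch can be seen with base R alone. The following is a minimal sketch, not the actual cph() code: the toy data values are made up, and mmcolnames is set by hand to the value observed above.

```r
# Toy data: one numeric and one logical predictor (made-up values)
d <- data.frame(
  age  = c(10.2, 9.5, 11.1, 10.7),
  male = c(TRUE, FALSE, TRUE, FALSE)
)

# model.matrix() appends "TRUE" to the name of a logical column
X <- model.matrix(~ age + male, data = d)
colnames(X)
#> [1] "(Intercept)" "age"         "maleTRUE"

# Subsetting by the bare variable name fails, because there is
# no column called "male" in the design matrix
mmcolnames <- c("age", "male")
res <- try(X[, mmcolnames, drop = FALSE], silent = TRUE)
inherits(res, "try-error")
#> [1] TRUE
```

The failed subset raises the same "subscript out of bounds" error as in the reprex.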

If one converts the male variable to a factor before running the imputation and model fitting, everything works fine:

# If the logical value is converted to a factor,
# everything works fine
d$male = factor(d$male) # or as.numeric(d$male)
imp = aregImpute(~time+status+age+male, data=d)
l_imp = fit.mult.impute(formula(l), cph, imp, data = d)
#> 
#> Variance Inflation Factors Due to Imputation:
#> 
#>       age male=TRUE 
#>      1.00      1.15 
#> 
#> Rate of Missing Information:
#> 
#>       age male=TRUE 
#>      0.00      0.13 
#> 
#> d.f. for t-distribution for Tests of Single Coefficients:
#> 
#>          age    male=TRUE 
#> 1.491268e+09 2.376500e+02 
#> 
#> The following fit components were averaged over the 5 model fits:
#> 
#>   linear.predictors means stats center
coef(l_imp)
#>         age   male=TRUE 
#>  0.14837184 -0.01095707

But since cph() works fine with logical variables, I think fit.mult.impute() with a cph() fitter should work fine too.
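Until this is fixed, the workaround above could be applied to every logical column at once. The sketch below uses a hypothetical helper, logicals_to_factors(), which is not part of rms; it only uses base R and assumes the data is an ordinary data frame.

```r
# Hypothetical helper (not part of rms): convert every logical column
# of a data frame to a factor with levels "FALSE" and "TRUE"
logicals_to_factors <- function(df) {
  is_lgl <- vapply(df, is.logical, logical(1))
  for (nm in names(df)[is_lgl]) {
    # Missing values stay NA; both levels are kept even if one is unobserved
    df[[nm]] <- factor(df[[nm]], levels = c(FALSE, TRUE))
  }
  df
}

# Small made-up example
d <- data.frame(
  age  = c(10.2, 9.5, 11.1, 10.7),
  male = c(TRUE, FALSE, NA, TRUE)
)
d2 <- logicals_to_factors(d)
str(d2)
```

The converted data frame can then be passed to aregImpute() and fit.mult.impute() exactly as in the factor example above.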