sahirbhatnagar / cbpaper

Source code for Casebase paper
Other
2 stars 1 forks source link

glmnet with Surv object Works, getting the Absolute risk back is a process #8

Closed Jesse-Islam closed 5 years ago

Jesse-Islam commented 5 years ago

The following is where I am at while trying to get a direct comparison out of the survival package. I can use glmnet on a Surv object, effectively fitting the hazard with glmnet, however, getting an absolute risk out of it is a less obvious task. There is a line of #'s at the two main points of interest, while the rest is just what is needed.

library(reprex)
#> Warning: package 'reprex' was built under R version 3.6.1
setwd("C:/Users/J/OneDrive - McGill University/github/regcasebase")
#these packages were taken from github on 23 July 2019
library(TCGA2STAT)
#these packages were pulled from bioconductor
library(BiocManager)
library(glmnet)
#> Warning: package 'glmnet' was built under R version 3.6.1
#> Loading required package: Matrix
#> Loading required package: foreach
#> Warning: package 'foreach' was built under R version 3.6.1
#> Loaded glmnet 2.0-18
library(doParallel)
#> Warning: package 'doParallel' was built under R version 3.6.1
#> Loading required package: iterators
#> Loading required package: parallel
library(survival)
library(splines)
library(pacman)
pacman::p_install_gh('sahirbhatnagar/casebase')
#> Skipping install of 'casebase' from a github remote, the SHA1 (aecb816f) has not changed since last install.
#>   Use `force = TRUE` to force installation
#> 
#> The following packages were installed:
#> casebase
library(casebase)
#> See example usage at http://sahirbhatnagar.com/casebase/
#lusc.rnaseq2 <- getTCGA(disease="LUSC", data.type="RNASeq2", clinical=TRUE)
#line above is to directly import, below is to save time
lusc.rnaseq2 <-readRDS('lusc.rnaseq2.rds')
#remove rows with NA
highDimSurvData=na.omit(lusc.rnaseq2$merged.dat)
sam=sample(nrow(highDimSurvData), 494)
partHighDimSurvData=highDimSurvData[sam, ]
y=as.matrix(partHighDimSurvData[,c(2,3)])
x=as.matrix(partHighDimSurvData[,c(4:50)])
#x=as.matrix(partHighDimSurvData[,c(4:300)])
new_data=as.data.frame(t(x[-c(sam), ]))
km <- Surv(time = highDimSurvData$OS, event = highDimSurvData$status)
daty=as.data.frame(y)
x2 <- x[y[,2] >= 0.00001, ]
y2<- y[y[,2] >= 0.00001, ]
u=Surv(time = y2[,2], event = y2[,1])

#####################################################
#The following code produces useable coefficients,
#however no clean way to produce an absolute risk function this way
coxfit=cv.glmnet(x=x2,y=u, family="cox")

#Following code is an attempt to get coxfit to agree with absoluteRisk function
coxfit$originalData <- list("x" = x2,
                      "y" = u)
coxfit$typeEvents <-sort(unique(y2[,1]))
  coxfit$timeVar <- "OS"
  coxfit$eventVar <- "status"
  coxfit$matrix.fit <- TRUE
  coxfit$formula_time <- formula(paste("~ OS",sep=""))
  coxfit$offset<- NULL
  new_data=as.data.frame(t(x2[-c(sam), ]))

#################################################################
  #this is where my code has yet to agree with the absolute Risk function
  abcoxfit=absoluteRisk(coxfit,time = seq(0,500, 1),newdata = new_data)
#> Warning in data.frame(..., check.names = FALSE): row names were found from
#> a short variable and have been discarded
#> Error in newx %*% nbeta: Cholmod error 'X and/or Y have wrong dimensions' at file ../MatrixOps/cholmod_sdmult.c, line 90

Created on 2019-08-11 by the reprex package (v0.3.0) `

sessionInfo()
#> R version 3.6.0 (2019-04-26)
#> Platform: x86_64-w64-mingw32/x64 (64-bit)
#> Running under: Windows 10 x64 (build 17763)
#> 
#> Matrix products: default
#> 
#> locale:
#> [1] LC_COLLATE=English_United States.1252 
#> [2] LC_CTYPE=English_United States.1252   
#> [3] LC_MONETARY=English_United States.1252
#> [4] LC_NUMERIC=C                          
#> [5] LC_TIME=English_United States.1252    
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] compiler_3.6.0  magrittr_1.5    tools_3.6.0     htmltools_0.3.6
#>  [5] yaml_2.2.0      Rcpp_1.0.2      stringi_1.4.3   rmarkdown_1.14 
#>  [9] highr_0.8       knitr_1.23      stringr_1.4.0   xfun_0.8       
#> [13] digest_0.6.20   evaluate_0.14

Created on 2019-08-11 by the reprex package (v0.3.0)

Jesse-Islam commented 5 years ago

The following is my attempt at coding a breslow estimator that can handle any number of variables.. still buggy based on output.

library(survival)
library(casebase)
#> See example usage at http://sahirbhatnagar.com/casebase/
library(glmnet)
#> Warning: package 'glmnet' was built under R version 3.6.1
#> Loading required package: Matrix
#> Loading required package: foreach
#> Warning: package 'foreach' was built under R version 3.6.1
#> Loaded glmnet 2.0-18
#process for creating a manual absolute risk function
data=ERSPC

u=Surv(time = data$Follow.Up.Time, event = data$DeadOfPrCa)

xa=as.data.frame(data[,1])
xa$normal=rnorm(length(xa[,1]),mean=0,sd=6)
xa$wernormal=rnorm(length(xa[,1]),mean=5,sd=5)
x=as.matrix(xa)
coxfit=cv.glmnet(x=x,y=u, family="cox",alpha=0)

coxcoef=coef(coxfit)
betaHat=coxcoef@x
tab <- data.frame(table(data[data$DeadOfPrCa == 1, "Follow.Up.Time"])) 
y <- as.numeric(levels(tab[, 1]))[tab[, 1]] #ordered distinct event times
d <- tab[, 2] 

h0 <- rep(NA, length(y))
for(l in 1:length(y))
{
  h0[l] <- d[l] / sum(exp(x[data$Follow.Up.Time>=y[l],] %*% betaHat))
}
plot(h0,type="l")

Created on 2019-08-11 by the reprex package (v0.3.0)

turgeonmaxime commented 5 years ago

I think we can use the survival::survfit function to get the survival function as follows:

library(survival)
library(glmnet)
#> Loading required package: Matrix
#> Loading required package: foreach
#> Loaded glmnet 2.0-18

load(paste0(find.package("glmnet"), "/data/CoxExample.RData"))
fit <- cv.glmnet(x, y, family = "cox")

nonzero_covariate <- predict(fit, type = "nonzero", s = "lambda.1se")
nonzero_coef <- coef(fit, s = "lambda.1se")

# Fit Cox with selected covariates
fit_cox <- coxph(Surv(time, status) ~ ., 
                 data = as.data.frame(cbind(y, x[,nonzero_covariate$X1])))

# Change coefficients from fit_cox to those of fit
fit_cox2 <- fit_cox
fit_cox2$coefficients <- nonzero_coef@x

# Plot survival functions
plot(survfit(fit_cox))
lines(survfit(fit_cox2), col = 'blue',
      conf.int = FALSE)

Created on 2019-08-13 by the reprex package (v0.3.0)

Essentially, we fit coxnet, we get the selected coefficients and their corresponding estimates. Using only those covariates, we fit a Cox model and then replace the estimated coefficients by their glmnet estimates. In the graph above, we get the Cox regression curve in black (with 95% confidence interval) and the Coxnet curve in blue.

However, we can't really plot the confidence interval because we have the "wrong" covariance matrix.