kogalur / randomForestSRC

DOCUMENTATION:
https://www.randomforestsrc.org/
GNU General Public License v3.0

survival analysis and cindex calculation #386

Open JieyingJiao opened 1 year ago

JieyingJiao commented 1 year ago

Hi,

I'm using rfsrc to fit a survival model, but the c-index calculation is very slow, which I think also makes the performance and VIMP calculations take a very long time.

  1. Is there any way to change the built-in c-index calculation function? I did a simple comparison between get.cindex() and a self-defined function based on survival::survConcordance, and found the latter to be faster.
  2. If I want to calculate err.rate, is the c-index always used for survival analysis? I know it's suggested to turn this off with perf.type = 'none' and use get.brier.survival, but the Brier score only gives me overall performance, not the cumulative error by tree. I want to check how the error converges with the number of trees. Is there any other way to do this?
  3. I saw that VIMP is calculated by permutation importance, which requires calculating the error rate. Does this again call the c-index calculation for survival analysis? I'm asking because VIMP takes a very long time to calculate.

Thanks a lot for the help.

Best, Jieying

ishwaran commented 1 year ago

You should turn off all performance calculations during training using the option perf.type="none" and then extract what you want later using the predict function. The key is that predict has the option get.tree, which lets you pull single trees or ensembles of trees and then extract information from them, either using built-in values or by applying external functions.

Here's an example of pulling the C-error rate from the first 10 trees when performance is off during training.

library(randomForestSRC)
data(pbc, package = "randomForestSRC")
o <- rfsrc(Surv(days, status) ~ ., pbc, perf.type = "none")
predict(o,get.tree=1:10,block.size=10)$err.rate[10]
[1] 0.1942955
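
The same OOB ensemble can also be scored with an external function. The following is a minimal sketch, not the package's own method. It assumes the survival package is installed and that predicted.oob is a mortality-type risk score (higher means worse prognosis), which is why reverse = TRUE is used. The choice of ntree = 100 and the helper data frame d are illustrative only.

```r
library(randomForestSRC)
library(survival)

data(pbc, package = "randomForestSRC")
o <- rfsrc(Surv(days, status) ~ ., pbc, ntree = 100, perf.type = "none")

# OOB ensemble built from the first 10 trees only
p <- predict(o, get.tree = 1:10)

# Pair the OOB predictions with the outcomes used to grow the forest.
# Cases that were never out-of-bag in these 10 trees may have no OOB
# prediction, so incomplete rows are dropped (hypothetical helper frame d).
d <- data.frame(time = o$yvar[, 1], status = o$yvar[, 2], risk = p$predicted.oob)
d <- d[complete.cases(d), ]

# External concordance; reverse = TRUE because higher risk should mean shorter survival
cfit <- concordance(Surv(time, status) ~ risk, data = d, reverse = TRUE)
err_ext <- 1 - cfit$concordance
err_ext
```

Tie handling may differ between implementations, so this value need not match the built-in err.rate exactly.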

Here we get the cumulative error rate over the first 10 trees:

predict(o,get.tree=1:10,block.size=1)$err.rate[1:10]
[1] 0.2714395 0.2423318 0.2383721 0.2276816 0.2314720 0.2241849 0.2209717 0.2072547 0.1982001 0.1942955
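
To check convergence over the whole forest, the same call can be extended to all trees and plotted. This is a sketch under the assumption that a small forest (ntree = 100, chosen here only to keep the run short) is acceptable; it follows the get.tree and block.size pattern shown above.

```r
library(randomForestSRC)

data(pbc, package = "randomForestSRC")
o <- rfsrc(Surv(days, status) ~ ., pbc, ntree = 100, perf.type = "none")

# Cumulative OOB error rate after each tree (block.size = 1 evaluates after every tree)
err <- predict(o, get.tree = 1:o$ntree, block.size = 1)$err.rate

# Error rate against number of trees
plot(seq_along(err), err, type = "l",
     xlab = "Number of trees", ylab = "OOB error rate (1 - C)")
```

Because block.size = 1 recomputes the C-index after every tree, this is the most expensive setting; a larger block.size evaluates the error less often.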

If you want to use the Brier score, then here's an example where we extract the OOB ensemble made up of the first 10 trees and then apply the pre-built function to it:

p <- get.brier.survival(predict(o,get.tree=1:10))
plot(p$brier.score,type="l")

For your last question (#3), unfortunately the C-index is the only available metric for survival, so all downstream performance values (like VIMP) are based on the C-index.

JieyingJiao commented 1 year ago

Thanks a lot for the response. For the last question about VIMP, I think VIMP is also turned off when using perf.type = 'none'. Is there a way to calculate VIMP after fitting the model with performance turned off, using a self-defined external function for the c-index? I guess the functions vimp() and subsample() only work for a model object that has VIMP turned on.

ishwaran commented 1 year ago

Yes, you can retrieve VIMP with the predict function and its get.tree option. See the help file, which has a number of examples illustrating this.
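
A minimal sketch of what this could look like, assuming the importance option of predict behaves as in the help file examples; ntree = 100 is an illustrative choice. Note that this still relies on the built-in C-index, so it does not plug in an external metric.

```r
library(randomForestSRC)

data(pbc, package = "randomForestSRC")
o <- rfsrc(Surv(days, status) ~ ., pbc, ntree = 100, perf.type = "none")

# VIMP for the full forest, requested at prediction time
vmp <- predict(o, importance = TRUE)$importance
sort(vmp, decreasing = TRUE)

# VIMP restricted to the ensemble made up of the first 10 trees
vmp10 <- predict(o, get.tree = 1:10, importance = TRUE)$importance
```

Passing a single tree index to get.tree in the same way would give per-tree VIMP.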