@topepo I am now working on this issue (I sent you emails before), and I have one question regarding evalSummaryFunction(). The current version of evalSummaryFunction() handles y as a vector; with a "Survival" model, a Surv object will be used instead. Do we have to write code to handle the Surv object in this function, or do we not need evalSummaryFunction() at all for "Survival" models? Thanks.
Apologies in advance for a long reply.
I don't think that this will be an issue even though a Surv object doesn't test as a vector:
> library(survival)
> surv_obj <- with(lung, Surv(time, status))
> is.vector(surv_obj)
[1] FALSE
That said, we will need to add specific code in some areas. For example, evalSummaryFunction uses length(y), which will be 2 times the number of data points for this type of object. In this case, new code will need to be added to get around this, such as going from
testOutput <- data.frame(pred = sample(y, min(10, length(y))),
                         obs = sample(y, min(10, length(y))))
to
num_data <- if (is.Surv(y)) nrow(y) else length(y)
testOutput <- data.frame(pred = sample(y, min(10, num_data)),
                         obs = sample(y, min(10, num_data)))
Similarly, there are a few cases in train.default where this is an issue:
metric = ifelse(is.factor(y), "Accuracy", "RMSE")
and
modelType <- if (is.factor(y)) "Classification" else "Regression"
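A minimal sketch of how those two spots might gain a survival branch, assuming is.Surv() from the survival package ("C-index" is just a placeholder metric name):
library(survival)
# hypothetical survival-aware versions of the two checks above
metric <- if (is.Surv(y)) "C-index" else ifelse(is.factor(y), "Accuracy", "RMSE")
modelType <- if (is.Surv(y)) "Survival" else if (is.factor(y)) "Classification" else "Regression"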
Two other things that this made me think of:
1) What measures of performance should we use for these models? For example, postResample and defaultSummary use Accuracy/Kappa or RMSE/R^2 for classification and regression models. It has been a while since I've thought about accuracy/performance measures for non-Cox survival models, but I know there are a few. Unless you have recommendations (which I'd love to hear), there is always the concordance statistic in the survival package. For example:
set.seed(1)
in_train <- sample(1:nrow(lung), 50)
training <- lung[ in_train, ]
testing  <- lung[-in_train, ]
testing_surv <- Surv(testing$time, testing$status)
options(na.action = na.exclude)
# fit a parametric survival model on the training set, then predict the holdouts
fit <- survreg(Surv(time, status) ~ ph.ecog + age + sex, data = training)
testing_pred <- predict(fit, testing)
survConcordance.fit(y = testing_surv, x = testing_pred)
However, I'm not sure why survConcordance.fit doesn't compute the concordance statistic the way that survConcordance does (maybe it needs the df).
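For comparison, the formula interface computes the statistic directly (a sketch reusing the objects above):
survConcordance(testing_surv ~ testing_pred)$concordance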
2) How do we effectively resample? We want to make sure that the censored data do not all end up in the training or holdout sets. By default, createDataPartition tries to keep the outcome distribution similar between the training and test sets by stratifying the data. One way we might do this is to follow the existing procedure for regression by temporarily binning the data into percentiles, but add an additional bin for censored data to make sure that those are equally distributed too (see the sketch below).
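A rough sketch of that binning idea on the lung data (the quartile bins and p = 0.75 are arbitrary choices, not caret internals):
library(caret)
library(survival)
times <- lung$time
event <- lung$status == 2   # in lung, status 2 = dead and 1 = censored
# bin the observed times into quartiles...
strata <- as.character(cut(times, quantile(times, seq(0, 1, 0.25)), include.lowest = TRUE))
# ...then put the censored observations in their own bin
strata[!event] <- "censored"
set.seed(2)
in_train <- createDataPartition(factor(strata), p = 0.75, list = FALSE)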
Max
It seems that someone has already figured #2 out: Evaluating Random Forests for Survival Analysis Using Prediction Error Curves
Thanks for the paper, @topepo
From the help file for survConcordance: "The survConcordance.fit function computes the result but does no data checking whatsoever. It is intended as a hook for other packages that wish to compute concordance, and the data has already been assembled and verified." So survConcordance.fit doesn't check whether the data are stratified; one may do other work based on the survConcordance.fit() results.
I think concordance should be one of the default measures of performance, along with maybe the Brier score suggested in the paper you linked.
I would also go for concordance as the default and add the Brier score as an option. There is also the c-index, which is quite common, and the tAUC, implemented in the timeROC package. randomSurvivalForest is deprecated, but the party package also supports survival objects. I don't know what happened to the survnnet package.
I've made some notes and added them to https://github.com/topepo/caret/tree/master/docs. Please comment/edit as you see fit.
Has any progress been made on this? I would be willing to make some contributions if not
There has not, so go crazy.
Alright, I had a quick read of your notes on the topic, I understand well enough to get started. Do you have a style guide/contribution guidelines or should I just try to match what's already there?
I would do your best to match. The style has "evolved" over the last 13 years and isn't entirely consistent.
I will focus on glmnet as a "case study" of sorts as it's the one I'm most familiar with, and once the framework is in place it should be straightforward to add more.
Other than the changes mentioned above to train and the performance/summary functions, I also need to change the model data (i.e., in this case, models/files/glmnet.R). Are there any other areas to look out for?
In general I'll search for any checks in the package for model type being "Regression" or "Classification" but I may not catch it all.
I would note that there is a length.Surv function in survival, so length is a non-issue.
Also, regarding the different predictions (predicted hazard vs. predicted survival time), I think it would be best to separate these into different summary functions, using one as the default depending on model type. Even with Cox regression, people are often more interested in predicted survival time.
Hi @topepo, some thoughts following a start at implementation:
Fundamentally, the major "engineering" problem here is Q1 as you posed it: "How should the user choose the problem type for survival models?". I would say the cleanest way is to add an argument to train, e.g. outcome = c("hazard", "time"). This would be ignored for all Regression and Classification types, and it could be part of each model's data as to whether they support that outcome type (a hypothetical call is sketched below).
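A hypothetical call, reusing the earlier lung training set (the outcome argument is the proposal, not an existing one):
train(Surv(time, status) ~ ph.ecog + age + sex,
      data = training,
      method = "glmnet",
      outcome = "hazard")   # proposed; ignored for Classification/Regression models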
The alternative would be to expose the modelType argument, but this would require a lot of refactoring and would be unnecessary for the existing Classification and Regression functionality.
For defaultSummary predicting time as an outcome, would the existing measures of accuracy for non-censored regression be sufficient? I am not familiar with parametric survival models and how they are appraised.
When predicting hazard, Uno's censoring-adjusted C-statistic (survAUC::UnoC) requires the observed survival, the predicted hazard, and the original/training Surv object. This would mean altering the interface of the Summary functions; would it be worthwhile?
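To make the interface problem concrete, a sketch of the call as I read the survAUC documentation (lp_pred is a hypothetical stand-in for the predicted risk scores):
library(survAUC)
train_surv <- with(training, Surv(time, status))
# needs the *training* Surv object to estimate the censoring distribution,
# plus the test Surv object and the predictions
UnoC(Surv.rsp = train_surv, Surv.rsp.new = testing_surv, lpnew = lp_pred)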
Hope I'm not catching you at a busy time, but I'd like your opinion on these issues before pushing forward. Thanks!
The more I look into this, the more I am inclined to write something for the newer packages that I'm working on (at least to scope things out before proceeding to caret).
> For defaultSummary predicting time as an outcome, would the existing measures of accuracy for non-censored regression be sufficient? I am not familiar with parametric survival models and how they are appraised.
I'll look into that a bit. Certainly things like Spearman's rank correlation are an option.
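As a rough sketch, reusing the lung example from earlier in the thread (events only, which is the obvious limitation):
events <- testing$status == 2   # lung coding: 2 = dead
cor(testing$time[events], testing_pred[events], method = "spearman", use = "complete.obs")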
> When predicting hazard, Uno's censoring-adjusted C-statistic (survAUC::UnoC) requires the observed survival, the predicted hazard, and the original/training Surv object. This would mean altering the interface of the Summary functions; would it be worthwhile?
Regardless, I will be writing a new package for measuring performance for different models, and the performance metrics might be better implemented there since it is a clean slate and the API won't be as complex.
Would you be interested in initially working on the performance metric part? My first thought would be to take a few examples and write some R Markdown documents that go through the analysis of those data, so that all the issues and APIs can be better scoped out.
That makes sense; from my initial work, the Summary functions are where the real changes will come in, and I don't think adding more complexity to caret itself is going to be beneficial in the long run.
Sure, I'll put together some documents over the coming weeks showing the different options and their availability/interfaces, similar to what you've put together previously. That should at least provide a foundation for the API.
Somers' D statistic is a good candidate for time outcomes.
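For what it's worth, Somers' D is a simple transform of the concordance, so it comes almost for free (a sketch using the lung objects from earlier in the thread):
# Somers' Dxy = 2 * C - 1
2 * survConcordance(testing_surv ~ testing_pred)$concordance - 1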
I think the "ranger" package has made some very good progress on many of the issues raised here (reg interface, evaluation metrics etc).
@Alanocallaghan do you still work on this? I don't have much experience with package development but would love to help with research etc.
It's been a while, but I think rank correlation and concordance/C/D statistics are the way to go. I think I have some R Markdown documents somewhere showing the use of some of these, which I can try to publish in the coming days. I'd welcome any help on the research side, though, which would basically entail trying to find a range of survival model types and showing the interface for each. I was also directed by Cross Validated to probability plotting, though I don't know if a formal measure can be easily obtained. I'm most definitely not an expert on parametric survival models.
In terms of implementation, I would need to have a look at yardstick before going any further, if the code is likely to be put there rather than in caret. I have a minimal working implementation for glmnet in caret but haven't gone any further since.
Looking back on this, I think that concordance statistics are the way to go. While rank correlations between all values are valid, they don't account for censoring. If censored values are ignored, concordance is equivalent to Kendall's tau. Further, since concordance/Kendall's tau is concerned only with the ordinal relationship, it is valid to use concordance for both hazard and time predictions, provided one inverts the order of the time predictions (see the sketch below).
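As a concrete illustration of the inversion, reusing the survreg example from earlier in the thread (where larger predictions mean longer predicted survival, hence lower risk; survConcordance treats its covariate as a risk score, as far as I can tell):
# negate the predicted times so that larger values indicate higher risk;
# I() keeps the negation from being parsed as formula syntax
survConcordance(testing_surv ~ I(-testing_pred))$concordance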
Regarding the prediction of survivor quantiles (mentioned here), I gather that this would be used to transform the problem into a classification problem. I would note that survAUC::AUC.uno appears to implement something similar. Again, if predicted time is used, then we can simply invert the ordering in order to obtain equivalent measures.
The initial implementation should thus calculate concordance, Uno's C, and Uno's AUC based on: the predicted hazard or predicted time, the training Surv object, and the test Surv object.
@topepo - if I were to set up a PR with a working implementation of this for a small number of models (a mix of outcome and hazard prediction), would you consider reviewing? I could then extend this to cover whichever models support censored regression.
I would be happy to add this to yardstick also, as the code to calculate these measures is quite simple. The first-pass implementation would be to take the Surv object for the training data and the vector of predictions (hazard or time), and return the concordance measure, with an option to invert to ensure concordance can be calculated properly for time or hazard predictions. Uno's C and AUC could be added in addition.
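A rough sketch of what that first pass might look like (the function name and interface here are hypothetical, not yardstick's API):
library(survival)
# concordance between a Surv outcome and a vector of predictions;
# invert = TRUE flips time predictions so that larger values mean higher risk
surv_concordance <- function(truth, estimate, invert = FALSE) {
  stopifnot(is.Surv(truth))
  if (invert) estimate <- -estimate
  survConcordance(truth ~ estimate)$concordance
}
# e.g. surv_concordance(testing_surv, testing_pred, invert = TRUE) for time predictions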
A recipe can be used to pass in the training Surv object, so the interface concern for Uno's adjusted measures is moot.
I should add that the pec package has implementations of the c-index as well as the survival Brier score, which I think is also very important for cases where one wants to derive exact survival probabilities.
I would like to contribute to this. Can someone lay out what the outstanding problems or things to be done are?
This is happening in tidymodels; see https://github.com/tidymodels/planning/tree/master/survival-analysis
Some things that would have to change (a sketch follows the list):
- train has a component called modelType that is currently either "Regression" or "Classification". Another option, "Survival", would need to be added, and a search should be made for any error traps or checks on this value.
- An is.Surv check should be added to make sure the outcome is correct.
- The same goes for postResample too.
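For example, a survival-aware postResample might branch like this (a minimal sketch; postResampleSurv is a hypothetical name, not the actual change):
library(survival)
postResampleSurv <- function(pred, obs) {
  if (is.Surv(obs)) {
    # concordance as the survival analogue of Accuracy/RMSE
    c(C = survConcordance(obs ~ pred)$concordance)
  } else {
    caret::postResample(pred, obs)
  }
}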