@topepo I am now working on this issue (I sent you emails before), and I have one question regarding evalSummaryFunction(). The current version of evalSummaryFunction() handles y as a vector; with a "Survival" model, a Surv object will be used instead. Do we have to write code to handle the Surv object in this function, or do we not need evalSummaryFunction() at all for "Survival" models? Thanks.
Apologies in advance for a long reply.
I don't think that this will be an issue even though a Surv object doesn't test as a vector:
> library(survival)
> surv_obj <- with(lung, Surv(time, status))
> is.vector(surv_obj)
[1] FALSE
That said, we will need to add specific code in some areas. For example, evalSummaryFunction uses length(y), which will be 2 times the number of data points for this type of object. In this case, new code will need to be added to get around this, such as going from
testOutput <- data.frame(pred = sample(y, min(10, length(y))),
                         obs = sample(y, min(10, length(y))))
to
num_data <- if (is.Surv(y)) nrow(y) else length(y)
testOutput <- data.frame(pred = sample(y, min(10, num_data)),
                         obs = sample(y, min(10, num_data)))
Similarly, there are a few cases in train.default where this is an issue:
metric = ifelse(is.factor(y), "Accuracy", "RMSE")
and
modelType <- if (is.factor(y)) "Classification" else "Regression"
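A minimal sketch of how those two spots might gain a survival branch, assuming is.Surv() from the survival package ("C-index" is just a placeholder metric name):
library(survival)
# hypothetical survival-aware versions of the two checks above
metric <- if (is.Surv(y)) "C-index" else ifelse(is.factor(y), "Accuracy", "RMSE")
modelType <- if (is.Surv(y)) "Survival" else if (is.factor(y)) "Classification" else "Regression"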
Two other things that this made me think of:
1) What measures of performance should we use for these models? For example, postResample and defaultSummary use Accuracy/Kappa or RMSE/R^2 for classification and regression models. It has been a while since I've thought about accuracy/performance measures for non-Cox survival models, but I know there are a few. Unless you have recommendations (which I'd love to hear), there is always the concordance statistic in the survival package. For example:
set.seed(1)
in_train <- sample(1:nrow(lung), 50)
training <- lung[ in_train, ]
testing  <- lung[-in_train, ]
testing_surv <- Surv(testing$time, testing$status)
options(na.action = na.exclude)
# fit a parametric survival model on the training set, then predict the holdouts
fit <- survreg(Surv(time, status) ~ ph.ecog + age + sex, data = training)
testing_pred <- predict(fit, testing)
survConcordance.fit(y = testing_surv, x = testing_pred)
However, I'm not sure why survConcordance.fit doesn't compute the concordance statistic the way that survConcordance does (maybe it needs the df).
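For comparison, the formula interface computes the statistic directly (a sketch reusing the objects above):
survConcordance(testing_surv ~ testing_pred)$concordance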
2) How do we effectively resample? We want to make sure that the censored data do not all end up in the training or holdout sets. By default, createDataPartition tries to keep the outcome distribution similar between the training and test sets by stratifying the data. One way we might do this is to follow the existing procedure for regression by temporarily binning the data into percentiles, but add an additional bin for censored data to make sure that those are equally distributed too (see the sketch below).
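A rough sketch of that binning idea on the lung data (the quartile bins and p = 0.75 are arbitrary choices, not caret internals):
library(caret)
library(survival)
times <- lung$time
event <- lung$status == 2   # in lung, status 2 = dead and 1 = censored
# bin the observed times into quartiles...
strata <- as.character(cut(times, quantile(times, seq(0, 1, 0.25)), include.lowest = TRUE))
# ...then put the censored observations in their own bin
strata[!event] <- "censored"
set.seed(2)
in_train <- createDataPartition(factor(strata), p = 0.75, list = FALSE)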
Max
It seems that someone has already figured #2 out: Evaluating Random Forests for Survival Analysis Using Prediction Error Curves
Thanks for the paper, @topepo
From the help file for survConcordance: "The survConcordance.fit function computes the result but does no data checking whatsoever. It is intended as a hook for other packages that wish to compute concordance, and the data has already been assembled and verified." So survConcordance.fit doesn't check whether the data are stratified; one may do other work based on the survConcordance.fit() results.
I think concordance should be one of the default measures of performance, along with maybe the Brier score suggested in the paper you linked.
I would also go for concordance as the default and add the Brier score as an option. There is also the c-index, which is quite common, and the tAUC, implemented in the timeROC package. randomSurvivalForest is deprecated, but the party package also supports survival objects. I don't know what happened to the survnnet package.
I've made some notes and added them to https://github.com/topepo/caret/tree/master/docs. Please comment/edit as you see fit.
Has any progress been made on this? I would be willing to make some contributions if not
There has not, so go crazy.
Alright, I had a quick read of your notes on the topic, I understand well enough to get started. Do you have a style guide/contribution guidelines or should I just try to match what's already there?
I would do your best to match. The style has "evolved" over the last 13 years and isn't entirely consistent.
I will focus on glmnet as a "case study" of sorts as it's the one I'm most familiar with, and once the framework is in place it should be straightforward to add more.
Other than the changes mentioned above to train and the performance/summary functions, I also need to change the model data (i.e., in this case, models/files/glmnet.R). Are there any other areas to look out for?
In general I'll search for any checks in the package for model type being "Regression" or "Classification" but I may not catch it all.
I would note that there is a length.Surv function in survival, so length is a non-issue.
Also, regarding the different predictions (predicted hazard vs. predicted survival time), I think it would be best to separate these into different summary functions, using one as the default depending on model type. Even with Cox regression, people are often more interested in predicted survival time.
Hi @topepo, some thoughts following a start at implementation:
Fundamentally, the major "engineering" problem here is Q1 as you posed it: "How should the user choose the problem type for survival models?". I would say the cleanest way is to add an argument to train, e.g. outcome = c("hazard", "time"). This would be ignored for all Regression and Classification types, and it could be part of each model's data as to whether they support that outcome type (a hypothetical call is sketched below).
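A hypothetical call, reusing the earlier lung training set (the outcome argument is the proposal, not an existing one):
train(Surv(time, status) ~ ph.ecog + age + sex,
      data = training,
      method = "glmnet",
      outcome = "hazard")   # proposed; ignored for Classification/Regression models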
The alternative would be to expose the modelType argument, but this would require a lot of refactoring and would be unnecessary for the existing Classification and Regression functionality.
For defaultSummary predicting time as an outcome, would the existing measures of accuracy for non-censored regression be sufficient? I am not familiar with parametric survival models and how they are appraised.
When predicting hazard, Uno's censoring-adjusted C-statistic (survAUC::UnoC) requires the observed survival, the predicted hazard, and the original/training Surv object. This would mean altering the interface of the Summary functions; would it be worthwhile?
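To make the interface problem concrete, a sketch of the call as I read the survAUC documentation (lp_pred is a hypothetical stand-in for the predicted risk scores):
library(survAUC)
train_surv <- with(training, Surv(time, status))
# needs the *training* Surv object to estimate the censoring distribution,
# plus the test Surv object and the predictions
UnoC(Surv.rsp = train_surv, Surv.rsp.new = testing_surv, lpnew = lp_pred)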
Hope I'm not catching you at a busy time, but I'd like your opinion on these issues before pushing forward. Thanks!
The more I look into this, the more I am inclined to write something for the newer packages that I'm working on (at least to scope things out before proceeding to caret).
> For defaultSummary predicting time as an outcome, would the existing measures of accuracy for non-censored regression be sufficient? I am not familiar with parametric survival models and how they are appraised.
I'll look into that a bit. Certainly things like Spearman's rank correlation are an option.
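As a rough sketch, reusing the lung example from earlier in the thread (events only, which is the obvious limitation):
events <- testing$status == 2   # lung coding: 2 = dead
cor(testing$time[events], testing_pred[events], method = "spearman", use = "complete.obs")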
> When predicting hazard, Uno's censoring-adjusted C-statistic (survAUC::UnoC) requires the observed survival, the predicted hazard, and the original/training Surv object. This would mean altering the interface of the Summary functions; would it be worthwhile?
Regardless, I will be writing a new package for measuring performance for different models, and the performance metrics might be better implemented there since it is a clean slate and the API won't be as complex.
Would you be interested in initially working on the performance metric part? My first thought would be to take a few examples and write some R Markdown documents that go through the analysis of those data, so that all the issues and APIs can be better scoped out.
That makes sense; from my initial work, the Summary functions are where the real changes will come in, and I don't think adding more complexity to caret itself is going to be beneficial in the long run.
Sure, I'll put together some documents over the coming weeks showing the different options and their availability/interfaces, similar to what you've put together previously. That should at least provide a foundation for the API.
Somers' D statistic is a good candidate for time outcomes.
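For what it's worth, Somers' D is a simple transform of the concordance, so it comes almost for free (a sketch using the lung objects from earlier in the thread):
# Somers' Dxy = 2 * C - 1
2 * survConcordance(testing_surv ~ testing_pred)$concordance - 1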
I think the "ranger" package has made some very good progress on many of the issues raised here (reg interface, evaluation metrics etc).
@Alanocallaghan do you still work on this? I don't have much experience with package development but would love to help with research etc.
It's been a while, but I think rank correlation and concordance/C/D statistics are the way to go. I think I have some R Markdown documents somewhere showing the use of some of these, which I can try to publish in the coming days. I'd welcome any help on the research side, though, which would basically entail trying to find a range of survival model types and showing the interface for each. I was also directed by Cross Validated to probability plotting, though I don't know if a formal measure can be easily obtained. I'm most definitely not an expert on parametric survival models.
In terms of implementation, I would need to have a look at yardstick before going any further, if the code is likely to be put there rather than in caret. I have a minimal working implementation for glmnet in caret but haven't gone any further since.
Looking back on this, I think that concordance statistics are the way to go. While rank correlations between all values are valid, they don't account for censoring. If censored values are ignored, concordance is equivalent to Kendall's tau. Further, since concordance/Kendall's tau is concerned only with the ordinal relationship, it is valid to use concordance for both hazard and time predictions, provided one inverts the order of the time predictions (see the sketch below).
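As a concrete illustration of the inversion, reusing the survreg example from earlier in the thread (where larger predictions mean longer predicted survival, hence lower risk; survConcordance treats its covariate as a risk score, as far as I can tell):
# negate the predicted times so that larger values indicate higher risk;
# I() keeps the negation from being parsed as formula syntax
survConcordance(testing_surv ~ I(-testing_pred))$concordance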
Regarding the prediction of survivor quantiles (mentioned here), I gather that this would be used to transform the problem into a classification problem. I would note that survAUC::AUC.uno appears to implement something similar. Again, if predicted time is used, then we can simply invert the ordering in order to obtain equivalent measures.
The initial implementation should thus calculate concordance, Uno's C, and Uno's AUC based on: the predicted hazard or predicted time, the training Surv object, and the test Surv object.
@topepo - if I were to set up a PR with a working implementation of this for a small number of models (a mix of outcome and hazard prediction), would you consider reviewing? I could then extend this to cover whichever models support censored regression.
I would be happy to add this to yardstick also, as the code to calculate these measures is quite simple. The first-pass implementation would be to take the Surv object for the training data and the vector of predictions (hazard or time), and return the concordance measure, with an option to invert to ensure concordance can be calculated properly for time or hazard predictions. Uno's C and AUC could be added in addition.
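A rough sketch of what that first pass might look like (the function name and interface here are hypothetical, not yardstick's API):
library(survival)
# concordance between a Surv outcome and a vector of predictions;
# invert = TRUE flips time predictions so that larger values mean higher risk
surv_concordance <- function(truth, estimate, invert = FALSE) {
  stopifnot(is.Surv(truth))
  if (invert) estimate <- -estimate
  survConcordance(truth ~ estimate)$concordance
}
# e.g. surv_concordance(testing_surv, testing_pred, invert = TRUE) for time predictions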
A recipe can be used to pass in the training Surv object, so the interface concern for Uno's adjusted measures is moot.
I should add that the pec package has implementations of the c-index as well as the survival Brier score, which I think is also very important for cases where one wants to derive exact survival probabilities.
I would like to contribute to this. Can someone lay out what the outstanding problems or things to be done are?
This is happening in tidymodels; see https://github.com/tidymodels/planning/tree/master/survival-analysis
Some things that would have to change (a sketch follows the list):
- train has a component called modelType that is currently either "Regression" or "Classification". Another option, "Survival", would need to be added, and a search should be made for any error traps or checks on this value.
- An is.Surv check should be added to make sure the outcome is correct.
- The same goes for postResample too.
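For example, a survival-aware postResample might branch like this (a minimal sketch; postResampleSurv is a hypothetical name, not the actual change):
library(survival)
postResampleSurv <- function(pred, obs) {
  if (is.Surv(obs)) {
    # concordance as the survival analogue of Accuracy/RMSE
    c(C = survConcordance(obs ~ pred)$concordance)
  } else {
    caret::postResample(pred, obs)
  }
}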