sebp / scikit-survival

Survival analysis built on top of scikit-learn
GNU General Public License v3.0

Implement feature_importances_ in sksurv.ensemble.RandomSurvivalForest #140

Open mtomaszewski95 opened 4 years ago

mtomaszewski95 commented 4 years ago

Implement feature_importances_ in sksurv.ensemble.RandomSurvivalForest. Examples:
https://cran.r-project.org/web/packages/randomForestSRC/randomForestSRC.pdf
https://square.github.io/pysurvival/models/random_survival_forest.html
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6364686/

sebp commented 4 years ago

Feature importances based on node/split statistics are rather flawed (see e.g. this paper). Therefore, I'm hesitant to implement this feature. In particular, you can already compute permutation-based feature importance via ELI5. It is more expensive to compute, but has better properties.

funnell commented 1 year ago

My vote would be for adding the feature, at the very least for compatibility with scikit-learn.

sebp commented 1 year ago

sklearn has https://scikit-learn.org/stable/modules/generated/sklearn.inspection.permutation_importance.html#sklearn.inspection.permutation_importance now, which is the much better option.

funnell commented 1 year ago

Yes, thanks! I understand your point of view, and that there are alternative ways to compute importance. Even if it's not an ideal algorithm, it can still be nice to have. Some things presume feature_importances_ is available (e.g. RFECV), and not having it might add a little friction for new scikit-survival users already familiar with scikit-learn. It's also a lot faster, which can be helpful during early iteration.

Thanks for the package and thanks for considering! :)

anwurl commented 10 months ago

I also have a use-case where I am only interested in which features are used or not used. For that, the feature importances based on node/split statistics could do the job and would be quick to calculate. In contrast, the calculation of permutation feature importances takes so much longer.

Thanks a lot for this package and your work.

sebp commented 10 months ago

Feature importances based on split criteria have been requested in the past. Unfortunately, the way sklearn implemented feature importance in the tree-growing algorithm doesn't work with the log-rank criterion used to grow the survival tree. The log-rank criterion measures the quality of the split, but sklearn assumes feature importance measures the purity of a node.
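For context, the scikit-learn documentation defines the weighted impurity decrease of a split at node t as shown below, and describes feature_importances_ as the normalized total reduction of the criterion brought by each feature. The notation is an assumption chosen here: N is the total number of samples, N_t the number at the node, N_{t_L} and N_{t_R} the numbers in the left and right child, and I the node impurity.

```latex
\Delta(t) = \frac{N_t}{N} \left( I(t) - \frac{N_{t_R}}{N_t}\, I(t_R) - \frac{N_{t_L}}{N_t}\, I(t_L) \right)
```

Every term needs an impurity value for a single node. A log-rank statistic instead compares the two children of a candidate split, so it does not supply the per-node quantity I(t) that this formula relies on.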