sebp / scikit-survival

Survival analysis built on top of scikit-learn
GNU General Public License v3.0

Add criterion to sksurv.ensemble.RandomSurvivalForest #108

Open arturomoncadatorres opened 4 years ago

arturomoncadatorres commented 4 years ago

It would be fantastic to have criterion (i.e., the function to measure the quality of a split) as a parameter of RandomSurvivalForest. I know that currently only the log-rank splitting rule is supported; for now, it could be the default (and only) option. In the future, this could be expanded to cover other rules (for example, those from the original paper: conservation, log_rank_score_rule, log_rank_random), changing the corresponding splitting code as well. This would also make RandomSurvivalForest more similar to its scikit-learn counterparts (e.g., RandomForestRegressor), making it (even) more compatible with other packages that build on scikit-learn's standard structure.

I think this could be done easily in forest.py:

    def __init__(self,
                 n_estimators=100,
                 #-->
                 criterion="log_rank",
                 #-->
                 max_depth=None,
                 min_samples_split=6,
                 min_samples_leaf=3,
                 min_weight_fraction_leaf=0.,
                 max_features="auto",
                 max_leaf_nodes=None,
                 bootstrap=True,
                 oob_score=False,
                 n_jobs=None,
                 random_state=None,
                 verbose=0,
                 warm_start=False):
        super().__init__(
            base_estimator=SurvivalTree(),
            n_estimators=n_estimators,
            #-->
            # "criterion" is listed in estimator_params so that BaseForest
            # forwards it to each SurvivalTree; BaseForest.__init__ itself
            # does not accept a criterion argument
            #-->
            estimator_params=("criterion",
                              "max_depth",
                              "min_samples_split",
                              "min_samples_leaf",
                              "min_weight_fraction_leaf",
                              "max_features",
                              "max_leaf_nodes",
                              "random_state"),
            bootstrap=bootstrap,
            oob_score=oob_score,
            n_jobs=n_jobs,
            random_state=random_state,
            verbose=verbose,
            warm_start=warm_start)

        #-->
        self.criterion = criterion
        #-->
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.min_samples_leaf = min_samples_leaf
        self.min_weight_fraction_leaf = min_weight_fraction_leaf
        self.max_features = max_features
        self.max_leaf_nodes = max_leaf_nodes
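For reference, the default log_rank rule scores a candidate split by the two-sample log-rank statistic between the samples sent to each child node. A minimal standalone NumPy sketch of that statistic (the function name is hypothetical; this is not scikit-survival's internal implementation):

```python
import numpy as np

def logrank_statistic(time, event, group):
    """Standardized two-sample log-rank statistic.

    time  : observed times
    event : 1 if the event occurred, 0 if censored
    group : boolean mask, True for samples sent to (say) the left child
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=int)
    group = np.asarray(group, dtype=bool)

    observed_minus_expected = 0.0
    variance = 0.0
    for t in np.unique(time[event == 1]):          # distinct event times
        at_risk = time >= t
        n = at_risk.sum()                          # total number at risk
        n1 = (at_risk & group).sum()               # at risk in group 1
        d = (event & (time == t)).sum()            # events at time t
        d1 = (event & (time == t) & group).sum()   # events at t in group 1
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return observed_minus_expected / np.sqrt(variance)
```

A splitter following the original paper would evaluate this for every candidate threshold and pick the split maximizing the absolute statistic.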

If you think this is something that might be interesting, I would be more than happy to help with a proper PR.

james-sexton96 commented 1 year ago

Since this was posted, there is a growing body of literature suggesting that the time-varying nature of some features necessitates alternative splitting strategies in RSFs.

Having only a single strategy (log-rank), which is subject to some of the same proportional-hazards assumptions as a Cox regression, might defeat the purpose of a model ideally suited to non-linear problems.

Having at least one alternative option like a Poisson regression log-likelihood could offer an intermediate solution before open-ended splitting strategies become available.
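As a rough sketch of what such a criterion could look like, assuming a piecewise-exponential model where each node fits a single constant hazard rate (as in the RF-SLAM approach mentioned below; function names are hypothetical, not an existing API):

```python
import numpy as np

def poisson_loglik(events, exposure):
    """Maximized Poisson log-likelihood of a node under a constant hazard.

    events   : event count per sample (0/1 for standard survival data)
    exposure : follow-up time per sample
    Terms that do not depend on the split are dropped.
    """
    d, t = np.sum(events), np.sum(exposure)
    if d == 0:
        return 0.0              # limiting value as the estimated rate -> 0
    rate = d / t                # node-level MLE of the hazard rate
    return d * np.log(rate) - rate * t

def split_improvement(events, exposure, go_left):
    """Gain in log-likelihood from splitting a node into two children."""
    events = np.asarray(events)
    exposure = np.asarray(exposure, dtype=float)
    go_left = np.asarray(go_left, dtype=bool)
    parent = poisson_loglik(events, exposure)
    left = poisson_loglik(events[go_left], exposure[go_left])
    right = poisson_loglik(events[~go_left], exposure[~go_left])
    return left + right - parent
```

A splitter would choose the threshold maximizing this gain; the gain is always non-negative at the maximizing split because the children can only fit the data at least as well as the parent.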

See the following examples of varying splitting strategies:

sebp commented 1 year ago

@james-sexton96 The range of options for the splitting rule in the literature is quite large. I haven't followed it closely over the last couple of years, so I'm not sure whether a consensus has emerged by now. Conditional inference forests would definitely be interesting (see #341).

Do you have a reference for the Poisson regression log-likelihood you mentioned?

james-sexton96 commented 1 year ago

@sebp Sure thing. See references below.

A Poisson regression log-likelihood is well suited to real-world data, as opposed to data with structured follow-up. There was an attempt to branch the R package randomForestSRC's survival functionality (the RF-SLAM paper by Wongvibulsin et al. below). However, both this branch and the original package appear to be unsupported.

It would be nice to mirror sklearn's RandomForestRegressor parameters by including a kwarg for criterion, and if I have time, I can draft an implementation of a Poisson split criterion!

Crowther et al. 2012
Austin P. 2017
Wongvibulsin et al. 2019

james-sexton96 commented 1 year ago

See also: the Poisson criterion added to scikit-learn's tree-based regressors.
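For context, scikit-learn's RandomForestRegressor accepts criterion="poisson" (since version 1.0; targets must be non-negative counts), which could serve as a template for the parameter's shape here:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = rng.poisson(lam=np.exp(X[:, 0]))   # non-negative count targets

# use the Poisson deviance as the splitting criterion
forest = RandomForestRegressor(criterion="poisson", random_state=0)
forest.fit(X, y)
print(forest.predict(X[:5]))
```

Predictions are averages of non-negative leaf means, so they stay non-negative, which is the behavior a rate/hazard-style criterion would want.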